{"id":31705,"date":"2025-01-28T00:29:00","date_gmt":"2025-01-28T00:29:00","guid":{"rendered":"https:\/\/www.duck9.com\/blog\/?p=31705"},"modified":"2025-01-28T00:51:10","modified_gmt":"2025-01-28T04:51:10","slug":"auditing-deep-seek-ai","status":"publish","type":"post","link":"https:\/\/www.duck9.com\/blog\/auditing-deep-seek-ai\/","title":{"rendered":"Auditing Deep Seek AI"},"content":{"rendered":"<div class=\"postie-post\">\n<div>\n<div dir=\"ltr\"><img decoding=\"async\" alt=\"image3.jpeg\" src=\"https:\/\/www.duck9.com\/wp-content\/uploads\/2025\/01\/image3-19.jpeg\"><a href=\"https:\/\/x.com\/coryklippsten\/status\/1883933328279232866?s=43&amp;t=NipKy21fekvPoZS5MA8-lQ\"><\/p>\n<table cellpadding=\"0\" cellspacing=\"0\" border=\"0\" style=\"border:1px solid #ccd6dd; border-radius: 12px;\" width=\"500\" bgcolor=\"#ffffff\">\n<tbody>\n<tr>\n<td colspan=\"3\" style=\"font-size: 0px; line-height: 0px;\" height=\"12\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td width=\"18\" style=\"font-size: 0px; line-height: 0px; min-width: 18px;\">&nbsp;<\/td>\n<td>\n<table cellpadding=\"0\" cellspacing=\"0\" border=\"0\" width=\"464\" align=\"left\">\n<tbody>\n<tr valign=\"top\">\n<td width=\"48\" valign=\"top\"><a href=\"https:\/\/x.com\/coryklippsten?s=43\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/pbs.twimg.com\/profile_images\/1778426409763082240\/7ZliIwoU_normal.jpg\" style=\"border-radius: 50%; padding: 0px;\" height=\"48\" width=\"48\" data-unique-identifier=\"\"><\/a><\/td>\n<td width=\"8\" style=\"font-size: 0px; line-height: 0px; min-width:8px;\"><img decoding=\"async\" src=\"https:\/\/ea.twimg.com\/email\/self_serve\/media\/spacer.png\" width=\"8\" data-unique-identifier=\"\"><\/td>\n<td valign=\"middle\" width=\"388\" style=\"min-width: 388px;\">\n<table cellpadding=\"0\" cellspacing=\"0\" border=\"0\" align=\"left\" width=\"388\">\n<tbody>\n<tr>\n<td align=\"left\" width=\"388\"><b><a href=\"https:\/\/x.com\/coryklippsten?s=43\" style=\"font-family: Helvetica, 
Arial, san-serif; font-size: 14px; line-height: 18px; color: #292c2f; text-decoration: none;\">Cory Klippsten &#x1f9a2; Swan.com<\/a><\/b><\/td>\n<\/tr>\n<tr>\n<td align=\"left\"><a href=\"https:\/\/x.com\/coryklippsten?s=43\" style=\"font-family: Helvetica, Arial, san-serif; font-size: 14px; line-height: 18px; text-decoration: none; color: #7e8c98;\">\u2066\u202a@coryklippsten\u202c\u2069<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/td>\n<td valign=\"top\" width=\"20\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/ea.twimg.com\/email\/self_serve\/media\/logo_twitter-1497383721365.png\" height=\"20\" width=\"24\" data-unique-identifier=\"\"><\/td>\n<\/tr>\n<tr>\n<td height=\"9\" colspan=\"4\" style=\"font-size: 0px; line-height:0px;\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/ea.twimg.com\/self_serve\/media\/spacer_464x1-1582829598167.png\" width=\"464\" height=\"1\" data-unique-identifier=\"\"><\/td>\n<\/tr>\n<tr>\n<td colspan=\"4\" style=\"font-family: Helvetica, Arial, san-serif;color: #292c2f; font-size: 18px; line-height: 24px; text-decoration: none;\">&#8220;Deepseek obviously has way more than 2048 H800s; one of their earlier papers referenced a cluster of 10k A100s.<\/p>\n<p>An equivalently smart team can\u2019t just spin up a 2000 GPU cluster and train r1 from scratch with $6m.&#8221;<\/td>\n<\/tr>\n<tr>\n<td height=\"3\" colspan=\"4\" style=\"font-size: 0px; line-height:0px;\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td colspan=\"4\"><a href=\"https:\/\/x.com\/coryklippsten\/status\/1883933328279232866?s=43&amp;t=NipKy21fekvPoZS5MA8-lQ\" style=\"font-family: Helvetica, Arial, san-serif;color: #667785; font-size: 14px; line-height: 18px; text-decoration:none;\">1\/27\/25, 11:41\u202fAM<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/td>\n<td width=\"18\" style=\"font-size: 0px; line-height: 0px; min-width: 18px;\">&nbsp;<\/td>\n<\/tr>\n<tr>\n<td colspan=\"3\" style=\"font-size: 0px; line-height: 0px;\" 
height=\"12\">&nbsp;<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p><\/a><\/div>\n<div>1) DeepSeek r1 is real, with important nuances. &nbsp;Most important is that r1 is so much cheaper and more efficient at inference than o1; the significance does not come from the $6m training figure. &nbsp;r1 costs 93% less to *use* than o1 at each model\u2019s API pricing, can be run locally on a high-end workstation and does not seem to have hit any rate limits, which is wild. &nbsp;Simple math is that every 1b active parameters require 1 GB of RAM in FP8, so r1\u2019s 37b active parameters require 37 GB of RAM. &nbsp;Batching massively lowers costs and more compute increases tokens\/second, so there are still advantages to inference in the cloud. &nbsp;I would also note that there are real geopolitical dynamics at play here, and I don\u2019t think it is a coincidence that this came out right after \u201cStargate.\u201d &nbsp;RIP, $500 billion &#8211; we hardly even knew you.<\/div>\n<div><\/div>\n<div>Real: &nbsp;1) It is\/was the #1 download in the relevant App Store category. &nbsp;That puts it ahead of ChatGPT, something neither Gemini nor Claude was able to accomplish. &nbsp;2) It is comparable to o1 in quality, although it lags o3. &nbsp;3) There were real algorithmic breakthroughs that made it dramatically more efficient for both training and inference. &nbsp;Training in FP8, MLA (multi-head latent attention) and multi-token prediction are significant. &nbsp;4) It is easy to verify that the r1 training run only cost $6m. &nbsp;While that figure is literally true, it is also *deeply* misleading. &nbsp;5) Even their hardware architecture is novel, and I will note that they use PCI-Express for scale-up.<\/div>\n<div><\/div>\n<div>Nuance: &nbsp;1) The $6m does not include \u201ccosts associated with prior research and ablation experiments on architectures, algorithms and data\u201d per the technical paper. &nbsp;\u201cOther than that, Mrs. 
Lincoln, how was the play?\u201d &nbsp;This means that it is possible to train an r1-quality model with a $6m run *if* a lab has already spent hundreds of millions of dollars on prior research and has access to much larger clusters. &nbsp;DeepSeek obviously has way more than 2048 H800s; one of their earlier papers referenced a cluster of 10k A100s. &nbsp;An equivalently smart team can\u2019t just spin up a 2000-GPU cluster and train r1 from scratch with $6m. &nbsp;Roughly 20% of Nvidia\u2019s revenue goes through Singapore. &nbsp;20% of Nvidia\u2019s GPUs are probably not in Singapore despite their best efforts. &nbsp;2) There was a lot of distillation &#8211; i.e., it is unlikely they could have trained this without unhindered access to GPT-4o and o1. &nbsp;As @altcap pointed out to me yesterday, it is kind of funny to restrict access to leading-edge GPUs and not do anything about China\u2019s ability to distill leading-edge American models &#8211; that obviously defeats the purpose of the export restrictions. 
&nbsp;Why buy the cow when you can get the milk for free?<\/div>\n<div><\/div>\n<div><img decoding=\"async\" alt=\"image0.png\" src=\"https:\/\/www.duck9.com\/wp-content\/uploads\/2025\/01\/image0-17.png\"><img decoding=\"async\" alt=\"image1.png\" src=\"https:\/\/www.duck9.com\/wp-content\/uploads\/2025\/01\/image1-10.png\"><img decoding=\"async\" alt=\"image2.png\" src=\"https:\/\/www.duck9.com\/wp-content\/uploads\/2025\/01\/image2-16.png\"><\/div>\n<div dir=\"ltr\">\n<div dir=\"ltr\">\n<div><font color=\"#000000\"><span style=\"caret-color: rgb(0, 0, 0); background-color: rgba(255, 255, 255, 0);\">https:\/\/www.YouTube.com\/watch?v=ejeIz4EhoJ0<\/span><\/font><\/div>\n<div><span style=\"font-size: 13pt;\"><br \/><\/span><\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; &nbsp; Cory Klippsten &#x1f9a2; Swan.com \u2066\u202a@coryklippsten\u202c\u2069 &#8220;Deepseek obviously has way more than 2048 H800s; one of their earlier papers referenced a cluster of 10k A100s. 
An equivalently smart team can\u2019t just spin up a 2000 GPU cluster and train r1 from scratch with $6m.&#8221; &nbsp; 1\/27\/25, 11:41\u202fAM &nbsp; &nbsp; 1) DeepSeek r1 is real [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":31706,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-31705","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"post_mailing_queue_ids":[],"_links":{"self":[{"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/posts\/31705","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/comments?post=31705"}],"version-history":[{"count":0,"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/posts\/31705\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/media\/31706"}],"wp:attachment":[{"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/media?parent=31705"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/categories?post=31705"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.duck9.com\/blog\/wp-json\/wp\/v2\/tags?post=31705"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}