In recent weeks, I’m sure you’ve noticed that there’s been a ton of excitement over DeepSeek, a Chinese A.I. company that was spun out of a Chinese hedge fund just two years ago. DeepSeek’s “V3” chatbot-style model caught the world’s attention because it performs on par with state-of-the-art models like OpenAI’s GPT-4o and Google’s Gemini-2.0-Flash, but it was DeepSeek’s OpenAI-o1-like reasoning model (which you can hear more about in Episode #820) that caused huge economic disruption, with Nvidia’s share price falling 17% and the Nasdaq dropping several percent last Monday. At the time of writing, DeepSeek’s “r1” reasoning model is statistically tied (at the 95% confidence interval) for first place on the overall LM Arena leaderboard (hear all about this great LLM leaderboard in Episode #707 with Prof. Joey Gonzalez) with GPT-4o and Gemini-2.0-Flash. This caught global attention, first, because DeepSeek is an obscure Chinese company while all the previous top models were devised by American (specifically, Bay Area) tech giants. More consequentially than even great-power geopolitics, however, DeepSeek’s r1 caused a global economic tsunami because it is comparable in performance to the best OpenAI, Google and Anthropic models while costing a fraction as much to train.
There are all kinds of complexities, externalities and estimates to take into account when comparing training costs between LLMs from two different companies (for example, what about the cost of training runs that didn’t pan out?), but speaking in rough approximations, training a single DeepSeek V3 or r1 model appears to cost on the order of millions of dollars, while training a state-of-the-art Bay Area model like o1, Gemini or Claude 3.5 Sonnet reportedly costs on the order of hundreds of millions of dollars: roughly 100x more.
As I’ve stated on this show several times, even without conceptual, scientific breakthroughs, simply scaling up the transformer architecture that underlies o1, Gemini and Claude (by increasing training-dataset size, increasing the number of model parameters, increasing training-time compute or, in the case of reasoning models like o1, increasing inference-time compute) will lead to impressive LLM improvements that overtake more and more humans on cognitive tasks and move machines in the direction of AGI (check out Episodes #748 and #820 for more on this). Implicit in this statement is that, if researchers can devise major conceptual, scientific breakthroughs in how machines learn, we could accelerate toward AGI even more rapidly.
If conceptual breakthroughs in A.I. model development allow machines to improve their cognitive capabilities while also learning more efficiently, this would reduce server-farm energy consumption, the loss of freshwater to server cooling and, of course, plain old financial cost. DeepSeek achieved such a conceptual breakthrough by combining a number of existing ideas, like Mixture-of-Experts models (learn about these in Episode #778), with brand-new, major efficiencies such as DualPipe, a GPU-communications scheduler that orchestrates how data pass between the roughly two thousand GPUs DeepSeek appears to have trained r1 with to get the breathtaking results they did.
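To make the Mixture-of-Experts idea a little more concrete, here is a tiny, illustrative sketch of top-k expert routing in PyTorch. To be clear, this is a toy for building intuition, not DeepSeek’s actual architecture: every layer size, expert count and variable name below is made up for the example.

```python
# A toy sketch of top-k Mixture-of-Experts routing, for intuition only.
# This is NOT DeepSeek's implementation; all sizes and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # router: scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model),
                          nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        weights = F.softmax(self.gate(x), dim=-1)        # routing probabilities
        top_w, top_i = weights.topk(self.top_k, dim=-1)  # keep only the top-k experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)  # renormalize their weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_i[:, k] == e  # tokens whose k-th choice is expert e
                if mask.any():           # only those tokens visit this expert
                    out[mask] += top_w[mask][:, k:k + 1] * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```

The efficiency point is that each token flows through only a couple of the experts rather than through the whole network, so the model can have a huge total parameter count while only a small fraction of those parameters is active for any given token.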
Now, 2,000 GPUs might sound like a lot, but it’s again only about 1% of the number of chips that Meta’s Mark Zuckerberg and xAI’s Elon Musk brag about procuring in a given year for potentially training a single, ever-larger next LLM. I won’t go further into the technical details of the DeepSeek models in this episode, but if you’d like to dig into the technical aspects more deeply, I’ve provided links to DeepSeek’s full r1 paper as well as an exceptionally detailed, well-written blog post on the tech news site The Next Platform.
Moving beyond technical aspects to geopolitics, DeepSeek’s success demonstrates that American sanctions preventing Chinese firms from accessing the latest, most powerful Nvidia chips have been ineffective. These sanctions were explicitly designed to prevent China from overtaking the US on the road to AGI (particularly given the military implications of having access to a machine that could far exceed human cognitive capabilities), but now a Chinese firm has figured out how to approach US firms’ AI capabilities with ~1% of the quantity of chips, at ~1% of the cost, and using less-capable Nvidia chips than American firms have access to. (In a separate quandary for the Chinese Communist Party, they’d probably prefer for geopolitical reasons that DeepSeek’s intellectual property be kept proprietary; yet DeepSeek graciously open-sourced their work for the world to leverage, advancing both AI research and AI application development.)
All of the DeepSeek V3 and r1 source code and model weights are available on GitHub and can be used under a highly permissive MIT license. All aspects of proprietary models like those from OpenAI, Google, Anthropic and xAI are, well, proprietary, so that’s another big positive for the AI community from the folks at DeepSeek. This level of openness goes far beyond so-called “open” LLMs like Meta’s Llama family: Meta provides model weights but not source code, and Meta’s unusual license includes constraints such as limiting Llama usage to companies with fewer than 700 million monthly active users.
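To give you a feel for how usable this openness is, here’s a minimal sketch of loading one of DeepSeek’s open-weight checkpoints with the Hugging Face transformers library, where the weights are also published. The full r1 model is far too large to run on a laptop, so I’ve named one of the small distilled checkpoints instead; treat that model ID as an assumption on my part and confirm the exact name on DeepSeek’s Hugging Face page before running this.

```python
# A minimal sketch, assuming the distilled checkpoint ID below is correct;
# verify the exact model name on DeepSeek's Hugging Face page first.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # assumed ID; verify
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What does a permissive MIT license allow me to do?",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```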
Beyond open-sourcing their models, DeepSeek also created an iOS app (it was #1 in the Apple App Store at the time of recording this episode), but I would caution you against using the DeepSeek app because, per the app’s privacy policy, anything you input into it is collected by the company and stored on servers in China. If you’d like to use a DeepSeek model privately but don’t want to spend the time or money to download the raw model weights and stand up your own inference setup, you can use a platform like Ollama to run an r1 model locally on your own hardware instead.
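For instance, here’s roughly what that looks like in Python with Ollama’s client library, assuming you’ve installed Ollama and already pulled one of the distilled r1 variants (for example, by running `ollama pull deepseek-r1:7b` in a terminal). The model tag and prompt below are illustrative; check Ollama’s model library for exactly which DeepSeek variants are available.

```python
# A minimal sketch, assuming Ollama is installed and a distilled r1 variant
# has been pulled locally; the model tag below is illustrative.
import ollama  # pip install ollama

response = ollama.chat(
    model="deepseek-r1:7b",  # runs entirely on your own hardware
    messages=[{"role": "user",
               "content": "Briefly explain why Mixture-of-Experts models are efficient."}],
)
print(response["message"]["content"])  # prints the model's reply
```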
Ok, so hopefully you’re excited that you now have untethered access to state-of-the-art A.I. capabilities, but that should only be the beginning of your excitement. Markedly more efficient LLM training makes the recent $6B raises by OpenAI, xAI and Anthropic (much of which would have been earmarked for training ever-larger transformer architectures over ever-longer inference times) look like they may no longer be well-allocated. The DeepSeek release also ended up being coincidentally but nevertheless comically timed with the announcement of the $500B Stargate A.I.-infrastructure project (announced by the CEOs of OpenAI, Oracle and SoftBank alongside Donald Trump), an enormous figure that probably only made sense when bean counters assumed LLMs would keep growing by orders of magnitude in the coming years. And, correspondingly, Nvidia’s share price took a 17% hit in a single day (at the time of writing, some of this had recovered) as shareholders realized the LLM size increases they’d baked into projected GPU orders may no longer come to fruition.
But for most of us (certainly for me, and probably for most listeners), markedly more efficient LLM training, plus a return to the open-source ethos that dominated A.I. model research until just a few years ago, is fabulous news. Increased LLM efficiency means fewer of the environmental issues associated with A.I., and it means that developing, training and running A.I. models is more economical; therefore, building practical A.I. applications becomes cheaper and their benefits become more widely available around the world. These are exciting times indeed. Dream up something big and make it happen! There’s never been an opportunity to make an impact like there is today.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.