Today’s episode, given the gravity of the event, could of course be about none other than OpenAI’s new o1 series of models, which represent a tremendous leap forward in AI capabilities.
So far, OpenAI has released o1-preview and o1-mini. Unless otherwise stated in this episode, I’m going to be talking about o1-preview, which is now (in my view, unquestionably) the state of the art among publicly available AI models. o1-mini, on the other hand, is a smaller model (and therefore about 80% cheaper to run) that was trained with the same protocol as o1.
In a nutshell, and as detailed last year in Episode #740 on OpenAI’s Q* project (later renamed “Strawberry”), the o1 large language model was trained with reinforcement learning to “think” before responding via a private “chain of thought”. Working through problems slowly and carefully like this is analogous to the slow “System 2” thinking popularized by the Nobel-prize-winning psychologist Daniel Kahneman in his book “Thinking, Fast and Slow”. Previously, all of the top public LLMs operated solely in a mode more like human “System 1” thinking: fast and intuitive, like speaking without careful consideration. Slow “System 2” thinking, like working through a challenging math problem step by step with pencil and paper, allows OpenAI’s new o1 model to iteratively refine its outputs, try out different strategies, and even recognize and correct its own mistakes.
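To make that private chain of thought concrete: when you call o1-preview through the OpenAI API, the reasoning itself is never returned to you, but the response does report how many hidden “reasoning tokens” the model spent before answering. Here’s a minimal sketch using the official openai Python package (the usage fields are as of the o1-preview launch, so treat the exact names as version-dependent):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

# At launch, o1-preview accepts only user/assistant messages (no system prompt)
response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

print(response.choices[0].message.content)  # the final, visible answer
# The chain of thought stays private, but its size is reported (and billed):
print(response.usage.completion_tokens_details.reasoning_tokens)
```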
What’s crazy about this is that the longer the o1 model “thinks”, the better it does on complex tasks. As pointed out by Dr. Noam Brown (one of OpenAI’s o1 researchers and our guest in Episode #569) in an excellent Tweet thread, this provides a whole new dimension for scaling AI models. Previously, we could scale by increasing the amount of high-quality training data, increasing the number of model parameters, or increasing training compute. Now, “thinking” time during inference can be scaled up too. This means that, while the o1 models available today “think” on the scale of seconds when preparing a response for you, the research team at OpenAI is aiming for future versions that “think” for hours, days, or weeks before responding.
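o1’s search happens inside a single private chain of thought, so we can’t script it directly, but the basic trade-off of spending more inference compute for better answers is easy to illustrate. Below is a self-contained Python sketch of the simplest public analogue, self-consistency: sample an answer many times and take a majority vote. This is a simulation (a stand-in function, not a real model or o1’s actual mechanism), just to show accuracy climbing as inference compute grows.

```python
import random
from collections import Counter

def simulated_model() -> str:
    # Hypothetical stand-in for one sampled LLM answer:
    # correct ("42") 60% of the time, otherwise a plausible wrong answer.
    return "42" if random.random() < 0.6 else random.choice(["41", "43", "44"])

def majority_vote(n_samples: int) -> str:
    # More samples = more inference compute spent on the same question.
    answers = [simulated_model() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

for n in (1, 5, 25, 125):
    trials = 1000
    accuracy = sum(majority_vote(n) == "42" for _ in range(trials)) / trials
    print(f"{n:>3} samples per question -> accuracy ~{accuracy:.2f}")
```

Running this, accuracy rises from roughly 0.6 with one sample toward 1.0 with 125, which is Noam Brown’s point in miniature: compute at inference time, not just at training time, becomes an axis you can scale.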
This dimension of scaling up inference time will of course scale up the cost of inference, but for high-impact outcomes — a new cancer drug, a breakthrough on nuclear fusion, mathematical proofs that humans haven’t been able to crack — that higher inference cost would be well worth it. With implications for the singularity, longer “thinking” times will also surely lead to AI models like o1 contributing to the development of even better AI models, creating a positive feedback loop that could accelerate shockingly rapidly.
In terms of o1’s capabilities this very day, it’s critical to note that o1 doesn’t always perform better than the other leading LLMs like OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, or Google’s Gemini. That’s because tasks like chat conversations, composing an email, or editing a paper don’t typically require slow “System 2” thinking; they can be done with fast, intuitive “System 1” thinking. Where o1 excels, instead, is on the kinds of tasks that you’d need to deliberate on before blurting out a response, such as writing intricate computer code, performing data analysis, or solving math problems.
On these complex tasks, o1 is comparatively unreal… like, way better than anything else out there; it’s a really big deal. On the usual benchmarks we’ve seen in recent years, like MMLU and high-school Advanced Placement (AP) exams, o1 offers improvements; on MMLU categories like math and logic, as well as on AP exams like physics and calculus, the improvement of o1 over GPT-4o is huge. Perhaps most mind-blowingly of all, according to OpenAI’s own evaluations, o1 performs comparably to PhD students on specific questions in physics, chemistry, and biology.
As a striking demonstration of what’s to come through further scaling, OpenAI also teased us with preliminary results from an o1 model that is still in development. In competitive programming, this in-development o1 model ranked in the 89th percentile on Codeforces questions, compared to the 62nd percentile for the o1-preview model that’s publicly available today and a lowly 11th percentile for GPT-4o. Likewise, on a qualifying exam for the International Mathematics Olympiad, OpenAI reports that the forthcoming o1 model scored 83%, the publicly available o1-preview scored 62%, and the once-mighty, now-suddenly-humble-looking GPT-4o scored only 13%.
Now, it is of course important to approach these claims with a healthy dose of skepticism because AI benchmarks can be unreliable and easy to game. In this particular case, however, on many complex tasks I tested personally, the delta between o1 and any other text-generating model available today is so vast that I am confident evangelizing to you that o1 is a serious game-changer; the difference in capabilities is night and day, just like the jump from GPT-3.5 to GPT-4 was last year.
One interesting demonstration of o1’s abilities is its capacity to count the number of R’s in the word “strawberry”, a task that has stumped many previous language models because their tokenizers split text into multi-character chunks rather than individual letters. This might seem trivial, but it shows how o1’s deliberate, step-by-step reasoning lets it work around a limitation baked into how these models read text.
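If you want to see the tokenization problem for yourself, the tiktoken package shows how a GPT-style model actually “sees” the word: as a couple of multi-character chunks rather than eleven individual letters. A quick sketch (the exact split depends on the encoding):

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")  # the o200k_base encoding
tokens = enc.encode("strawberry")
print([enc.decode([t]) for t in tokens])  # multi-character chunks, not letters

# Character-level counting is trivial in code, hard for a token-based model:
print("strawberry".count("r"))  # 3
```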
Given these terrific capabilities, there’s also the risk of o1 being terrifying in the wrong hands. So, on the safety front, OpenAI claims to have developed a new training approach that leverages the model’s reasoning capabilities to better adhere to safety and alignment guidelines. They report significant improvements in the model’s ability to resist “jailbreaking” attempts: for example, while GPT-4o scored 22 out of 100 on a particularly stringent internal jailbreaking test, o1 scored 84 out of 100, ostensibly flipping these tricky jailbreaks from usually succeeding to rarely succeeding.
Finally, on the note of machine consciousness, which always seems to come up whenever a markedly more capable AI model is released, it’s crucial to maintain perspective. While I’ve used anthropomorphizing language in this episode (and lots of people in the industry do), don’t forget that these AI systems don’t actually think or reason. The underlying computational mechanisms are the same as in your calculator or the spell-checker on your computer; we’re simply figuring out how to leverage these non-conscious computational processes in increasingly nuanced and powerful ways.
Access today:
- ChatGPT Plus or Team subscribers:
  - o1-preview: 50 messages per week
  - o1-mini: 50 messages per day
- Tier 5 developers using the OpenAI API:
  - o1-preview: 100 requests per minute
  - o1-mini: 250 requests per minute
Personal preferences on text-generating AI models:
- Claude 3.5 Sonnet for most everyday or creative tasks
- o1 for tasks that require slow, detailed “System 2”-style thinking, like math and complex programming questions
- Google Gemini for anything that requires a huge context window, like passing in large audio or video files
So, in conclusion, OpenAI’s o1 model is a really big deal. It is, in my view, unquestionably the state of the art in AI capabilities today. And, thanks to the potential of scaling “thinking” time at inference, we are not far off from even more staggering and world-changing AI models. Already, o1 demonstrates that we can have PhD-level thinkers in specific fields working 24/7 on tricky problems and, while expensive today, these costs should become very cheap very fast. As thinking time scales up further, these cheap, abundant sources of intelligence may soon have reasoning capabilities far beyond PhD students. With models like these humming away 24/7 all around the world on our most pressing problems, technological progress may accelerate toward AGI and the Singularity surprisingly soon.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.