OpenAI just claimed the first dramatic "jump" in LLM capabilities since GPT-4. The first public evaluation of o3 broke SOTA by a large margin across multiple "hard" benchmarks. The most publicized one was ARC-AGI, a set that is hard for LLMs but relatively easy for humans and, likely, VLMs. ARC-AGI is a collection of visual puzzles where you have to intuitively fill in missing parts of a grid. You can think of it as Sudoku with pixels.

The most interesting part was not the results themselves but the speed of improvement: o3 was trained just a handful of months after o1. As Ilya Sutskever said at NeurIPS, the "data wall" is just a limitation of current architectures.

So what is o3 made of? As usual, OpenAI didn't disclose it, and it isn't even certain there are significant changes from o1. The overall picture is inference scaling: for each hard problem, o3 generated as many as 60 million tokens, which makes it the most expensive test ever run (about 1.6 million dollars for all the questions, and yes, you can train many good LLMs for that amount of compute). Now, what does this uncover?

* The model's "thoughts": writing extensively in the background about the question at hand, formulating different hypotheses and internal tests before finally settling on the most likely solution. This part is mostly settled during post-training, with a mix of synthetic data and, even more crucially, reinforcement learning.
* A "verifier" step, also managed internally by the model or, maybe, with access to external tools. For the high-compute variant, o3 generated as many as 1,024 answers for each question and then had to filter out the best one (a minimal sketch of this idea follows below).
* Relying on "search" and exploring many generation paths at once. This was the original approach promoted behind the early versions of o1. Today, I'm increasingly thinking "search" covers a wide variety of strategies leveraging the model's internal metrics (entropix, SAEs, attention…), not just the most likely tokens and beams.
* Even more brutally, just some form of "layer looping". Each model is made of different layers that are almost partial models under the hood and can run concurrently. This would be a "bitter lesson" type of solution, as ultimately it is more about brute force than clever design.

There is a fifth possibility that paints a more pessimistic picture: fine-tuning. o3 was trained on 75% of the public ARC-AGI set (it isn't clear which part, nor how) and hasn't been tested on the private, unreleased set. This kind of benchmark chasing is currently widespread: most of the highly publicized LLMs are trained on the very sets they claim to perform on.

If o3 holds up, the real question is on the inference side. There aren't even enough GPUs in the world to make o3 work at scale for all the emerging demand for business agents. The real challenge that starts now will not be scaling up, but scaling down.
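To make the "verifier" point concrete, here is a minimal best-of-N sketch in Python. It is purely illustrative: `generate_answer` and `verifier_score` are hypothetical stand-ins, and nothing here reflects OpenAI's actual (undisclosed) pipeline.

```python
# Hypothetical best-of-N sampling with a verifier.
# `generate_answer` and `verifier_score` are placeholders, not OpenAI's method.

import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate_answer: Callable[[str], str],
    verifier_score: Callable[[str, str], float],
    n: int = 1024,
) -> Tuple[str, float]:
    """Sample n candidate answers, keep the one the verifier scores highest."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n):
        answer = generate_answer(prompt)        # one full chain-of-thought + answer
        score = verifier_score(prompt, answer)  # internal or tool-assisted check
        candidates.append((answer, score))
    return max(candidates, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    def toy_generate(prompt: str) -> str:
        return f"answer-{random.randint(0, 9)}"

    def toy_score(prompt: str, answer: str) -> float:
        return random.random()

    best_answer, best_score = best_of_n("ARC puzzle", toy_generate, toy_score, n=16)
    print(best_answer, round(best_score, 3))
```

Whatever the real selection mechanism is, the sketch shows why the bill scales linearly: 1,024 candidates per question means 1,024 full generations, which is where the inference cost comes from.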
Double down on the scale-down comment. Boris Gamazaychikov just shared his findings on the environmental cost per o3 task, and it is very clear this is not sustainable.
I remember when better LLMs tuned on less but higher-quality data were popular not that long ago, and we also have good SLMs. I feel like LLM improvement is going to be an iterative process of brute-force increases and pruning steps until the next paradigm :)
I believe their strategy for scaling relies on training smaller models on specific goals in their chain-of-thought processes. They make their bigger models generate lots of "thought processes" and evaluate them, to select only the most relevant ones to train the next, smaller models.
Great post as always, though it does seem to me that they tested o3 on the semi-private ARC-AGI dataset and there's no obvious sign of overfitting, especially with a SOTA improvement of this margin. I am cautiously optimistic that significant efficiency improvements can be made now that it has been shown this kind of performance is possible with brute force.
Thankfully there are technically aware, non-fanatic, and scientifically minded AI leaders like yourself correcting the hype-drunk AI faithful who claim the imminent arrival of a new species. Thank you.
The real challenge will indeed be in effectively scaling down 🔥 seriously great post! 🤌🏻
Simon Gosset Etienne Grass Martin Chauvin