Neel Master’s Post

The release of OpenAI Strawberry (o1) is another reason why scaling inference efficiently is important. Inefficient workload utilization drives up cloud costs and reduces gross margins for AI companies. Cedana improves gross margins by 20%-400% by automatically and safely suspending GPU workloads when they are not in use and instantly resuming them on user demand. https://lnkd.in/ekBJKQuB
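
A rough, purely illustrative sketch of the idle-suspend / resume-on-demand pattern described above. checkpoint_workload and restore_workload are hypothetical placeholders, not Cedana's actual API, and the 5-minute idle threshold is an assumed policy.

# Purely illustrative: suspend a GPU workload when idle, restore it on demand.
import time

IDLE_THRESHOLD_S = 300  # assumed policy: suspend after 5 minutes without requests

def checkpoint_workload(workload_id: str) -> None:
    # Hypothetical: snapshot process + GPU state, then release the GPU.
    print(f"checkpointing {workload_id}")

def restore_workload(workload_id: str) -> None:
    # Hypothetical: restore the snapshot onto a GPU before serving again.
    print(f"restoring {workload_id}")

def autosuspend_loop(workload_id: str, seconds_since_last_request) -> None:
    # seconds_since_last_request is any callable reporting idle time for the workload.
    suspended = False
    while True:
        idle_for = seconds_since_last_request()
        if not suspended and idle_for > IDLE_THRESHOLD_S:
            checkpoint_workload(workload_id)   # stop paying for an idle GPU
            suspended = True
        elif suspended and idle_for < 1:       # a new request just arrived
            restore_workload(workload_id)
            suspended = False
        time.sleep(1)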

Jim Fan

NVIDIA Senior Research Manager & Lead of Embodied AI (GEAR Group). Stanford Ph.D. Building humanoid robot and gaming foundation models. OpenAI's first intern. Sharing insights on the bleeding edge of AI.

OpenAI Strawberry (o1) is out! We are finally seeing the paradigm of inference-time scaling popularized and deployed in production. As Sutton said in the Bitter Lesson, there are only two techniques that scale indefinitely with compute: learning and search. It's time to shift focus to the latter.

1. You don't need a huge model to perform reasoning. Lots of parameters are dedicated to memorizing facts in order to perform well on benchmarks like TriviaQA. It is possible to factor out reasoning from knowledge, i.e. a small "reasoning core" that knows how to call tools like a browser and a code verifier. Pre-training compute may be decreased.

2. A huge amount of compute is shifted to serving inference instead of pre-/post-training. LLMs are text-based simulators. By rolling out many possible strategies and scenarios in the simulator, the model will eventually converge to good solutions. The process is a well-studied problem, like AlphaGo's Monte Carlo tree search (MCTS).

3. OpenAI must have figured out the inference scaling law a long time ago, while academia is only just discovering it. Two papers came out on arXiv a week apart last month:
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling. Brown et al. find that DeepSeek-Coder climbs from 15.9% with one sample to 56% with 250 samples on SWE-Bench, beating Sonnet-3.5. (A minimal repeated-sampling sketch follows after this list.)
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. Snell et al. find that PaLM 2-S beats a 14x larger model on MATH with test-time search.

4. Productionizing o1 is much harder than nailing the academic benchmarks. For reasoning problems in the wild, how do you decide when to stop searching? What's the reward function? The success criterion? When do you call tools like a code interpreter in the loop? How do you factor in the compute cost of those CPU processes? Their research post didn't share much.

5. Strawberry easily becomes a data flywheel. If the answer is correct, the entire search trace becomes a mini dataset of training examples containing both positive and negative rewards. This in turn improves the reasoning core for future versions of GPT, similar to how AlphaGo's value network (used to evaluate the quality of each board position) improves as MCTS generates more and more refined training data. (See the data-flywheel sketch after this list.)
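
To make points 2 and 3 concrete, here is a minimal, illustrative sketch of inference-time scaling via repeated sampling against a verifier, in the spirit of the Large Language Monkeys setup. The functions generate_candidate and verify are hypothetical placeholders for a real LLM sampling call and an external checker (e.g. unit tests or a code verifier); they are not OpenAI or DeepSeek APIs.

# Illustrative sketch only: best-of-N repeated sampling with a verifier.
import random
from typing import Callable, Optional

def generate_candidate(prompt: str) -> str:
    # Placeholder for one stochastic sample from a model at temperature > 0.
    return f"candidate for {prompt!r} (draw={random.random():.3f})"

def verify(candidate: str) -> bool:
    # Placeholder verifier; pretend a single sample passes ~16% of the time,
    # roughly the one-sample SWE-Bench rate quoted above.
    return random.random() < 0.16

def solve_by_repeated_sampling(prompt: str, n_samples: int,
                               verifier: Callable[[str], bool] = verify) -> Optional[str]:
    # Draw independent candidates and return the first one the verifier accepts.
    # Coverage (the chance that at least one sample passes) grows with n_samples,
    # which is the inference-time scaling effect both papers measure.
    for _ in range(n_samples):
        candidate = generate_candidate(prompt)
        if verifier(candidate):
            return candidate
    return None

if __name__ == "__main__":
    answer = solve_by_repeated_sampling("fix the failing test", n_samples=250)
    print("verified solution found" if answer else "no sample passed the verifier")

The whole effect depends on having a cheap, reliable verifier; when one exists, drawing more samples trades inference compute for accuracy without touching the model weights.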

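In the same hedged spirit, a sketch of the data flywheel from point 5: after a search run finishes, each explored step can be logged as a training example with a positive or negative reward depending on whether it ended up on the verified solution path. SearchStep and harvest_trace are illustrative names introduced here, not anything from OpenAI's pipeline.

# Illustrative only: turning a finished search trace into training rows.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SearchStep:
    state: str               # partial reasoning context (the board-position analogue)
    action: str              # the step the model proposed from this state
    on_solution_path: bool   # did this step end up on the verified solution?

def harvest_trace(trace: List[SearchStep]) -> List[Dict]:
    # Each step becomes a (prompt, completion, reward) row; steps on the
    # verified path get +1, dead ends get -1, so both positive and negative
    # examples feed back into training the next reasoning core.
    return [{"prompt": s.state, "completion": s.action,
             "reward": 1.0 if s.on_solution_path else -1.0} for s in trace]

# Toy example: one useful step and one dead end explored from the same state.
rows = harvest_trace([
    SearchStep("problem statement", "try induction on n", True),
    SearchStep("problem statement", "guess the answer is 42", False),
])
print(rows)
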

Cool! Are you familiar with these guys? https://www.enchargeai.com/technology - they're increasing inference throughput on the edge.
