Just 10 days after o1's public debut, we’re thrilled to unveil the open-source version of the groundbreaking technique behind its success: scaling test-time compute 🧠💡 By giving models more "time to think," Llama 1B outperforms Llama 8B in math—beating a model 8x its size. The full recipe is open-source🤯 This is the power of open science and open-source AI! 🌍✨
Great job. But in my opinion, we should finally accept the requirement of modularity in AI. You write about self-verification and the power of strong verifiers, which are the foundation of the success of test-time scaling. Intelligence requires various components - language models are just a part. Verifiers should be treated as separate units - then we can measure how good the language model itself is, and how good the verifier is. It is the end of LLM hype, and it is the 'beginning' of (or we should finally get back to) modular AI. And think about adding additional components. LLMs will not solve AI alone. Regardless of how far we will stretch their definition.
Interesting. However, while it beats a model 8x bigger, this requires 32 times the compute? Is this worth it?
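One way to arrive at a figure like 32x is a back-of-the-envelope estimate. The sketch below assumes that generation cost scales linearly with parameter count and with the number of sampled generations, and it ignores the cost of running a verifier. The function name `relative_compute` and the choice of 256 generations are illustrative assumptions, not numbers taken from the post.

```python
def relative_compute(small_params: float, large_params: float, n_generations: int) -> float:
    """Rough cost of sampling n_generations from the small model,
    relative to a single generation from the large model.

    Assumption: cost per generated token is proportional to parameter
    count, and all generations have similar length. Verifier cost is
    deliberately left out of this estimate.
    """
    return n_generations * small_params / large_params


# Hypothetical setting: a 1B model sampled 256 times vs. one pass of an 8B model.
ratio = relative_compute(small_params=1e9, large_params=8e9, n_generations=256)
print(f"Small model with 256 samples costs about {ratio:.0f}x one large-model pass")
```

Under these assumptions the break-even point is 8 generations. Anything beyond that spends more raw compute than the larger model, so the trade-off is really about memory footprint and deployability rather than total FLOPs.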
This is a remarkable achievement for open-source models. The open-source, enterprise-grade model service platform GPUStack can support all open-source models running on various GPUs. With more and better open-source models emerging, GPUStack's capabilities have been significantly enhanced. I am full of confidence in the future development of both GPUStack and open-source models. I also look forward to GPUStack establishing a closer collaboration and integration with Hugging Face as soon as possible.
For more insights into what LLMs are really capable of, check out: What Can Large Language Models Achieve? www.cohorte.co/blog/what-can-large-language-models-achieve
Great results! At our core, we're seeing similar gains with Behavior Trees in virtual worlds - where structured thinking paths dramatically scale AI reasoning without requiring larger models. Just as giving LLMs more compute time improves performance, hierarchical decision trees let us achieve deeper cognition with constrained resources.
That's a great "proof". It's even better that it's open sourced. I love this pattern - a good new development is released as a proprietary product, like o1, then the open source community responds with something just as good or better. If you're wondering where the dotted line for 0-shot CoT is, click through to the blogpost. The 0-shot CoT lines are flat lines where the MATH-500 accuracy doesn't increase with the number of generations per problem, because they're zero-shot.
Okay, I guess this is still a work in progress, but 256 generations does not seem optimal to me at the moment. But if we're talking about scale then it's a different story. I'm really hoping Meta can use this technique on something bigger. Probably a 70B model, so we can actually see how it fares at that kind of scale. I'm curious as to why there isn't any mention of, or even a measurement presented for, latency for all the techniques that were tested. Was that not something you were looking at? I find that hard to believe. Anyway, thanks a lot for this work. This is a huge step forward and I believe by the end of next year we should have open-source models better than o1 at reasoning.
After reading "Scaling test-time compute with open models," I can see excellent applications for my "Agentic AI for Content Marketing" project. It has opened up new possibilities and reignited my passion for the project. The most interesting aspects are: ✅ Building a self-verification system for content quality assessment - mentioned in "Where to go from here?" as the "holy grail" where models can validate their own outputs. ✅ Developing an efficient content pipeline using: • Best-of-N sampling • Beam search • Diverse Verifier Tree Search (DVTS) ✅ Creating a feedback loop system using Process Reward Models (PRM) for continuous content quality improvement and evaluation. Thank you for sharing these techniques and open-source code. I believe this will significantly accelerate the development of my project 🙏
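For readers unfamiliar with the first technique listed above, here is a minimal sketch of Best-of-N selection. It assumes the candidates have already been generated and scored by some reward model. The candidate strings and scores are made-up toy values, and the function names are hypothetical. The weighted variant, which sums scores across identical answers, is included for comparison and is not mentioned in the comment itself.

```python
from collections import defaultdict
from typing import List, Tuple

# Each candidate is (final_answer, reward_score). In a real pipeline the
# score would come from a reward model such as a PRM; here it is toy data.
Candidate = Tuple[str, float]


def best_of_n(candidates: List[Candidate]) -> str:
    """Vanilla Best-of-N: return the answer with the single highest score."""
    answer, _ = max(candidates, key=lambda c: c[1])
    return answer


def weighted_best_of_n(candidates: List[Candidate]) -> str:
    """Weighted Best-of-N: sum scores over identical answers, pick the largest total."""
    totals = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    return max(totals, key=lambda a: totals[a])


# Toy example: "41" has the best single score, "42" has the best total.
toy_candidates = [("42", 0.6), ("41", 0.9), ("42", 0.5)]
print(best_of_n(toy_candidates))
print(weighted_best_of_n(toy_candidates))
```

The two strategies can disagree, as the toy data shows: a single confident but wrong sample wins under the vanilla rule, while the weighted rule favors answers that several samples agree on.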
Quote: "Llama 1B outperforms Llama 8B in math". Can it solve the following math problem? See the prompt below. It's a high school problem taken from the CIE high school pure math textbook. ChatGPT is able to solve the problem step by step up to the answer, while Gemini attempted to solve it in a few steps but didn't reach the answer. So, Gemini attempted the problem in a half-finished manner. The math-solving capability of ChatGPT is impressive, at least for high school-level math. ChatGPT surpasses the step-by-step solving capabilities of sophisticated computer algebra systems (CAS) like Mathematica, Maple, and MATLAB. _____ Prompt _____ Solve the definite integration of the following equation, '(2*x^3 + 3*x^2 + 28)/((x + 2)*(x^2 + 4))' with respect to x, where the lowerbound x=1 and upperbound x=2. Also show step by step from start to the solution.
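For reference, the integral in the prompt can be checked independently of any chatbot. Polynomial division followed by partial fractions gives the integrand as 2 + 3/(x + 2) - 4x/(x^2 + 4), so the antiderivative is 2x + 3 ln|x + 2| - 2 ln(x^2 + 4) and the definite integral from 1 to 2 is 2 + ln(25/27), approximately 1.9230. The sketch below verifies this numerically with a hand-written composite Simpson's rule. The helper names are illustrative, not from any library.

```python
import math
from typing import Callable


def integrand(x: float) -> float:
    """The rational function from the prompt."""
    return (2 * x**3 + 3 * x**2 + 28) / ((x + 2) * (x**2 + 4))


def decomposed(x: float) -> float:
    """The same function after polynomial division and partial fractions."""
    return 2 + 3 / (x + 2) - 4 * x / (x**2 + 4)


def simpson(f: Callable[[float], float], a: float, b: float, n: int = 1000) -> float:
    """Composite Simpson's rule on [a, b] with n subintervals (n must be even)."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd interior nodes get weight 4, even interior nodes get weight 2.
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


closed_form = 2 + math.log(25 / 27)
numeric = simpson(integrand, 1.0, 2.0)
print(f"closed form: {closed_form:.6f}, numeric: {numeric:.6f}")
```

Agreement between the closed form and the numerical estimate gives a quick way to grade any model's step-by-step answer to this prompt.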
Co-founder & CEO at Hugging Face
The detailed blogpost is here: https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute