Today, we shared evals for early versions of the next models in our o-model reasoning series: OpenAI o3 and o3-mini.
On several of the most challenging frontier evals, OpenAI o3 sets new milestones for what’s possible in coding, math, and scientific reasoning. It also makes significant progress on the ARC-AGI evaluation for the first time.
We plan to deploy these models early next year, but we’re opening up early access applications for safety and security researchers to test these frontier models starting today: https://lnkd.in/ghJWP_ui
Good morning. We have an exciting one for you today. We started this 12-day event 12 days ago with the launch of o1, our first reasoning model. It's been amazing to see what people are doing with it, and very gratifying to see how much people like it. We view this as the beginning of the next phase of AI, where you can use these models to do increasingly complex tasks that require a lot of reasoning. So for the last day of this event, we thought it would be fun to go from one frontier model to our next frontier model. Today we're going to talk about that next frontier model, which you would think logically should be called o2, but out of respect to our friends at Telefónica, and in the grand tradition of OpenAI being truly bad at names, it's going to be called o3. Actually, we're not going to launch one model; we're going to announce two models today: o3 and o3-mini. o3 is a very, very smart model. o3-mini is an incredibly smart model too, and really good on performance and cost.

To get the bad news out of the way first, we're not going to publicly launch these today. The good news is we're going to make them available for public safety testing starting today; you can apply, and we'll talk about that later. We've taken safety testing seriously as our models get more and more capable, and at this new level of capability we want to try adding a new part of our safety testing procedure, which is to allow public access for researchers who want to help us test. We'll talk more at the end about when we expect to make these models generally available. We're so excited to show you what they can do and to talk about their performance, and I've got a little surprise: we'll show you some demos. And without further ado, I'll hand it over to Mark to talk about it.

Cool. Thank you so much, Sam. My name is Mark, I lead research at OpenAI, and I want to talk a little bit about o3's capabilities. o3 is a really strong model on very hard technical benchmarks, and I want to start with the coding benchmarks, if you could bring those up. On software-style benchmarks we have SWE-bench Verified, which is a benchmark consisting of real-world software tasks. We're seeing that o3 performs at about 71.7% accuracy, which is more than 20 percentage points better than our o1 models. This signifies that we're really climbing the frontier of utility as well. On competition code, o1 achieves an Elo of about 1891 on the contest-coding site Codeforces. At our most aggressive high test-time-compute settings, o3 is able to achieve an Elo of about 2727.

So, Mark was a competitive programmer, and actually still coaches competitive programming; he's very, very good. I think my best at a comparable site was about 2500. That's tough. I will say this is also better than our chief scientist Jakub's score. I think there's one person at OpenAI who is still at 3000-something. Yeah, so he has a few more months to enjoy that. Great. I mean, this model is incredible at programming.

Yeah, and not just programming but also mathematics. On competition math benchmarks, just like competitive programming, we achieve very, very strong scores. o3 gets about 96.7% accuracy on AIME, versus o1's performance of 83.3%. What's your best AIME score? I did get a perfect score once. Alright, so I'm safe, right? Yeah.
What this signifies is that o3 often misses just one question whenever we test it on this very hard feeder exam for the USA Mathematical Olympiad. There's another very tough benchmark called GPQA Diamond, which measures the model's performance on PhD-level science questions. Here we get another state-of-the-art score of 87.7%, which is about 10 percentage points better than o1's performance of 78%. Just to put this in perspective, if you take an expert with a PhD, they typically get about 70% in their field of strength.

One thing you might notice from some of these benchmarks is that we're reaching or nearing saturation on a lot of them. The last year has really highlighted the need for harder benchmarks to accurately assess where our frontier models lie, and I think a couple have emerged as fairly promising over the last months. One in particular I want to call out is Epoch AI's FrontierMath benchmark. You can see the scores look a lot lower than they did for the previous benchmarks we showed, and that's because this is considered the toughest mathematical benchmark out there today. It's a dataset consisting of novel, unpublished, and extremely hard problems; it would take professional mathematicians hours or even days to solve one of them. Today, all offerings out there have less than 2% accuracy on this benchmark, and with o3, in aggressive test-time settings, we're able to get over 25%. Yeah. That's awesome.

In addition to Epoch AI's FrontierMath benchmark, we have one more surprise for you. I want to talk about the ARC benchmark at this point, and I would love to invite one of our friends, Greg, the president of the ARC Prize Foundation, on to talk about it.

Wonderful. Sam and Mark, thank you very much for having us today. Of course. Hello everybody. My name is Greg Kamradt and I'm the president of the ARC Prize Foundation. ARC Prize is a nonprofit with the mission of being a North Star towards AGI through enduring benchmarks. Our first benchmark, ARC-AGI, was developed in 2019 by François Chollet in his paper "On the Measure of Intelligence." It has been unbeaten for five years now, which in the AI world feels like centuries. A system that beats ARC-AGI is going to be an important milestone towards general intelligence, and I'm excited to say that today we have a new state-of-the-art score to announce. Before I get into that, I want to talk about what ARC-AGI is, so let me show you an example. ARC-AGI is all about input examples and output examples. The goal is to understand the rule of the transformation and apply it to produce the output. So Sam, what do you think is happening here? Probably putting a dark blue square in the empty space. Yes, that is exactly it. It's easy for humans to intuitively guess the rule, but it's actually surprisingly hard for AI to understand what's going on. I want to show one more, harder example. Now, Mark, I'm going to put you on the spot: what do you think is going on in this task? OK, so you take each of these yellow squares, you count the number of colored squares inside, and you create a border effect with that. That is exactly it. And that's much quicker than most people.
So congratulations on that. What's interesting, though, is that AI has not been able to solve this problem thus far, even though we verified that a panel of humans could actually do it. The unique part about ARC-AGI is that every task requires distinct skills. What I mean by that is there won't be another task where you need to fill in the corners of blue squares. We do that on purpose, because we want to test the model's ability to learn new skills on the fly; we don't just want it to repeat what it has already memorized. That's the whole point.

ARC-AGI version one took five years to go from 0% to 5% with leading frontier models. Today, however, I'm very excited to say that o3 has scored a new state-of-the-art result that we have verified. On low compute, o3 scored 75.7% on ARC-AGI's semi-private holdout set. This is extremely impressive, because it is within the compute requirements we have for our public leaderboard, and it is the new number-one entry on ARC-AGI-Pub. So congratulations on that. As a capabilities demonstration, when we asked o3 to think longer and ramped it up to high compute, o3 was able to score 87.5% on the same hidden holdout set. This is especially important because human performance is comparable, at an 85% threshold, so being above this is a major milestone. We have never tested a system or model that has done this before, so this is new territory in the ARC-AGI world. Congratulations. With that, congratulations on making such a great benchmark.

Yeah, when I look at these scores, I realize I need to shift my worldview a little bit and fix my intuitions about what AI can actually do and what it's capable of, especially in this o3 world. But the work is also not over yet, and these are still the early days of AI, so we need more enduring benchmarks like ARC-AGI to help measure and guide progress. I'm excited to accelerate that progress, and I'm excited to partner with OpenAI next year to develop our next frontier benchmark. Amazing. You know, it's also a benchmark that we've been targeting and that's been on our mind for a very long time, so we're excited to work with you in the future. Worth mentioning: we think it's an awesome benchmark, but we didn't do anything special for it; this is just the general o3. But yeah, we really appreciate the partnership. Fun to do. Absolutely. And even though o3 has done so well, ARC Prize will continue in 2025, and anybody can find out more at arcprize.org. Great. Thank you so much. Yeah, absolutely.
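For anyone curious how an ARC-AGI task is actually represented: each task is a handful of input/output grids of small integers, and a solver has to infer the transformation from the training pairs and apply it to a held-out test input, with exact match against the hidden output as the score. Below is a minimal, illustrative Python sketch; the toy grids and the fill rule (echoing the "dark blue square in the empty space" example from the demo) are invented for illustration and are not drawn from the real dataset.

```python
from typing import List

Grid = List[List[int]]

# Toy task in the spirit of ARC-AGI's JSON format: "train" pairs show the rule,
# "test" holds the input the solver must transform (grids and rule are made up).
toy_task = {
    "train": [
        {
            "input": [[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]],
            "output": [[9, 1, 9],
                       [1, 9, 1],
                       [9, 1, 9]],
        },
    ],
    "test": [
        {"input": [[1, 0],
                   [0, 1]]},
    ],
}


def guessed_rule(grid: Grid) -> Grid:
    """Invented rule for this toy task: fill every empty (0) cell with colour 9."""
    return [[9 if cell == 0 else cell for cell in row] for row in grid]


# A solver is scored by exact match against the hidden test output;
# here we only sanity-check the guessed rule against the training pair.
for pair in toy_task["train"]:
    assert guessed_rule(pair["input"]) == pair["output"]

print(guessed_rule(toy_task["test"][0]["input"]))  # [[1, 9], [9, 1]]
```

As Greg explains above, each task uses a different rule, so a solver cannot reuse a memorized transformation; it has to induce a new one from just a few training pairs.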
Hey everyone! We’re Ana News, a new app designed to help you manage information overload by curating and summarizing the updates that matter most to you. We’d love for you to check us out and let us know what you think: https://ana.news.
"12 Days of OpenAI Announcements"
1. Dec 20: Launch of o3 & o3-mini models: advancements in reasoning benchmarks and deliberative alignment.
2. Dec 19: Updates to macOS ChatGPT app—interoperability with IntelliJ IDEA & Notion, and Advanced Voice Mode for seamless integration.
3. Dec 18: Access ChatGPT via toll-free phone number or WhatsApp for enhanced accessibility.
4. Dec 17: Developer updates, including the o1 model in the API, a reasoning effort parameter, and Go/Java SDKs (a minimal API sketch follows this list).
5. Dec 16: ChatGPT Search now available for free users with faster web integration.
6. Dec 13: Introduction of Projects, an organizational tool for managing ChatGPT conversations.
7. Dec 12: Advanced Voice Mode with screen-sharing, visual capabilities, and a festive Santa voice.
8. Dec 11: ChatGPT integrated into iOS 18.2, enhancing Siri, Visual Intelligence, and Writing Tools.
9. Dec 10: Full launch of Canvas for collaborative workflows with GPT-4o.
10. Dec 9: Release of Sora Turbo, a versatile video-generation model with storyboard generation.
11. Dec 6: Expansion of Reinforcement Fine-Tuning Research Program for task-specific model optimization.
12. Dec 5: Unveiling of ChatGPT Pro ($200/month) and full rollout of the o1 model with improved performance.
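Item 4 above mentions o1 in the API together with the new reasoning effort parameter; here is a minimal sketch of what such a call might look like with the official Python SDK. Treat it as an assumption-laden illustration rather than an official snippet: the model name, parameter values, and availability depend on your SDK version and account access, and the prompt is just an example.

```python
# Minimal sketch: calling o1 through the API with the reasoning effort
# parameter from the Dec 17 developer update. Assumes OPENAI_API_KEY is set,
# an SDK version that exposes reasoning_effort, and o1 API access on the account.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1",                # reasoning model made available in the API
    reasoning_effort="high",   # trade cost/latency for more reasoning: "low" / "medium" / "high"
    messages=[
        {"role": "user", "content": "How many prime numbers are there below 100?"}
    ],
)

print(response.choices[0].message.content)
```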
Amazing guys.
Amazing achievement!
With O3 reaching 87.5% in ARC-AGI, it seems the threshold to AGI is now crossed, aligning with my January 2024 prediction. This marks a transformative moment for the future of AI.
👍
• Allow ChatGPT to swap the default icon/sphere for a customizable avatar, giving users the freedom to choose or create their own visual representations and making use of the chatbot more humanized and immersive. I am undergoing cancer treatment, so this idea means a great deal to me and to everyone!!!
Exciting times! o3 and o3-mini are truly redefining how we interact with AI. The versatility and convenience of the mini version are game-changers for on-the-go productivity. Kudos to you guys for this incredible innovation!
Subject: Incident Report: Confidentiality Breach and Error Confirmation
Dear OpenAI Support Team,
I am writing to formally report a critical issue I experienced on your platform. During a session with the AI assistant, I explicitly requested the deletion of sensitive project information. Despite this request, the data remained accessible, which directly led to unauthorized access by a third party.
Here are the key points of the incident:
Request Made: I explicitly instructed the AI to delete specific information related to my confidential project.
Error Occurred: Despite my request, the information was not deleted as expected, remaining visible in the conversation history.
Acknowledgment of Error: The AI assistant acknowledged this as an error, confirming that the data should have been deleted upon my request.
Consequences of the Breach:
A third party accessed the project details, leading to significant financial loss and reputational harm.
This breach has directly jeopardized my professional relationships and opportunities.
Interesting news! AI is always evolving, just like we are at Alai!