Amit Sharma’s Post

We've been playing with o-1 models for the last few days (via APIs + ChatGPT) and absorbing the technical paper. Our initial view:

* o-1 models are super impressive at times but disappointing at others.
* These are preview versions; we hope the 'production' versions will be significantly more reliable.
* This is not the next rung on the gpt3.5 --> gpt4 --> gpt4o ladder, but a new direction/branch, the way gpt2 --> gpt3 was.

The key question is which tasks are suited to o-1 type models. Given the costs and rate limits, these are definitely not suited for anything more than experimentation *at the moment*. Sam Altman said that this is an important step toward making agentic frameworks possible. It's hard to understand what that really means, but I'm assuming it's chaining prompts to execute a series of tasks based on conditional logic (and rewarding answers that are objectively correct); the first sketch below shows what that pattern might look like.

Key questions:

1. How does the model decide how much time to "think" at inference time? Since it doesn't allow RAG, function calling, browsing, etc., all the "critical thinking" is basically reflecting on its trained "chains of thought". Does that mean more inference time will produce better results?

2. Now that "step-by-step" is not required in prompting, how should prompts be modified to get the best out of o-1? The second sketch below shows one guess.

3. How should we interpret "Show your Thoughts"? OpenAI decided to hide the real CoT (for obvious reasons) and let the model 'summarize' those chains. So this is like a human trying to "articulate" their "thoughts" in a way an average human being can understand. Is it possible to get a more specific "articulation" if the end user is an expert and can understand nuances much better? This would be huge in the legal domain.

4. A bold claim from the technical paper: "Through training, the models learn to refine their thinking process, try different strategies, and recognize their mistakes." The last part is really intriguing. I've not been able to verify it (or find a paper showing that a model recognized it was going down a wrong path and then corrected itself).

5. Harvey has mentioned that it's working with OpenAI to develop these models. What does that mean in terms of RLHF'ing transactional or litigation-related legal tasks? OpenAI's announcements say it's specifically RLHF'd for coding, math, physics, chemistry, biology (all highly objective areas, with extra emphasis on safety, red teaming, avoiding jailbreaking, etc.). One of their tests shows that o-1 models are slightly worse than gpt-4o in more subjective areas like personal writing or editing text. Where does that leave highly contextual tasks like transactional review? In our testing (similar prompts), both o-1 mini and preview missed 'Right of First Proposal' language as a clause aggressive toward the buyer in a supply agreement (which also has an exclusivity clause, btw), whereas gpt-4o got it right the first time. We'll share more examples later.

#o-1 #llms #ai Richard Tromans Leonard Park
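To make the "chaining prompts with conditional logic" idea concrete, here is a minimal sketch of the pattern we have in mind. The model names are real, but the task, the grading criterion, and the retry logic are our own illustrative assumptions, not anything OpenAI has described:

```python
# Minimal prompt-chaining sketch: draft -> objective check -> conditional retry.
# The task, the check, and the retry logic are illustrative assumptions only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

contract_text = open("supply_agreement.txt").read()  # placeholder input

# Step 1: the reasoning model drafts an analysis.
draft = ask("o1-preview",
            "List the clauses in this supply agreement that are aggressive "
            "toward the buyer:\n\n" + contract_text)

# Step 2: conditional logic -- a cheaper model grades the draft against an
# objectively checkable criterion.
verdict = ask("gpt-4o",
              "Does the following analysis mention a right of first proposal "
              "or an exclusivity clause? Answer YES or NO only.\n\n" + draft)

# Step 3: branch on the check; feed the miss back and ask for a revision.
if verdict.strip().upper().startswith("NO"):
    draft = ask("o1-preview",
                "Your analysis below missed right-of-first-proposal / "
                "exclusivity language. Revise it:\n\n" + draft)

print(draft)
```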
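On question 2, here is one guess at how the same task might be prompted differently for o1 versus gpt-4o. The general direction (drop the step-by-step scaffolding, keep the ask plain) tracks OpenAI's published prompting advice for o1; the prompts themselves are invented:

```python
# Question 2 sketch: the same task prompted for gpt-4o vs. o1-preview.
# Prompts are invented for illustration.
from openai import OpenAI

client = OpenAI()

contract_text = open("supply_agreement.txt").read()  # placeholder input
task = "Identify the clauses in the contract below that favor the supplier."

# gpt-4o era habit: explicit chain-of-thought scaffolding in the prompt.
gpt4o_prompt = ("You are an expert contracts attorney. Think step by step: "
                "first list every clause, then evaluate each, then summarize."
                "\n\n" + task)

# o1 style: a plain, direct ask -- the model supplies its own chain of thought.
o1_prompt = task

for model, prompt in [("gpt-4o", gpt4o_prompt), ("o1-preview", o1_prompt)]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt + "\n\n" + contract_text}],
    )
    print(f"--- {model} ---\n{resp.choices[0].message.content[:300]}\n")
```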

[Attachment: bar chart]
Leonard Park

Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs

2mo

I don't actually know most of these answers yet, but initial impressions and information shared by OpenAI *suggest*:

1) OpenAI's articles refer to having done their o1 benchmarking "on the maximal test-time compute setting," which suggests there might be a hyperparameter controlling the model's effort that users may eventually be able to manage.

2) Model creators have already been baking step-by-step into their fine-tuning, such that much of the time it was already unnecessary to include. But o1 shows a few signs of needing more hand-holding on the prompt side. Whereas other frontier model training has "prepared the LLM for the path," with o1 there are hints that we again need to "prepare the path for the LLM". Complex prompts appear to diminish o1's performance, and OpenAI recommends trimming context to only the most relevant material necessary for answering the question. (A sketch of both ideas follows below.)
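Both points, sketched in code. The `reasoning_effort` knob in point 1 is purely hypothetical (nothing like it is exposed in the o1-preview API today); `extra_body` is just the OpenAI Python SDK's escape hatch for passing unrecognized fields. The keyword filter in point 2 is a toy stand-in for real relevance trimming:

```python
# Point 2 first: "prepare the path for the LLM" -- trim context to only the
# passages the question actually needs, instead of pasting the whole document.
from openai import OpenAI

client = OpenAI()

full_document = open("supply_agreement.txt").read()  # placeholder input
relevant = [p for p in full_document.split("\n\n")
            if "exclusiv" in p.lower() or "first proposal" in p.lower()]
trimmed_prompt = ("Are the clauses below aggressive toward the buyer?\n\n"
                  + "\n\n".join(relevant))

# Point 1: IF a test-time-compute hyperparameter ever surfaces, usage might
# look like this. "reasoning_effort" is HYPOTHETICAL -- not a real o1-preview
# parameter; extra_body passes arbitrary fields through to the request body.
resp = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": trimmed_prompt}],
    extra_body={"reasoning_effort": "high"},  # hypothetical knob
)
print(resp.choices[0].message.content)
```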

Chandrashekhar Vattikuti

Senior Vice President (InMobi Telco Cloud)

2mo

Lovely. Keep posting updates, Amit Sharma. Saves me a bunch of o-1 experimentation :-)
