What does OpenAI's new MLE-bench tool offer AI developers for evaluating machine-learning engineering capabilities? "The new benchmarking tool from OpenAI does not specifically address concerns about the future of AI engineering systems, but it opens the door to developing preventative tools." OpenAI's MLE-bench assesses AI agents on 75 real-world Kaggle competitions, enabling systematic evaluation of machine-learning engineering performance.
Chong Yu’s Post
More Relevant Posts
-
'Project-Driven Problem Solving' FREE TIP: "Set The Bar HIGH". Referring to 'writing AI queries' as 'engineering' was lowering the bar. Anyone who knows what it takes to become an actual engineer (electrical, civil, chemical, or what-have-you) knows it's not even the same ballpark. And just like that...the entire 'prompt engineering' industry will be gone. Let's aim for Excellence in everything we do, including celebrating experience-based expertise...which takes time, patience, and a willingness to learn...and sometimes fail...and learn again. *As always, this post is written entirely by a human. Please feel free to share FREE TIPS, and follow/connect to help change the world for the better*. #engineering #science #technology #ai #problemsolving #excellence #creativity #projectmanagement #education #chatgpt
AI Generalist | Helping you master new skills using AI | Head of Oracle Cloud & Oracle GenAI @Tietoevry
Here's my take on the new model from OpenAI: it's another nail in the prompt "engineering" coffin. From OpenAI's official prompting guide: the model performs best with straightforward prompts. Some prompt engineering techniques, like few-shot prompting or instructing the model to "think step by step," may not enhance performance and can sometimes hinder it. — Keep prompts simple and direct: the models excel at understanding and responding to brief, clear instructions without the need for extensive guidance. — Avoid chain-of-thought prompts: since these models perform reasoning internally, prompting them to "think step by step" or "explain your reasoning" is unnecessary. You heard that right: zero prompt engineering needed. This has arrived sooner than I thought. Don't get me wrong - this is a new model, not a replacement for all previous models. It only excels at tasks where reasoning is needed. But it gives a glimpse of what to expect in the future. Do you think we'll still need any form of prompt "engineering" by the time GPT-5 is released? --- Follow Andrejs for more on AI
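A minimal sketch of what "simple and direct" means in practice. The helper below only builds a Chat Completions payload (the request shape follows OpenAI's Python API; the actual network call is commented out so the example stays self-contained), with no few-shot examples and no chain-of-thought scaffolding:

```python
# Sketch: a direct prompt for a reasoning model such as o1-preview.
# No "think step by step" boilerplate -- these models reason internally.

def build_request(task, model="o1-preview"):
    """Return a Chat Completions payload with one brief, direct instruction."""
    return {
        "model": model,
        "messages": [
            # A single unambiguous user message is all the guidance needed.
            {"role": "user", "content": task}
        ],
    }

payload = build_request("Prove that the sum of two odd integers is even.")

# With the official client this would be sent as:
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(**payload)
```

The point is what is absent: no examples, no role-play preamble, no reasoning instructions.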
-
From this entire article, one thing stands out: "MLE-bench, a benchmark designed to assess how effectively AI agents can perform machine learning engineering tasks." https://lnkd.in/gxaN56ii
OpenAI Introduces Swarm, a Framework for Building Multi-Agent Systems
http://analyticsindiamag.com
-
OpenAI's new models, o1-preview and o1-mini, are focused on advancing reasoning capabilities and tackling more complex tasks in areas like science, coding, and math. Although these models lack some features that GPT-4o boasts (like web browsing and file uploads), their strength lies in a more deliberate approach to problem-solving, mimicking human thought processes. For example, while GPT-4o struggled with a qualifying exam for the International Mathematics Olympiad (AIME), the o1 model solved 83% of the problems, a significant improvement. For developers, the o1-mini model stands out as a cost-efficient, faster alternative, excelling in coding tasks and being 80% cheaper than o1-preview. The ability to accurately debug and generate complex code with efficiency makes o1-mini a practical option for startups and businesses with tight budgets. This shift towards more reasoning-focused models marks a step forward in AI's potential to go beyond mere text generation, setting the stage for AI to tackle more advanced use cases like scientific research and healthcare. For startups, this could mean reduced dependency on broader world-knowledge models and access to more specialised, cost-effective solutions. OpenAI's o1 series is a clear signal that the next frontier for AI is in sophisticated, domain-specific reasoning. While many AI models have traditionally focused on breadth, these advancements could position the o1 models as the go-to for industries that require precision over general knowledge, offering startups a fresh, competitive edge by leveraging AI to solve real-world problems more efficiently.
-
I continue to be amazed by how fast OpenAI is pushing the boundaries of innovation. Here are three recent updates that have had a significant impact on our work: 𝟭. 𝗙𝗶𝗻𝗲-𝗧𝘂𝗻𝗶𝗻𝗴 𝘄𝗶𝘁𝗵 𝗩𝗶𝘀𝗶𝗼𝗻: We've been working extensively on fine-tuning GPT-4 to produce highly specific results that couldn't be achieved with prompt engineering alone. Earlier this week, I was discussing with Chad Smith how there was no way to programmatically fine-tune using images. And the very next day, OpenAI launched Vision capabilities for the fine-tuning API (check it out here: https://lnkd.in/gsvG_ZVc). Talk about timing! 𝟮. 𝗗𝗲𝗽𝗹𝗼𝘆𝗶𝗻𝗴 𝗔𝗜 𝗶𝗻 𝗥𝗲𝗴𝘂𝗹𝗮𝘁𝗲𝗱 𝗜𝗻𝗱𝘂𝘀𝘁𝗿𝗶𝗲𝘀: One of the challenges I've been grappling with is how to effectively deploy generative AI in regulated industries, such as healthcare. A use case that came up was medical transcriptions, which I assumed would require a controlled, on-prem solution - something OpenAI didn't offer. I started evaluating other models to self-host, but couldn't find a strong candidate. And then, in perfect synchronicity, Clem Delangue 🤗, CEO of Hugging Face, announced that Whisper is now available under the Apache 2.0 license. This is a game-changer! 𝟯. 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗻𝗴 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲𝗱 𝗥𝗲𝘀𝗽𝗼𝗻𝘀𝗲𝘀: We've also been exploring ways to ensure the reliability of structured outputs. While we made progress through prompt engineering and fine-tuning, we still encountered some inconsistencies. OpenAI's recent release of Structured Outputs in the API (August 2024) builds on JSON mode, extending its capabilities and enabling more predictable responses. This opens up a world of new possibilities. I can't wait to see where we'll be in another six months - or even a year. One thing's for sure: it's an incredible time to be a builder!
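To make point 3 concrete, here is a small sketch of the Structured Outputs request shape (field names follow OpenAI's August 2024 docs and may evolve; the schema name and fields are hypothetical), plus a minimal offline check that a model reply contains the required keys:

```python
import json

# Structured Outputs request fragment: a strict JSON schema the model
# must conform to. "transcription_summary" and its fields are made up
# for illustration.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "transcription_summary",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "patient_id": {"type": "string"},
                "summary": {"type": "string"},
            },
            "required": ["patient_id", "summary"],
            "additionalProperties": False,
        },
    },
}

def validate(raw, schema):
    """Parse a model reply and check the schema's required keys are present."""
    data = json.loads(raw)
    missing = [k for k in schema["required"] if k not in data]
    if missing:
        raise ValueError("missing keys: %s" % missing)
    return data

# A well-formed reply parses cleanly into a record:
reply = '{"patient_id": "A-123", "summary": "Routine follow-up."}'
record = validate(reply, response_format["json_schema"]["schema"])
```

With `strict: True` the API itself constrains generation to the schema; the local check is just a belt-and-suspenders guard for downstream code.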
-
OpenAI's New o1 Models: Advancing AI Reasoning OpenAI has just introduced its o1 model series, featuring o1-preview and o1-mini. These models represent a significant step forward in AI's ability to tackle complex problems through enhanced reasoning. Let's explore what makes these models unique and how they might impact various fields. o1-preview: Thinking Deeper •Built-in Chain-of-Thought: The model now inherently uses step-by-step reasoning, improving consistency and sophistication in problem-solving. • Complex Problem Solver: Excels in science, math, and other fields requiring in-depth analysis. It's not replacing GPT-4o, but rather complementing it for specific tasks. • Thoughtful, but Slower: Takes more time to respond due to its thorough reasoning process. The trade-off? Higher quality outputs for complex tasks. • Resource Intensive: May use more tokens, potentially affecting costs for developers and enterprises. Current Limitations: No support for images, voice, or function calling yet. Primarily focused on text-based reasoning. o1-mini: A Developer's New Ally • Coding Optimized: Faster and more efficient for programming tasks compared to o1-preview. • Cost-Effective: 80% cheaper than o1-preview, making it more accessible for development work. • Balanced Performance: Offers enhanced reasoning with quicker responses, ideal for coding applications. What This Means for You 1. Researchers: Could accelerate discoveries by tackling complex scientific problems more effectively. 2. Developers: o1-mini offers sophisticated code generation and problem-solving, potentially boosting productivity and code quality. 3. Data Analysts: Enhanced reasoning abilities could lead to more nuanced data interpretations and solutions to complex problems. Key Takeaway If you're a developer exploring these new models, consider reaching for o1-mini. It's tailored for coding tasks, offering a sweet spot of improved reasoning with faster performance. 
For broader scientific and math problems, o1-preview might be your go-to. As AI evolves, tools like the o1 series expand the horizons of machine reasoning. They're opening up new possibilities for tackling challenges that once seemed out of reach. What potential do you see for these models in your field? How might they change your approach to complex problems? Share your thoughts! https://lnkd.in/gpHDb8Sz #OpenAI #ArtificialIntelligence #MachineLearning #TechInnovation
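The "80% cheaper" claim is easy to sanity-check with back-of-envelope arithmetic. The per-million-token prices below are the launch-time list prices as I understand them (an assumption; check OpenAI's pricing page for current figures):

```python
# Rough cost comparison between o1-preview and o1-mini.
# (input_price, output_price) in USD per 1M tokens -- assumed launch prices.
PRICES = {
    "o1-preview": (15.00, 60.00),
    "o1-mini": (3.00, 12.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Estimated USD cost of one request for the given token counts."""
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# A typical coding task: 2k prompt tokens, 8k reasoning + completion tokens.
preview = request_cost("o1-preview", 2_000, 8_000)   # ~$0.51
mini = request_cost("o1-mini", 2_000, 8_000)         # ~$0.102
savings = 1 - mini / preview                         # fraction saved
```

With both input and output priced at one-fifth of o1-preview's rates, the saving works out to 80% regardless of the input/output mix.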
Introducing OpenAI o1
openai.com
-
Generative AI example use of the day! (Google NotebookLM) OpenAI just announced their latest model, o1-preview (https://lnkd.in/gTBDrdaz)! I am really excited to try its improved mathematical/scientific analytical capabilities! However, I have so many other things on my plate that I couldn't really look into the actual research/article page. I could have tried "NapkinAI" or "Mapify," but I have used up my free quotas. So, I decided to use Google NotebookLM to summarize it and ask the question I had. Here is what I did: Step 1: Open Google NotebookLM: https://lnkd.in/g6y6rN4N Step 2: I chose "web URL" to let Google NotebookLM analyze and summarize it. Step 3: I asked, "In what areas has this newest model improved, and in what areas does it not surpass the GPT-4o model?" Here is the output: OpenAI o1 Model: Improved Reasoning, Limited Natural Language Performance The newest OpenAI model, o1, demonstrates significant improvement over GPT-4o in areas requiring complex reasoning, but falls short in certain natural language tasks. Here's a breakdown: Areas of Improvement: ● Reasoning and Problem-Solving: o1 exhibits advanced reasoning abilities, outperforming GPT-4o in a variety of challenging benchmarks. These include: ○ Competitive Programming: o1 ranks in the 89th percentile on Codeforces problems, significantly surpassing GPT-4o's 11th-percentile ranking. ○ Mathematics: o1 excels in math, placing among the top 500 students in the US on the American Invitational Mathematics Examination (AIME) and exceeding GPT-4o's performance by a large margin. ○ Science: o1 surpasses human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA), while GPT-4o lags behind.
○ Logical Reasoning and Coding: Examples in the source material showcase o1's superior ability to understand complex patterns, decode ciphertext, and write accurate bash scripts compared to GPT-4o. ● Chain-of-Thought Reasoning: o1 utilizes a "chain of thought" process, similar to human reasoning, which enables it to break down complex problems into smaller steps, recognize errors, and try different approaches. ● Safety and Alignment: The chain-of-thought reasoning also enhances o1's safety and alignment with human values by allowing for better integration of safety policies and enabling observation of the model's thought process. Area Where o1 Doesn't Surpass GPT-4o: ● Natural Language Tasks: While o1 excels in reasoning-heavy tasks, it does not outperform GPT-4o in all areas. Specifically, human preference evaluations indicate that o1 is not preferred for some natural language tasks. The sources do not specify which types of natural language tasks o1 struggles with. This saved me time and gave me more ideas about what I can try and how I can use this new model effectively!
Introducing OpenAI o1
openai.com
-
🚀 OpenAI’s MLE-bench: AI vs. Human Data Scientists 🧠🤖 OpenAI has introduced MLE-bench, a new tool to benchmark AI performance against human data scientists on 75 real-world tasks from Kaggle. Here’s what stood out: 🔹 Performance Highlights: The o1-preview model, in combination with the AIDE framework, delivered results comparable to skilled human data scientists in 17% of competitions. 🔹 Challenges for AI: While AI shines in specific scenarios, tasks demanding creativity and adaptability remain areas where humans excel. 💡 What’s Next? This launch raises an exciting question: How will the synergy between AI and human expertise shape the future of data science? The road ahead promises powerful collaborations, with AI augmenting human capabilities to unlock new possibilities. https://lnkd.in/gyEs66M3 #AI #DataScience #OpenAI #HumanVsAI #FutureOfWork #AIandHumans
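As a purely illustrative sketch (not MLE-bench's actual grading code), "comparable to skilled human data scientists" can be framed as a leaderboard-percentile check: given an agent's score and the human leaderboard for a competition, compute what fraction of human entries the agent outperforms. The leaderboard values below are made up:

```python
# Hypothetical leaderboard-style grading in the spirit of MLE-bench:
# rank an agent's submission against human competition scores.

def human_percentile(agent_score, human_scores, higher_is_better=True):
    """Fraction of human entries the agent outperforms (0.0 to 1.0)."""
    if higher_is_better:
        beaten = sum(s < agent_score for s in human_scores)
    else:  # e.g. error metrics, where lower is better
        beaten = sum(s > agent_score for s in human_scores)
    return beaten / len(human_scores)

# Made-up human leaderboard (accuracy, higher is better):
scores = [0.71, 0.74, 0.78, 0.80, 0.85]
pct = human_percentile(0.82, scores)   # agent beats 4 of 5 entries
```

The real benchmark goes further, awarding Kaggle-style medals per competition, but the percentile framing captures the basic human-versus-agent comparison.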
2410.07095
arxiv.org
-
I've completed the "Working with the OpenAI API" course on DataCamp. This course has equipped me with practical skills in AI, enabling me to leverage the OpenAI API for innovative solutions. Throughout the course, I delved into understanding OpenAI API models, crafting effective prompts, and implementing API calls efficiently—skills that are essential for today's AI-driven technology landscape. I'm excited to apply this knowledge to leverage the power of the OpenAI API to develop solutions to real-world business transformation projects and explore new opportunities in AI and machine learning. If you're working on something exciting or have tips to share, feel free to connect! #OpenAI #MachineLearning #CareerDevelopment #AI
sanjay katyal's Statement of Accomplishment | DataCamp
datacamp.com