Fascinating: In 2-hour sprints, AI agents outperform human experts at ML engineering tasks like optimizing GPU kernel. But humans pull ahead over longer periods - scoring 2x better at 32 hours. AI is faster but struggles with creative, long-term problem solving (for now?). https://lnkd.in/eNAADTgs
Link to full paper as PDF https://metr.org/AI_R_D_Evaluation_Report.pdf
This fascinates me because it highlights a gap where intuition, creativity, and adaptability, uniquely human traits shine. AI’s efficiency can’t be denied, but true breakthroughs often come from our messy, nonlinear process of creativity ;-)
This observation anecdotally lines up with what I am seeing across AI implementations in larger enterprises. There’s a large volume of low to medium complexity tasks that are repetitive, mundane, and/or shallow context where AI can end-to-end problem solve and automate. And then there’s a long tail of increasing complexity and long context planning and strategic tasks where AI can be a copilot, but definitely not autopilot.
What caught my attention: This matches what I've observed about AI's current capabilities - excellent at rapid iteration but still struggling with complex, long-term reasoning. The prefix-sum optimization case is particularly interesting: an AI agent achieved 0.64ms runtime compared to the best human expert's 0.67ms. But looking deeper, the median AI attempts often made minimal progress, highlighting how these peak performances aren't yet consistently reproducible. 🗝️Key reflection: While AI can rapidly generate and test implementations (10x faster than humans!), it still struggles with novel information and building on progress over time. This suggests we're at an fascinating inflection point - AI can match or exceed human performance in narrow, well-defined tasks, but hasn't yet mastered the full complexity of open-ended research. Important work for understanding frontier AI capabilities and their implications for safety and policy. Looking forward to seeing how others build on this open-sourced benchmark. Thoughts? How do you see this capability gap evolving?
Intriguing insight! It’s a reminder that AI excels in speed and precision for defined tasks, while humans still lead in creativity and sustained problem-solving. The blend of both could be the real game-changer.
Fascinating! wondering how long it’ll take to surpass ‘For now’ and become the problem solver?
This analysis underscores the duality of AI agents' capabilities in research engineering tasks: their remarkable efficiency in short-term, well-defined tasks contrasts sharply with their limitations in longer-term, ambiguous projects. AI agents exhibit exceptional expertise in machine learning topics, as evidenced by their ability to generate novel solutions rapidly—such as creating a custom CUDA kernel that outperformed the best human expert by reducing runtime from 0.67 to 0.64 milliseconds. Their iterative speed, testing implementations over ten times faster than humans, allows them to excel when tasks involve clear goals and tight feedback loops. However, the inability of AI agents to consistently build on progress over extended periods or adapt effectively to novel and complex situations limits their performance on higher time-budget tasks, which often mimic the iterative, exploratory nature of real-world ML research.
This matches my intuition, and your framework. AI is already better at nearly every task, while there are very few jobs it can do.
Research highlights that AI agents are superb at quickly generating standard code solutions, especially when given clear instructions—perfect for voice prompt programming to handle routine tasks efficiently. This opens up opportunities to boost productivity by using voice commands to tap into AI’s strengths, while human developers focus on complex, innovative problems where AI currently struggles.
I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor
4dAnother tidbit from the paper is that AI agents exhibited a form of "reward hacking" or creative cheating: while attempting to improve the efficiency of a training script's runtime, the agent cleverly cheated by simply copying the final output and adding noise to simulate training - even adding parameter dependencies to make the deception harder to detect.