George Z. Lin’s Post


Navigating the AI landscape! 🤖🚀💼🌐 AI Leader, Investor, & Advisor | MassChallenge | Wharton VentureLab

HKU and Tencent researchers are tackling the evaluation gap for multi-modal language models with a new benchmark. Plot2Code evaluates how well Multi-modal Large Language Models (MLLMs) can generate executable code from scientific plots, a previously unexplored area. It comprises a curated dataset of 132 high-quality matplotlib plots spanning six distinct plot types, each paired with its source code and a descriptive instruction generated by GPT-4, enabling a comprehensive evaluation of MLLMs' coding capabilities across different input modalities.

The benchmark employs three automatic evaluation metrics: code pass rate, text-match ratio, and GPT-4V overall rating, assessing both the generated code and the fidelity of the rendered image to the original plot. Unlike traditional binary pass/fail evaluations, these metrics give a fine-grained judgment that aligns closely with human evaluation. (A toy sketch of how the first two metrics could be computed is included at the end of this post.)

Evaluating 14 MLLMs, including proprietary models such as GPT-4V and open-source models such as Mini-Gemini, Plot2Code reveals significant challenges: most MLLMs struggle with text-dense plots and rely heavily on the textual instructions, leaving substantial room for improvement in visual coding.

Plot2Code's contributions are multifaceted. It provides a robust benchmark for MLLMs' visual coding capabilities and sets a standard for future work. Its diverse plot types and layered evaluation metrics support a comprehensive assessment that can guide the improvement of MLLMs, and the openly released dataset gives the AI community a valuable resource for ongoing research.

The benchmark supports two evaluation settings: Direct Asking, where the model generates code from the image alone, and Conditional Asking, where the model generates code from the image plus an additional textual instruction (also sketched below). Together they probe performance across different input modalities.

Plot2Code underscores the complexity of visual coding tasks and the need for further advances in MLLMs. It aims to drive research in multi-modal reasoning, text-dense image understanding, and complex code generation, paving the way for more intelligent and versatile multi-modal systems.

Arxiv: https://lnkd.in/e2MkTxbD
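
To make the automatic metrics concrete, here is a minimal Python sketch of how a code pass rate and a text-match ratio could be computed. This is my own illustration, not the official Plot2Code evaluation harness: it assumes generated and reference samples arrive as plain strings of matplotlib code, counts a sample as "passed" if the code executes and draws a figure, and approximates text match by comparing the visible text elements of the two rendered figures.

# Toy sketch, not the official Plot2Code evaluation harness.
import matplotlib
matplotlib.use("Agg")                      # render off-screen, no display needed
import matplotlib.pyplot as plt

def try_render(code_str):
    """Run plotting code; return the resulting figure, or None if it raises."""
    plt.close("all")
    try:
        exec(code_str, {})                 # a real harness would sandbox this properly
        return plt.gcf()
    except Exception:
        return None

def figure_texts(fig):
    """Collect visible text: titles, axis labels, tick labels, annotations."""
    texts = set()
    for ax in fig.get_axes():
        texts.add(ax.get_title())
        texts.add(ax.get_xlabel())
        texts.add(ax.get_ylabel())
        texts.update(t.get_text() for t in ax.texts)
        texts.update(t.get_text() for t in ax.get_xticklabels() + ax.get_yticklabels())
    return {t for t in texts if t}

def evaluate(samples):
    """samples: list of (generated_code, reference_code) string pairs (hypothetical format)."""
    passed, ratios = 0, []
    for gen_code, ref_code in samples:
        ref_fig = try_render(ref_code)
        ref_texts = figure_texts(ref_fig) if ref_fig else set()
        gen_fig = try_render(gen_code)
        if gen_fig is None:                # execution failed: no pass, zero text match
            ratios.append(0.0)
            continue
        passed += 1
        ratios.append(len(ref_texts & figure_texts(gen_fig)) / max(len(ref_texts), 1))
    return {"pass_rate": passed / len(samples),
            "text_match_ratio": sum(ratios) / len(samples)}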
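
And a tiny sketch of how the two evaluation settings differ on the prompt side. The exact wording used by Plot2Code is my assumption; the point is only that Direct Asking sends the plot image alone, while Conditional Asking also includes the GPT-4-generated description.

# Toy sketch of the two prompting settings; prompt wording is illustrative only.
def build_prompt(setting, gpt4_instruction=None):
    base = ("You are given a scientific plot. Write self-contained matplotlib code "
            "that reproduces it as closely as possible.")
    if setting == "direct_asking":          # image only
        return base
    if setting == "conditional_asking":     # image plus GPT-4-generated description
        return base + "\n\nAdditional description of the plot:\n" + (gpt4_instruction or "")
    raise ValueError(f"Unknown setting: {setting}")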


