Researchers from HKU and Tencent are tackling the evaluation of multi-modal language models with a new visual coding benchmark. Plot2Code is designed to test how well Multi-modal Large Language Models (MLLMs) generate executable code from scientific plots, an area that had gone largely unexplored. It comprises a curated dataset of 132 high-quality matplotlib plots spanning six distinct types, each paired with its source code and a descriptive instruction generated by GPT-4, enabling a comprehensive evaluation of MLLMs' coding capabilities across input modalities.

The benchmark employs three automatic evaluation metrics (code pass rate, text-match ratio, and GPT-4V overall rating) to assess both the generated code and the fidelity of the rendered image to the original plot. Unlike traditional binary pass/fail evaluation, these metrics provide a graded judgment of the output that aligns closely with human evaluation. Evaluating 14 MLLMs, including proprietary models such as GPT-4V and open-source models such as Mini-Gemini, Plot2Code reveals significant challenges, particularly with text-dense plots, where most MLLMs rely heavily on textual instructions. This highlights substantial room for improvement in visual coding tasks.

Plot2Code's contributions are multifaceted. It provides a robust benchmark for evaluating MLLMs' visual coding capabilities, and its diverse plot types and evaluation metrics support a comprehensive assessment that can guide the enhancement of MLLMs. The dataset is openly available, offering a valuable resource for ongoing research in the AI community. The benchmark supports two evaluation settings: Direct Asking, where the model generates code from an image alone, and Conditional Asking, where the model generates code from an image plus additional textual instructions, allowing a thorough evaluation across input modalities. Plot2Code underscores the complexity of visual coding tasks and aims to drive future research in multi-modal reasoning, text-dense image understanding, and complex code generation, paving the way for more intelligent and versatile multi-modal systems. Arxiv: https://lnkd.in/e2MkTxbD
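To make the metrics concrete, here is a rough sketch of how a code pass rate and a text-match ratio could be computed for one generated sample. This is not the official Plot2Code harness; the helper names and the text-match heuristic are simplified illustrative assumptions.

```python
# Rough sketch of Plot2Code-style automatic metrics for one sample (not the official
# harness): execute the generated matplotlib code, then compare rendered text elements
# against the reference plot's texts. The text-match heuristic is a simplified stand-in.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

def code_pass(generated_code: str) -> bool:
    """True if the generated matplotlib code executes and produces at least one figure."""
    try:
        plt.close("all")
        exec(generated_code, {"plt": plt})   # run the candidate code in a fresh namespace
        return len(plt.get_fignums()) > 0
    except Exception:
        return False

def text_match_ratio(reference_texts: set, figure) -> float:
    """Fraction of the reference plot's text elements recovered in the rendered figure."""
    figure.canvas.draw()                     # populate tick labels before reading them
    rendered = set()
    for ax in figure.axes:
        rendered.update({ax.get_title(), ax.get_xlabel(), ax.get_ylabel()})
        rendered.update(t.get_text() for t in ax.get_xticklabels() + ax.get_yticklabels())
    rendered.discard("")
    return len(reference_texts & rendered) / max(len(reference_texts), 1)

candidate = "plt.plot([1, 2, 3], [2, 4, 8]); plt.title('Growth'); plt.xlabel('step')"
if code_pass(candidate):
    print("pass: True, text-match:", text_match_ratio({"Growth", "step"}, plt.gcf()))
```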
George Z. Lin’s Post
More Relevant Posts
-
The recent paper “Better & Faster Large Language Models via Multi-token Prediction” by Meta proposes training large language models to predict multiple future tokens at once, as an alternative to conventional next-token prediction. The approach improves sample efficiency and performance, particularly on coding benchmarks. Key highlights from the paper:

1. Model architecture: A shared transformer trunk produces a latent representation of the observed context, followed by “n” independent output heads and a shared unembedding matrix. During training, the model predicts the “n” future tokens in parallel. During inference, either the next-token head alone can be used, or all heads can be leveraged to speed up generation.

2. Memory efficiency: For current Large Language Models (LLMs), the vocabulary size is significantly larger than the latent representation dimension, so the logit vectors become the GPU memory bottleneck. The paper proposes a memory-efficient implementation in which the forward and backward pass of each output head is computed sequentially, with gradients accumulated at the shared trunk. This keeps batch size and GPU memory utilization efficient without sacrificing runtime.

3. Handling "choice points": When generating text, certain tokens act as "choice points" that are crucial for maintaining semantic coherence and relevance; suboptimal predictions at these points can significantly degrade the generated text. Multi-token prediction gives the model lookahead, allowing it to penalize deviations from the desired path and make better decisions at choice points, thereby improving the overall quality of text generation.
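The architecture and the memory trick are easy to picture in code. Below is a minimal PyTorch sketch (not Meta's implementation): a stand-in shared trunk, n small output heads sharing one unembedding matrix, and a training step that runs each head's forward and backward sequentially so only one vocabulary-sized logit tensor is alive at a time. Layer sizes and the single-layer trunk are illustrative assumptions.

```python
# Minimal PyTorch sketch of multi-token prediction with a shared trunk, n output heads,
# and sequential per-head backward passes that accumulate gradients at the trunk output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    def __init__(self, d_model=512, vocab_size=32000, n_future=4, nhead=8):
        super().__init__()
        self.trunk = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)  # stand-in trunk
        self.heads = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, nhead, batch_first=True) for _ in range(n_future)]
        )
        self.unembed = nn.Linear(d_model, vocab_size, bias=False)  # shared unembedding matrix
        self.n_future = n_future

    def training_step(self, x_embed, targets):
        """x_embed: (B, T, d_model) embedded context; targets: (B, T, n_future) ids at offsets +1..+n."""
        h = self.trunk(x_embed)
        z = h.detach().requires_grad_(True)          # cut the graph at the trunk output
        for k, head in enumerate(self.heads):
            logits = self.unembed(head(z))           # (B, T, vocab): only one such tensor alive at a time
            loss_k = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                     targets[..., k].reshape(-1)) / self.n_future
            loss_k.backward()                        # frees this head's logits; grads accumulate in z.grad
        h.backward(gradient=z.grad)                  # single backward pass through the shared trunk
```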
-
🔍 LongRAG: Enhancing Generation through Information Retrieval using Long-Context Language Models - My Key Takeaways from the Paper

🤔 What is it? Researchers have developed LongRAG, a novel approach to open-domain question answering. LongRAG leverages large text blocks (about 4000 tokens) as search units instead of short passages and employs long-context language models for answer extraction.

💡 Why is it important? Traditional RAG systems rely heavily on precise retrieval of short relevant passages, placing a significant burden on the search component. The researchers aimed to balance the workload between search and information processing while harnessing the capabilities of modern long-context language models.

🔧 How does it work? LongRAG consists of two main components:
1. "Long Retriever" - searches for large text blocks (4000+ tokens) based on the query, using embeddings and approximate nearest-neighbor search.
2. "Long Reader" - a language model capable of processing long contexts (about 30,000 tokens) that extracts answers from the retrieved texts.
The researchers utilized off-the-shelf models like BGE for retrieval and Gemini or GPT-4 as readers, without additional training.

📊 What are the results? LongRAG significantly improved performance on open-domain question answering tasks:
- Achieved 62.7% accuracy on the Natural Questions dataset, comparable to top-performing trained RAG systems.
- Reached 64.3% accuracy on HotpotQA (full-wiki), approaching state-of-the-art fully trained RAG models.
Notably, LongRAG requires no special training and uses significantly fewer retrieval units (30 times fewer for Natural Questions), simplifying the search process.

Paper: https://lnkd.in/du7A_ef6
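A rough sketch of the pipeline shape, under the assumption of a generic embed() embedding function (e.g. a BGE model) and an ask_reader() long-context LLM call; this is not the authors' code, just an illustration of packing passages into ~4K-token units, ranking them by embedding similarity, and handing a few to the reader.

```python
# Sketch of a LongRAG-style flow: long retrieval units + a long-context reader.
# embed() and ask_reader() are placeholders for your embedding model and LLM client.
import numpy as np

def group_into_long_units(passages, tokens_per_unit=4000):
    """Greedily pack short passages into retrieval units of roughly 4K tokens."""
    units, current, current_len = [], [], 0
    for p in passages:
        n_tokens = len(p.split())              # crude token estimate for the sketch
        if current and current_len + n_tokens > tokens_per_unit:
            units.append(" ".join(current))
            current, current_len = [], 0
        current.append(p)
        current_len += n_tokens
    if current:
        units.append(" ".join(current))
    return units

def retrieve(query, units, embed, top_k=4):
    """Rank long units by cosine similarity between query and unit embeddings."""
    q = embed([query])[0]
    U = embed(units)                            # (n_units, dim)
    scores = U @ q / (np.linalg.norm(U, axis=1) * np.linalg.norm(q) + 1e-9)
    return [units[i] for i in np.argsort(-scores)[:top_k]]

def answer(query, passages, embed, ask_reader):
    units = group_into_long_units(passages)
    context = "\n\n".join(retrieve(query, units, embed))   # tens of thousands of tokens of context
    prompt = f"Answer the question using the context.\n\nContext:\n{context}\n\nQuestion: {query}"
    return ask_reader(prompt)                   # long-context LLM (e.g. Gemini or GPT-4)
```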
-
Large Language Models (LLMs) for OCR Post-Correction

Optical Character Recognition (OCR) converts text in images into editable data, but it often produces errors due to issues like poor image quality or complex layouts. While OCR technology is valuable for digitizing text, achieving high accuracy is challenging and typically requires ongoing refinement.

Large Language Models (LLMs), such as the character-level ByT5 model, offer promising potential for OCR post-correction. Trained on extensive text data, they can understand and generate human-like language, and can therefore correct OCR errors more effectively, improving the overall accuracy of text extraction. Fine-tuning LLMs on OCR-specific tasks has shown that they can outperform traditional methods at error correction, suggesting that LLMs could significantly refine OCR outputs and enhance text coherence.

In this context, a researcher from the University of Twente recently explored the potential of LLMs for improving OCR post-correction. The study investigates a technique that leverages the language-understanding capabilities of modern LLMs to detect and correct mistakes in OCR output. Applying this approach to modern customer documents processed with the Tesseract OCR engine and to historical documents from the ICDAR dataset, the research evaluates fine-tuned character-level LLMs, such as ByT5, and generative models like Llama 7B.

The proposed approach fine-tunes LLMs specifically for OCR post-correction. ByT5, a character-level LLM, is fine-tuned on a dataset of OCR outputs paired with ground-truth text to strengthen its ability to correct character-level errors; Llama 7B, a general-purpose generative LLM, is included for comparison because of its large parameter count and advanced language understanding, and is used in its pre-trained state as a baseline. Pre-processing steps such as lowercasing and removing special characters standardize the input and can improve performance. Both the small and base versions of ByT5 are fine-tuned.

The evaluation compared the method against non-LLM post-OCR correction techniques, using an ensemble of sequence-to-sequence models as a baseline, with performance measured by Character Error Rate (CER) reduction along with precision, recall, and F1. The fine-tuned ByT5 base model with a context length of 50 characters achieved the best results on the custom dataset, reducing the CER by 5...
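As a concrete illustration, the sketch below fine-tunes ByT5 on a single (OCR output, ground truth) pair and then generates a correction, using Hugging Face transformers. The "correct:" prefix, model size, and hyperparameters are assumptions for illustration, not the study's exact setup.

```python
# Sketch of ByT5-based OCR post-correction: one supervised step, then greedy correction.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")   # byte/character-level tokenizer
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

ocr_text = "Tbe qvick brown f0x jumps ovcr the lazy dog."
ground_truth = "The quick brown fox jumps over the lazy dog."

# One fine-tuning step on a single OCR/ground-truth pair.
batch = tokenizer("correct: " + ocr_text, return_tensors="pt", truncation=True)
labels = tokenizer(ground_truth, return_tensors="pt", truncation=True).input_ids
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Inference: generate a corrected version of a new OCR line.
model.eval()
with torch.no_grad():
    inputs = tokenizer("correct: " + "Th1s 1s an OCR l1ne w1th errors.", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```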
-
Unlocking Self-Optimizing LLM Apps: Harnessing DSPy and SELF-DISCOVER

🔴 Large language models (LLMs) have revolutionized the field of artificial intelligence, unlocking unprecedented potential for natural language understanding and generation. With their vast knowledge and ability to engage in complex reasoning, LLMs have the power to transform industries and augment human intelligence. However, harnessing the full potential of these models in real-world applications remains a significant challenge.

One of the primary hurdles in developing robust LLM-based applications is the need for extensive prompt engineering. Crafting effective prompts that elicit the desired behavior from LLMs often requires a deep understanding of the model's capabilities, limitations, and idiosyncrasies. The process is time-consuming, iterative, and heavily reliant on human expertise. Moreover, as LLMs continue to evolve and new models emerge, prompts that work well for one model may not transfer seamlessly to another, leading to fragility and the need for constant adaptation.

To address these challenges, researchers have been exploring ways to make LLM-based applications more robust, efficient, and adaptable. Two promising approaches that have emerged in recent months are the DSPy framework and the SELF-DISCOVER technique.

DSPy, short for "Declarative Self-improving Language Programs in Python," is a framework that aims to move beyond fragile prompting and towards robust programming with foundation models. Developed by researchers at Stanford NLP, DSPy introduces concepts like signatures, modules, teleprompters, and compilers to enable modular architecture design and automated prompt optimization. By abstracting away the complexities of prompt engineering, DSPy lets developers focus on the high-level logic of their applications while the framework handles the low-level details of interfacing with LLMs.

SELF-DISCOVER, on the other hand, is a technique that enables LLMs to self-discover task-specific reasoning structures. Developed by researchers at DeepMind, it operates in two stages: first, it discovers a reasoning structure for a given task by composing atomic reasoning modules; second, it solves instances of the task using the discovered structure. By allowing LLMs to adapt their reasoning strategies to the unique requirements of each task, SELF-DISCOVER can boost performance and efficiency while producing interpretable, modular reasoning flows.

While DSPy and SELF-DISCOVER have shown promising results independently, the true potential lies in combining these approaches to create self-optimizing LLM applications (see the sketch after the link below).
Unlocking Self-Optimizing LLM Apps: Harnessing DSPy and SELF-DISCOVER
medium.com
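For a feel of how the two ideas could meet in practice, here is a minimal DSPy sketch: a signature that asks the model to first produce a task-specific reasoning structure (in the spirit of SELF-DISCOVER) and then an answer. The model name, the dspy.LM / dspy.configure calls, and the field names are assumptions; DSPy's API has shifted between versions, so treat this as illustrative rather than canonical.

```python
# Minimal DSPy sketch: declare WHAT a step should do as a signature and let the
# framework handle prompting. API details vary across DSPy versions.
import dspy

# Configure the underlying LM (newer DSPy releases; older ones use dspy.settings.configure).
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class AnswerWithReasoning(dspy.Signature):
    """Answer a question, first selecting a suitable reasoning strategy."""
    question = dspy.InputField()
    reasoning_structure = dspy.OutputField(desc="a task-specific plan, in the spirit of SELF-DISCOVER")
    answer = dspy.OutputField(desc="short final answer")

class SelfDiscoverQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.step = dspy.ChainOfThought(AnswerWithReasoning)

    def forward(self, question):
        return self.step(question=question)

qa = SelfDiscoverQA()
pred = qa(question="If a train leaves at 3pm and travels 120 km at 60 km/h, when does it arrive?")
print(pred.reasoning_structure)
print(pred.answer)
```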
-
Large Language Models for Knowledge Graph Completion

Class membership relations are critical components of knowledge graphs (KGs): they assign entities to classes and help KGs represent classification schemes that have significant effects on social policy and scientific consensus. Evaluating these relations is necessary but expensive, as it is performed manually. This paper evaluates the abilities of several large language models (LLMs) to assess class membership relations in a zero-shot manner. Article link: https://lnkd.in/gxH4pttf

🔹 To evaluate KG relations, knowledge engineers must translate natural language from domain experts into symbolic representations, then check whether the knowledge is properly reflected in the KG. LLMs can use natural language directly: given descriptions of an entity and a KG class, they predict whether the entity is an instance of the class (sketched after the link below).
🔹 The authors tested whether LLMs can do this effectively by studying whether they could detect missing or incorrect relations. They constructed evaluation datasets from the Wikidata and CaLiGraph KGs and evaluated seven LLMs, including OpenAI's GPT-4 and Meta's Llama-2.
🔹 LLMs and KGs were found to be well aligned: KGs can effectively address knowledge gaps in LLMs, and LLMs can do the same for KGs. The evaluated LLMs also displayed the ability to detect missing or incorrect KG relations. When the LLMs' classification errors were analysed, 40.9% were attributed to the KGs themselves, which had missing or incorrect relations, and a further 29.1% were due to insufficient data in entity descriptions.
🔹 LLMs have demonstrated good performance across many tasks, and now show they can assist knowledge engineers in KG evaluation and refinement as well. They still face challenges such as slow processing and costly API calls, which future efforts could address with locally deployed LLMs and sampling approaches.

📑 Allen, B. P., & Groth, P. T. (2024). Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models (Version 1). arXiv. DOI: 10.48550/ARXIV.2404.17000
📌 Ensuring the completeness and correctness of KGs has been a limiting factor in their widespread employment across applications. LLMs have previously been used to optimise KG construction, and now show their ability to perform KG refinement as well, easing the burden on knowledge engineers.
Evaluating Class Membership Relations in Knowledge Graphs using Large Language Models
arxiv.org
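A minimal sketch of the zero-shot setup described above: give an LLM the entity and class descriptions and ask for a YES/NO verdict. The prompt wording and model choice are assumptions, not the authors' exact protocol.

```python
# Zero-shot class-membership check with an LLM (illustrative prompt and model name).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def is_instance_of(entity: str, entity_desc: str, kg_class: str, class_desc: str) -> bool:
    prompt = (
        "You are checking a knowledge graph class membership relation.\n"
        f"Entity: {entity}\nEntity description: {entity_desc}\n"
        f"Class: {kg_class}\nClass description: {class_desc}\n"
        "Is the entity an instance of the class? Answer strictly YES or NO."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

# Example: a relation a KG might assert and an LLM can sanity-check.
print(is_instance_of(
    entity="Marie Curie",
    entity_desc="Polish-French physicist and chemist who conducted pioneering research on radioactivity.",
    kg_class="human",
    class_desc="individual member of the species Homo sapiens",
))
```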
-
Enriching Language Models with Knowledge Graphs for Powerful Question Answering 🌲

Retrieval-augmented generation (RAG) has emerged as a vital technique to enhance large language models (LLMs). By providing external context, RAG helps ground the LLM's generated text in real factual information rather than risking unsupported hallucinations. This context typically comes from retrieving relevant text segments from an indexed database using vector similarity search.

However, this traditional vector index lookup approach has limitations. Simple word vector similarities often cannot fully capture nuanced connections and relationships within complex real-world data. As such, the isolated text snippets tend to provide narrow, superficial context that restricts the LLM's understanding and reasoning.

Recent methods have explored using knowledge graphs rather than raw text segments as the contextual augmentation source in the RAG framework. Knowledge graphs provide structured representations of entities, their attributes, and labeled relationships between them. Constructed from source corpora, they encode a higher-level abstraction of key semantic concepts and dependencies. Feeding LLM text generation with carefully tailored subgraphs that contextualize the query has shown substantial improvements on tasks requiring deeper reasoning, explanation, and reduction of factually inaccurate hallucinations. The rich metadata encapsulated in graph form unlocks more powerful contextual connections than isolated text segments can offer.

This article analyzes two recent techniques that take knowledge graph-centered approaches to retrieval augmentation of language models: GraphRAG from Microsoft and G-Retriever from Xiaoxin He et al. (2024). It also proposes opportunities for chaining these methods together to further advance context-aware neural text generation.

The article explores how knowledge graphs encode relationship perspectives unavailable in basic text snippets, and why this provides superior augmentation. Details on the graph construction, retrieval, and neural integration components analyze how these techniques practically enhance LLMs with structured data. Finally, potential combinations of GraphRAG, G-Retriever, and iterative RAG are discussed as directions for unlocking even more sophisticated contextual reasoning. https://lnkd.in/eHxnvdaq
Enriching Language Models with Knowledge Graphs for Powerful Question Answering
medium.com
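To make the contrast with plain text snippets concrete, here is a toy sketch of graph-augmented retrieval: facts live as labeled edges, the neighborhood of the entities mentioned in the question is extracted, and the triples are serialized into the prompt. It is a simplified illustration of the idea behind GraphRAG and G-Retriever, not either system's actual pipeline; ask_llm() is a placeholder.

```python
# Toy knowledge-graph-augmented RAG: retrieve a query-relevant subgraph and serialize it.
import networkx as nx

G = nx.MultiDiGraph()
G.add_edge("Marie Curie", "University of Paris", relation="worked_at")
G.add_edge("Marie Curie", "Nobel Prize in Physics", relation="won")
G.add_edge("Nobel Prize in Physics", "Royal Swedish Academy of Sciences", relation="awarded_by")
G.add_edge("Pierre Curie", "Nobel Prize in Physics", relation="won")

def retrieve_subgraph(graph, question, radius=2):
    """Keep the neighborhood of every entity whose name appears in the question."""
    seeds = [n for n in graph.nodes if n.lower() in question.lower()]
    nodes = set()
    for s in seeds:
        nodes |= set(nx.ego_graph(graph.to_undirected(), s, radius=radius).nodes)
    return graph.subgraph(nodes)

def to_triples(subgraph):
    return "\n".join(f"({u}) -[{d['relation']}]-> ({v})" for u, v, d in subgraph.edges(data=True))

question = "Who awards the prize that Marie Curie won?"
context = to_triples(retrieve_subgraph(G, question))
prompt = f"Use these knowledge-graph facts to answer.\n{context}\n\nQuestion: {question}"
# answer = ask_llm(prompt)   # placeholder for whatever LLM client you use
print(prompt)
```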
-
Improving Retrieval Augmented Language Model with Self-Reasoning

The paper proposes a novel self-reasoning framework to enhance the performance, reliability, and traceability of Retrieval-Augmented Language Models (RALMs). The core idea is to use reasoning trajectories generated by the LLM itself. The framework comprises three main processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process. Evaluated on four public datasets, it outperforms existing models and achieves results comparable to GPT-4 with significantly fewer training samples.

Key findings:
🔹 Novel framework:
◾ Introduces a self-reasoning framework that improves the reliability and traceability of RALMs by leveraging reasoning trajectories generated by the LLM itself.
🔹 Three key processes:
◾ Relevance-Aware Process (RAP): judges the relevance of retrieved documents and provides reasons.
◾ Evidence-Aware Selective Process (EAP): selects key sentences from relevant documents and explains their relevance.
◾ Trajectory Analysis Process (TAP): consolidates the reasoning trajectories to generate a final answer.
🔹 Performance improvement:
◾ Outperformed existing state-of-the-art models in both accuracy and citation quality across four public datasets (NaturalQuestions, PopQA, ASQA, FEVER).
◾ Achieved performance comparable to GPT-4 while using significantly fewer training samples (only 2,000).
🔹 Robustness to noise:
◾ Maintained performance even with shuffled or noisy document inputs.
🔹 Efficient training:
◾ A gradual training method with stage-wise masking strategies improves performance with fewer resources and less training data.
🔹 Quality control:
◾ Automatic tools and filtering strategies ensure high-quality training samples, improving accuracy and reliability.
🔹 Human evaluation:
◾ Human judgments of citation quality aligned with the automatic evaluation, validating the framework's ability to generate accurate, well-supported answers.
🔹 Practical applicability:
◾ Provides a scalable, efficient solution that enhances the interpretability and traceability of LLM outputs without relying on external models or tools.

💠 The self-reasoning framework enhances the robustness, reliability, and traceability of RALMs. A minimal sketch of the three-stage flow follows below. Reference: https://lnkd.in/gjhGhkCC
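Here is a minimal sketch of how the three processes can be chained as prompts over a single LLM, with llm() standing in for any chat-completion call; the prompt wording is an illustrative assumption rather than the authors' trained instruction format.

```python
# Sketch of the three self-reasoning stages (RAP -> EAP -> TAP) chained as prompts.
from typing import Callable, List

def self_reasoning_answer(question: str, documents: List[str], llm: Callable[[str], str]) -> str:
    doc_block = "\n\n".join(f"[Doc {i+1}] {d}" for i, d in enumerate(documents))

    # 1. Relevance-Aware Process: judge which documents are relevant and why.
    relevance = llm(
        f"Question: {question}\n{doc_block}\n"
        "For each document, state whether it is relevant to the question and give a short reason."
    )

    # 2. Evidence-Aware Selective Process: pick key sentences and explain their usefulness.
    evidence = llm(
        f"Question: {question}\n{doc_block}\nRelevance notes:\n{relevance}\n"
        "From the relevant documents, quote the key sentences that support an answer, "
        "each with a one-line explanation and a citation like [Doc 2]."
    )

    # 3. Trajectory Analysis Process: consolidate the trajectory into a final, cited answer.
    return llm(
        f"Question: {question}\nReasoning trajectory:\n{relevance}\n{evidence}\n"
        "Using only this trajectory, write a concise final answer with [Doc i] citations."
    )
```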
-
The use of #RAG has expanded the possibilities of working with large text corpora. And, as you know, a well-posed question is half the answer. Knowledge graphs (#KG) compensate for the shortcomings of #LLMs, helping to avoid hallucinations and the loss of long context. They let a model operate on high-level context and achieve excellent results by structuring its chain of reasoning.
Enriching Language Models with Knowledge Graphs for Powerful Question Answering
medium.com
-
SentenceVAE: Supercharging Large Language Models with Sentence-Level Processing

📌 Large language models (LLMs) have revolutionized natural language processing, but their reliance on token-by-token inference significantly hampers processing speed and efficiency. The SentenceVAE approach introduces a paradigm shift, enabling LLMs to process language at the sentence level, dramatically accelerating inference while improving accuracy and reducing memory usage. This innovation addresses a critical bottleneck in LLM performance, potentially paving the way for more responsive and resource-efficient AI systems. Article link: https://lnkd.in/g-mM8gNf

🔹 Current LLMs rely on next-token prediction for inference, generating text one token at a time. While this approach has led to impressive language generation capabilities, it creates a significant bottleneck in processing speed. As models grow larger and more complex, this token-by-token method becomes increasingly inefficient, limiting the real-time responsiveness of AI systems and requiring substantial computational resources. This inefficiency is particularly problematic for applications that demand rapid interactions or processing of large volumes of text.

🔹 SentenceVAE revolutionizes LLM inference by introducing sentence-level processing. It uses a Sentence Encoder to compress entire sentences into single tokens and a Decoder to reconstruct them. Integrated into LLMs, this creates Sentence-level LLMs (SLLMs) that process language more efficiently, handling longer contexts with fewer tokens. This innovation significantly reduces computational demands while maintaining or improving accuracy. SentenceVAE dramatically accelerates inference speeds, reduces perplexity, and decreases memory overhead, potentially transforming LLM performance and efficiency.

🔹 SentenceVAE significantly enhances LLMs (125M to 1.3B parameters), offering up to 365% faster inference, 54% lower perplexity, and 91% reduced memory usage. These scalable benefits, combined with improved context processing and adherence to the Scaling Law, position SentenceVAE to potentially revolutionize LLM deployment across various applications.

🔹 SentenceVAE marks a significant advancement in LLM efficiency, but challenges persist. Future research should explore scaling to larger models, multilingual applications, and integration with other techniques. While promising for improving context understanding in LLMs, further investigation is needed to assess its impact on complex tasks and cross-domain scalability. Nonetheless, SentenceVAE represents a crucial step towards more efficient and capable language models.

An, H., Chen, Y., Qiao, X., Sun, Z., & Li, X. (2024). SentenceVAE: Faster, Longer and More Accurate Inference with Next-sentence Prediction for Large Language Models (Version 2). arXiv. DOI: 10.48550/ARXIV.2408.00655
SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context
arxiv.org
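A toy PyTorch sketch of the core idea: an encoder that compresses one sentence into a single embedding (a "sentence token") and a decoder that expands a sentence embedding back into words, so the backbone LLM only has to predict one vector per sentence. The dimensions, mean-pooling, and single-layer blocks are simplifications, not the paper's architecture.

```python
# Toy sentence-level encode/decode sketch in the spirit of SentenceVAE.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)

    def forward(self, token_ids):                     # (B, T) token ids of ONE sentence
        h = self.layer(self.embed(token_ids))         # (B, T, d)
        return h.mean(dim=1)                          # (B, d): one vector per sentence

class SentenceDecoder(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, sentence_vec, prev_tokens):     # reconstruct a sentence token-by-token
        memory = sentence_vec.unsqueeze(1)            # (B, 1, d): the single "sentence token"
        h = self.layer(self.embed(prev_tokens), memory)
        return self.out(h)                            # (B, T, vocab) next-token logits

# A document of S sentences becomes S sentence vectors; the backbone LLM predicts the
# next sentence vector, which the decoder expands into words.
enc, dec = SentenceEncoder(), SentenceDecoder()
sent_ids = torch.randint(0, 32000, (1, 12))           # one 12-token sentence
z = enc(sent_ids)                                      # one embedding standing in for the sentence
logits = dec(z, sent_ids[:, :-1])                      # teacher-forced reconstruction logits
print(z.shape, logits.shape)
```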
-
HyperCloning: Supercharging Large Language Models with Small Model Wisdom

📌 Training large language models (LLMs) from scratch has become increasingly expensive and time-consuming as models grow in size and complexity. This paper introduces HyperCloning, a novel technique that addresses this challenge by efficiently initializing larger language models using weights from smaller, pre-trained models. By expanding the hidden dimensions of transformer architectures while preserving functionality, HyperCloning offers a promising solution to accelerate LLM training and improve final model performance. Article Link: https://lnkd.in/g-3py6sZ

🔹 Training large language models from the ground up is extremely expensive and time-consuming, often necessitating millions of GPU hours. This reality limits experimentation and innovation, particularly for researchers with limited resources. While smaller models are less costly to train, they usually don't perform at the level of larger models, creating a difficult trade-off between cost and capability.

🔹 HyperCloning introduces a novel method to initialize larger language models using weights from smaller, pre-trained models. It expands the hidden dimensions of transformer architectures while preserving functionality, effectively transferring knowledge from the smaller model to the larger one. This approach enables faster convergence and better final accuracy, bridging the gap between the efficiency of small models and the performance of large ones.

🔹 Experiments with models ranging from 1.3B to 5.3B parameters demonstrated significant improvements using HyperCloning. The technique achieved 2-4x faster training convergence compared to random initialization, while also improving final model accuracy. Additionally, initializing with larger or more accurate base models led to even better performance, showcasing the method's scalability and effectiveness across different model sizes and architectures.

🔹 While HyperCloning shows promising results, some challenges remain. The method initially exhibits some catastrophic forgetting, though it's eventually overcome. Future research could focus on understanding and mitigating this effect. Additionally, exploring HyperCloning's applicability to even larger models and different architectures could further validate its potential to revolutionize LLM training, making it more accessible and sustainable for researchers and organizations with limited resources.

Samragh, M., Mirzadeh, I., Vahid, K. A., Faghri, F., Cho, M., Nabi, M., Naik, D., & Farajtabar, M. (2024). Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization. arXiv. DOI: 10.48550/ARXIV.2409.12903
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
arxiv.org
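The function-preserving expansion is easy to demonstrate on a single linear layer. In the sketch below, the small layer's weight is tiled and rescaled so that, when activations are duplicated into the doubled hidden dimension, the expanded layer reproduces the small layer's output exactly; the paper applies analogous rules across attention, normalization, and embedding layers. This is a simplified illustration of the idea, not the authors' code.

```python
# Function-preserving width expansion of one linear layer, in the spirit of HyperCloning.
import torch
import torch.nn as nn

def clone_linear(small: nn.Linear, expand: int = 2) -> nn.Linear:
    """Expand a Linear layer by `expand`x in both input and output width."""
    d_out, d_in = small.weight.shape
    big = nn.Linear(d_in * expand, d_out * expand, bias=small.bias is not None)
    with torch.no_grad():
        # Tile the weight block and divide by the expansion factor so that duplicated
        # inputs produce duplicated (identical) outputs: W_big @ [x; x] = [W x; W x].
        big.weight.copy_(small.weight.repeat(expand, expand) / expand)
        if small.bias is not None:
            big.bias.copy_(small.bias.repeat(expand))
    return big

# Quick functional check on random data.
small = nn.Linear(8, 8)
big = clone_linear(small, expand=2)
x = torch.randn(3, 8)
y_small = small(x)
y_big = big(torch.cat([x, x], dim=-1))                    # duplicated activations enter the big layer
print(torch.allclose(y_big[:, :8], y_small, atol=1e-6))   # True: the function is preserved
```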