Excited to share insights on a groundbreaking paper titled "#Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources" by Alisia Lupidi and colleagues. They introduced Source2Synth, a method that helps large language models (#LLMs) learn better by creating synthetic data based on real-world sources like Wikipedia articles and web tables. This approach improves AI performance on complex tasks without needing expensive human-made data.

Key points:
🚀 Creating Synthetic Data from Real Sources: using actual data to make artificial examples that are realistic and accurate.
🧠 Including Step-by-Step Reasoning: the synthetic data includes detailed reasoning steps, helping AI models learn to solve problems more effectively.
🛠️ Ensuring High-Quality Data: filtering out low-quality examples to ensure the AI learns from the best possible data.
📈 Significant Performance Improvements: this method led to a 22.57% improvement in multi-hop question answering and a 25.51% boost in answering questions using tables.

Read more about it here: 👉 https://lnkd.in/graYZifT

This work is a big step forward in making AI models smarter and more efficient without relying heavily on human annotations. Congratulations to the team for this remarkable achievement!

#AI #MachineLearning #ArtificialIntelligence #DataScience #Innovation
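The recipe lends itself to a compact illustration. Below is a minimal, hedged sketch of a Source2Synth-style loop (not the authors' code): ground generation in a real document, ask for step-by-step reasoning, then keep only examples that pass a quality filter. The llm() helper and the prompts are placeholder assumptions.

```python
# Hedged sketch of a Source2Synth-style loop: ground generation in a real
# document, ask for step-by-step reasoning, then filter low-quality examples.
# `llm` stands in for whatever chat-completion call you use (assumption).

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion call here")

def make_example(source_text: str) -> dict:
    question = llm(f"Write a question answerable only from this source:\n{source_text}")
    reasoning = llm(f"Source:\n{source_text}\nQuestion: {question}\n"
                    "Answer with explicit step-by-step reasoning, then a final answer.")
    return {"source": source_text, "question": question, "reasoning_and_answer": reasoning}

def passes_filter(example: dict) -> bool:
    verdict = llm("Rate 1-5 how faithful and answerable this example is:\n"
                  f"{example['question']}\n{example['reasoning_and_answer']}")
    return verdict.strip().startswith(("4", "5"))  # keep only high-scoring examples

def build_dataset(sources: list[str]) -> list[dict]:
    examples = (make_example(s) for s in sources)
    return [ex for ex in examples if passes_filter(ex)]
```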
-
Table-based reasoning with large language models (LLMs) is a promising direction for tackling many table understanding tasks, such as table-based question answering and fact verification. Compared with generic reasoning, table-based reasoning requires extracting underlying semantics from both free-form questions and semi-structured tabular data. Chain-of-Thought and similar approaches incorporate the reasoning chain in the form of textual context, but it remains an open question how to effectively leverage tabular data in the reasoning chain. We propose the Chain-of-Table framework, where tabular data is explicitly used in the reasoning chain as a proxy for intermediate thoughts. Specifically, we guide LLMs using in-context learning to iteratively generate operations and update the table to represent a tabular reasoning chain. The LLM can therefore dynamically plan the next operation based on the results of the previous ones. This continuous evolution of the table forms a chain, showing the reasoning process for a given tabular problem. The chain carries structured information about the intermediate results, enabling more accurate and reliable predictions. Chain-of-Table achieves new state-of-the-art performance on the #WikiTQ, #FeTaQA, and #TabFact benchmarks across multiple #LLM choices. #CoT #GenAI #GoogleResearch
Chain-of-Table: Evolves Tables in the LLM Reasoning Chain for Table Understanding
research.google
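To make the abstract concrete, here is a loose sketch of the plan-then-transform loop it describes: the model proposes one table operation at a time, the table is updated, and the evolving table is fed back until the model can answer. The operation set, the pandas representation, and the llm() helper are illustrative assumptions, not the paper's implementation.

```python
# Loose sketch of a Chain-of-Table-style loop: the LLM plans one table
# operation at a time, the table is updated, and the updated table is fed
# back in until the model decides it can answer. The operation set and
# `llm` are illustrative assumptions.
import pandas as pd

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM call")

OPS = {
    "select_columns": lambda df, arg: df[[c.strip() for c in arg.split(",")]],
    "filter_rows":    lambda df, arg: df.query(arg),
    "sort_by":        lambda df, arg: df.sort_values(arg.strip()),
}

def chain_of_table(df: pd.DataFrame, question: str, max_steps: int = 5) -> str:
    for _ in range(max_steps):
        plan = llm(f"Table:\n{df.to_markdown()}\nQuestion: {question}\n"
                   "Reply 'ANSWER: <answer>', or '<op> <argument>' choosing an op from "
                   f"{list(OPS)} to transform the table.")
        if plan.startswith("ANSWER:"):
            return plan.removeprefix("ANSWER:").strip()
        op, _, arg = plan.partition(" ")
        df = OPS[op](df, arg)   # the evolving table carries the intermediate thought
    return llm(f"Table:\n{df.to_markdown()}\nQuestion: {question}\nAnswer:")
```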
-
Getting fast and cheap quality assessment of your SFT data has never been easier. Use AI to help make better AI - the virtuous LLM performance improvement cycle.
Replacing human QA (Quality Assessment) processes for SFT data assessment with a SOTA model — a…
medium.com
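One way to picture this "AI grading AI" loop: score each SFT (prompt, response) pair with a strong model and drop the low scorers. A minimal sketch using the OpenAI Python client as one possible backend; the model name, rubric wording, and score cutoff are assumptions.

```python
# Minimal sketch: grade each SFT (prompt, response) pair with a strong model
# and keep only high scorers. Model name, rubric, and cutoff are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = ("Score the RESPONSE to the PROMPT from 1 (unusable) to 5 (excellent) "
          "for correctness, helpfulness, and formatting. Reply with the digit only.")

def score_pair(prompt: str, response: str, model: str = "gpt-4o-mini") -> int:
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": RUBRIC},
                  {"role": "user", "content": f"PROMPT:\n{prompt}\n\nRESPONSE:\n{response}"}],
    )
    return int(out.choices[0].message.content.strip()[0])

def filter_sft(pairs: list[tuple[str, str]], min_score: int = 4) -> list[tuple[str, str]]:
    return [(p, r) for p, r in pairs if score_pair(p, r) >= min_score]
```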
-
Getting good training data is critical for model quality. This is by far one of the best papers I have read this year on generating high-quality training data - it's like DPO bypassing the need for RLHF. We know that MCTS (Monte Carlo Tree Search) was a big reason AlphaZero could find solutions. Researchers are now bringing those ideas to LLM fine-tuning, as described in the excellent paper "AlphaMath Almost Zero: Process Supervision without Process".

Recent advancements in large language models (LLMs) have substantially enhanced their mathematical reasoning abilities. However, these models still struggle with complex problems that require multiple reasoning steps, frequently leading to logical or numerical errors. While numerical mistakes can largely be addressed by integrating a code interpreter, identifying logical errors within intermediate steps is more challenging. Moreover, manually annotating these steps for training is not only expensive but also demands specialized expertise.

In this paper, the authors introduce an approach that eliminates the need for manual annotation by leveraging the Monte Carlo Tree Search (MCTS) framework to generate both the process supervision and the evaluation signals automatically. Essentially, when an LLM is well pre-trained, only the mathematical questions and their final answers are required to generate the training data, without requiring the solutions themselves. They then train a step-level value model designed to improve the LLM's inference process in mathematical domains. Their experiments indicate that using solutions automatically generated by LLMs enhanced with MCTS significantly improves the model's proficiency on intricate mathematical reasoning tasks.

Paper: https://lnkd.in/gqdnVNKG

#llms #genai #generatieveai #capgemini #llm #opensourceai #capgeminiindia #ai #artificialintelligence #software #leaderboard #benchmark #benchmarking
2405.03553
arxiv.org
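A drastically simplified stand-in for the MCTS machinery (not the paper's implementation) conveys the core idea: estimate a value for each partial solution by sampling continuations and checking how often they reach the known final answer, so only questions and gold answers are needed. The sample_continuation and extract_answer helpers are assumptions.

```python
# Drastically simplified stand-in for MCTS-style process supervision: the
# value of a partial solution is the fraction of sampled continuations that
# reach the known final answer. Only questions and gold answers are needed;
# `sample_continuation` and `extract_answer` are assumed helpers.

def sample_continuation(question: str, partial_steps: list[str]) -> list[str]:
    raise NotImplementedError("sample remaining steps from your policy LLM")

def extract_answer(steps: list[str]) -> str:
    raise NotImplementedError("parse the final answer from the solution steps")

def step_values(question: str, steps: list[str], gold: str, n_rollouts: int = 8) -> list[float]:
    """Value of prefix steps[:k] = fraction of rollouts from that prefix hitting `gold`."""
    values = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(
            extract_answer(prefix + sample_continuation(question, prefix)) == gold
            for _ in range(n_rollouts)
        )
        values.append(hits / n_rollouts)
    return values  # (steps, values) pairs can then train a step-level value model
```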
-
🚀 𝗘𝘅𝗽𝗹𝗼𝗿𝗶𝗻𝗴 𝗥𝗲𝘁𝗿𝗶𝗲𝘃𝗮𝗹-𝗔𝘂𝗴𝗺𝗲𝗻𝘁𝗲𝗱 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝗼𝗻 (𝗥𝗔𝗚)🤖

In the world of large language models (LLMs), retrieval-augmented generation (RAG) offers a game-changing approach by enhancing text generation with external information, without the need for time-consuming fine-tuning. 🧠💡

🔍 𝗥𝗔𝗚 𝗔𝗿𝗰𝗵𝗶𝘁𝗲𝗰𝘁𝘂𝗿𝗲:
1️⃣ Data embedding: the data is first embedded into vectors and stored in a vector database (a one-time process).
2️⃣ Query embedding: user queries are embedded using the same model.
3️⃣ Similarity matching: the query's embedded vector is matched against the closest documents in the database.
4️⃣ Response generation: the query plus the retrieved documents are fed to the LLM to generate a rich, context-aware response.

💡 Key benefits:
✅ No fine-tuning needed: save computational resources by using pre-trained models.
✅ Efficient data handling: quickly adapt to new data by updating the embeddings.

⚠️ Challenges:
❌ Limited to certain tasks: works best for question answering but struggles with tasks requiring full document context (e.g., summarization).
❌ Matching limitations: similarity-based matching may not fully capture the depth of a query or document.

🔑 In a nutshell: RAG is a powerful tool for augmenting LLMs with external knowledge in a resource-efficient way. It's perfect for lookup-based tasks but may not be ideal for more complex tasks requiring a deeper understanding of documents. 📚

#ai #machinelearning #rag #languagemodels #datascience #aitechnology #artificialintelligence #naturallanguageprocessing #techinnovation #deeplearning #data #ml #aitrends #knowledgegraph #techcommunity #llm #neuralnetworks
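A minimal sketch of the query path in steps 2-4 above, assuming a sentence-transformers embedding model and a plain in-memory vector store; the model name and the llm callable are assumptions.

```python
# Minimal RAG query path: embed the query, find the closest stored chunks by
# cosine similarity, and let an LLM answer from them. The embedding model
# name and the `llm` callable are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence-embedding model works

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 3) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vecs @ q                           # cosine similarity on normalized vectors
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def answer(query: str, chunks: list[str], chunk_vecs: np.ndarray, llm) -> str:
    context = "\n\n".join(retrieve(query, chunks, chunk_vecs))
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```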
-
There's been a lot of buzz around the word 'fine-tuning'. What exactly is it? Let me try to make it easier for you.

Fine-tuning large language models (LLMs) involves retraining a pre-existing model on a smaller, domain-specific dataset. It matters because it saves computational resources and time compared to training from scratch. Basically, it is used to improve model performance and overcome the challenges faced by LLMs.

What are these challenges?
- Knowledge cutoff: the model's knowledge is limited to the data it was trained on.
- Hallucination: generating incorrect or misleading information.
- Black-box model: no clear interpretation of how the model arrives at its outputs.

The common solutions to these problems are:
1) Prompt engineering: providing examples in the form of prompts to guide the model's generation process.
2) Fine-tuning: adjusting the parameters of a pre-trained LLM on data for a specific task or domain, enhancing accuracy and relevance for targeted applications.
3) RAG (Retrieval-Augmented Generation): retrieving relevant knowledge from a knowledge base, resulting in more informed and contextually rich outputs.

Choosing among the three depends on the specific requirements and objectives of the application. Fine-tuning is particularly effective for modifying the behavior, writing style, or domain-specific knowledge of a model, especially when abundant labeled training data specific to the domain is available, enabling a more tailored and precise adjustment of the model's behavior.

For example, imagine you have a pre-trained LLM designed for general language understanding, but you want to use it to generate product descriptions for an online marketplace. By fine-tuning the model with data specific to product descriptions, such as features, specifications, and customer reviews, you can enhance its ability to generate accurate and compelling descriptions tailored to the products on your platform.

Hope this short article helped you! Stay tuned for more.
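For intuition, here is a bare-bones sketch of what that supervised fine-tuning step can look like on such domain text. The base model, toy data, and hyperparameters are illustrative assumptions, not a production recipe.

```python
# Bare-bones sketch of supervised fine-tuning on domain text (here, product
# descriptions). Model name, data, and hyperparameters are illustrative
# assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # stand-in for your base LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

texts = ["Product: ergonomic chair. Description: ...",   # your domain-specific examples
         "Product: 27-inch monitor. Description: ..."]

model.train()
for epoch in range(3):
    for text in texts:                                # one example at a time for clarity
        batch = tok(text, return_tensors="pt", truncation=True, max_length=512)
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```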
-
A simple Retrieval-Augmented Generation (RAG) workflow, illustrating the process of transforming structured and unstructured data into responses using text embedding models and large language models:

Structured and Unstructured Data - the process starts with collecting data from various sources.
Chunks - the data is divided into smaller, manageable pieces.
Vector DB (Embeddings) - these chunks are converted into numerical vectors (embeddings) by a text embedding model and stored in a vector database.
Retrieved Chunks - relevant chunks are retrieved from the vector DB based on a query.
Response Generation - a large language model processes these chunks to generate a coherent response.

This workflow combines data embedding and language models to improve the relevance and accuracy of information retrieval and response generation.
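A small sketch of the indexing half of this workflow (chunking, embedding, storing vectors), assuming a sentence-transformers model and a plain NumPy array in place of a real vector database.

```python
# Sketch of the indexing half of a RAG workflow: split raw text into chunks,
# embed them, and keep the vectors for later retrieval. A plain NumPy array
# stands in for a real vector database (assumption).
import numpy as np
from sentence_transformers import SentenceTransformer

def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")    # illustrative model choice

def build_index(documents: list[str]):
    chunks = [c for doc in documents for c in chunk_text(doc)]
    vectors = embedder.encode(chunks, normalize_embeddings=True)  # (n_chunks, dim)
    return chunks, np.asarray(vectors)                # query these with cosine similarity
```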
-
𝗜𝘀 𝗥𝗔𝗚 𝗗𝗲𝗮𝗱? 𝗘𝘅𝗽𝗹𝗼𝗿𝗶𝗻𝗴 𝘁𝗵𝗲 𝗘𝘃𝗼𝗹𝘂𝘁𝗶𝗼𝗻 𝗼𝗳 𝗟𝗮𝗿𝗴𝗲 𝗟𝗮𝗻𝗴𝘂𝗮𝗴𝗲 𝗠𝗼𝗱𝗲𝗹𝘀

𝗛𝗶 𝗲𝘃𝗲𝗿𝘆𝗼𝗻𝗲! With advancements in large language models (LLMs) like Gemini, many wonder if Retrieval-Augmented Generation (RAG) is still necessary. RAG remains crucial for handling large datasets, real-time updates, and cost-effective applications, while long context models excel in single-document analysis and summarization.

𝗞𝗲𝘆 𝗣𝗼𝗶𝗻𝘁𝘀:
𝗥𝗔𝗚: Adds relevant info to LLM prompts from external sources, great for private data and unseen topics.
𝗦𝗽𝗲𝗲𝗱 & 𝗔𝗰𝗰𝘂𝗿𝗮𝗰𝘆: RAG is fast, accurate, and reduces noise by selectively including relevant information.
𝗟𝗼𝗻𝗴 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗠𝗼𝗱𝗲𝗹𝘀: Can process up to 2 million tokens for comprehensive text analysis but are slower and costlier.

𝗖𝗼𝗻𝗰𝗹𝘂𝘀𝗶𝗼𝗻: Both RAG and long context models have their unique strengths. RAG is ideal for real-time data integration, while long context models are best for in-depth document analysis.

𝗙𝗼𝗿 𝗺𝗼𝗿𝗲 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻, 𝗳𝗼𝗹𝗹𝗼𝘄 𝗺𝗲 𝗼𝗻 𝗛𝗮𝘀𝗵𝗻𝗼𝗱𝗲 𝗮𝗻𝗱 𝗿𝗲𝗮𝗱 𝗺𝘆 𝗯𝗹𝗼𝗴. 𝗜’𝗺 𝘀𝗵𝗮𝗿𝗶𝗻𝗴 𝗱𝗮𝗶𝗹𝘆 𝘂𝗽𝗱𝗮𝘁𝗲𝘀 𝗼𝗻 𝗴𝗲𝗻𝗲𝗿𝗮𝘁𝗶𝘃𝗲 𝗔𝗜!

#AI #GenerativeAI #MachineLearning #RAG #LLM #DataScience #TechTrends #AIResearch #FutureOfAI #TechInnovation #Hashnode #FollowMe
Is Retrieval-Augmented Generation (RAG) Dead? Exploring RAG vs. Long C
abdullahfarooq.hashnode.dev
-
An interesting piece of work on using LLMs for text classification, to appear at LREC-COLING 2024 next week. Prompting can give you a start (say, a Solution 0), but it is not comparable to fine-tuning, even for "simple" problems like text classification. Of course, there are other ways LLMs can benefit text classification (e.g., generating synthetic data to fine-tune on - learning with synthetic examples and testing on real examples, not the other way round!), and the paper does not deal with that. So far, I have found that few-shot learning approaches like SetFit (and, more recently, FastFit) do better than in-context learning, based on what we have been working on, even when synthetic data was used for the few-shot examples. Most of this kind of work is done on English datasets, so there is scope for new insights there in the future. https://lnkd.in/dwknf5Hd #nlproc #llms #textclassification #prompting
Language Models for Text Classification: Is In-Context Learning Enough?
arxiv.org
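For reference, a SetFit-flavoured few-shot baseline can be sketched as: embed a handful of labeled examples with a sentence encoder and fit a lightweight classifier on top (this omits SetFit's contrastive fine-tuning of the encoder, and does not use the SetFit library itself). The model name and toy data are illustrative assumptions.

```python
# SetFit-flavoured sketch (not the SetFit library): embed a handful of labeled
# examples with a sentence encoder and fit a lightweight classifier on top.
# Model name and the toy data are illustrative assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_texts = ["great battery life", "screen cracked on day one",
               "fast shipping, works as advertised", "stopped charging after a week"]
train_labels = [1, 0, 1, 0]                           # 1 = positive, 0 = negative

clf = LogisticRegression().fit(encoder.encode(train_texts), train_labels)

def classify(texts: list[str]) -> list[int]:
    return clf.predict(encoder.encode(texts)).tolist()

print(classify(["battery died quickly", "love it"]))
```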
-
The importance of experimental design emerges here. In the original data science and ML curriculum of the 1970s this was a required course, one lost in the wave of countless Python routines.
In our new paper, we introduce a method for "Better Synthetic Data by Retrieving and Transforming Existing Datasets": https://lnkd.in/g-TyvcdT

Obtaining task-specific training data is hard! Generating synthetic data with LLMs can be a solution, but this data is often low quality or lacking in diversity. We think you should *transform existing datasets* instead of synthetically generating one from scratch.

We introduce DataTune, which, for a given task, autonomously retrieves relevant, publicly available datasets from HuggingFace and transforms them to meet the specific needs of the target task. This effectively aligns the retrieved dataset with the task requirements.

For example, for "code description generation", our system retrieves a dataset of code. The planning component predicts the need for a missing "description" field in this data! Our system then generates descriptions from the code, thus aligning this previously-unusable dataset.

We tested DataTune on a diverse set of BIG-Bench tasks by finetuning language models (like Mistral 7B). None of these tasks were perfectly captured in existing training datasets, but transforming existing datasets yielded a 49% improvement over a few-shot prompting baseline!

This was the excellent work of Ritu Gala and Saumya G., with advice from Vijay Viswanathan, Sherry Tongshuang Wu, and me. The code is available in the prompt2model toolkit if you'd like to try it out: https://lnkd.in/gev6iXME
Better Synthetic Data by Retrieving and Transforming Existing Datasets
arxiv.org
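A hedged sketch of the transform step for the code-description example above (not the DataTune / prompt2model code): load an existing public dataset and use an LLM to add the missing field. The dataset id, field names, and llm() helper are illustrative assumptions.

```python
# Hedged sketch of a DataTune-style transform step: take an existing public
# dataset and use an LLM to add the field the target task needs. The dataset
# id, field names, and `llm` helper are illustrative assumptions.
from datasets import load_dataset

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your chat-completion call")

def add_description(example: dict) -> dict:
    example["description"] = llm(
        "Write a one-sentence natural-language description of this code:\n"
        + example["code"]
    )
    return example

code_ds = load_dataset("some-public/code-dataset", split="train[:100]")  # hypothetical id
aligned_ds = code_ds.map(add_description)   # now usable for code-description generation
```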
-
An Introduction to Prompting for LLMs (Large Language Models)

How do we communicate effectively with LLMs?

LLMs have become ubiquitous, with new models being released almost daily. They've also been made more accessible to the general public, thanks to a thriving open-source community that has played a crucial role in reducing memory requirements and developing efficient fine-tuning methods for LLMs, even with limited compute resources.

One of the most exciting use cases for LLMs is their remarkable ability to excel at tasks they were not explicitly trained for, using just a task description and, optionally, a few examples. You can now get a capable LLM to generate a story in the style of your favorite author, summarize long emails into concise ones, and develop innovative marketing campaigns by describing your task to the model without needing to fine-tune it.

But how do you best communicate your requirements to the LLM? This is where prompting comes in.

An Introduction to Prompting for LLMs | by Anand Subramanian | Towards Data Science - https://lnkd.in/gZgRzUXs
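As a tiny illustration of the pattern, a prompt can be just a task description plus a few examples assembled into one string; the example emails, summaries, and the llm() helper below are assumptions.

```python
# Tiny illustration of prompting: a task description plus a few examples,
# assembled into a single prompt string. The examples and the `llm` helper
# are illustrative assumptions.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion call")

task = "Summarize the email in one short sentence."
examples = [
    ("Hi team, the release slipped to Friday because QA found a login bug.",
     "Release moved to Friday due to a login bug."),
    ("Reminder: benefits enrollment closes at 5pm tomorrow.",
     "Benefits enrollment closes tomorrow at 5pm."),
]

def build_prompt(new_email: str) -> str:
    shots = "\n\n".join(f"Email: {e}\nSummary: {s}" for e, s in examples)
    return f"{task}\n\n{shots}\n\nEmail: {new_email}\nSummary:"

print(llm(build_prompt("The offsite is cancelled; budget review moved to Monday 10am.")))
```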