Even when accurate context is retrieved, can LLMs extract the preferred answer when layers of conditions or constraints are applied across conversation turns? 🤔

🌟 Thrilled to share my recent work, RAD-Bench: Evaluating Large Language Models' Capabilities in Retrieval Augmented Dialogues, developed during my internship at MediaTek Research 聯發創新基地!

With RAG, SAG, and tool use becoming prevalent, it is crucial to evaluate how models handle progressively accumulated constraints and external information in context-rich scenarios. Existing benchmarks either assess LLMs' chat abilities in multi-turn dialogues or their use of retrieval for augmented responses in single-turn settings. To bridge this gap, we propose RAD-Bench, which to the best of our knowledge is the first benchmark to assess LLMs' ability to follow user instructions in multi-turn dialogues while effectively utilizing retrieved context.

The benchmark focuses on two key abilities: Retrieval Synthesis and Retrieval Reasoning. We built a pipeline that uses LLMs to generate, select, and synthesize synthetic questions and retrieved contexts, producing 89 high-quality multi-turn samples (267 turns in total). We then evaluated popular LLMs, including GPT-4, Llama, Gemma, Mistral, DeepSeek, and Breeze, using LLM-as-a-Judge with tailored prompts.

Our findings show that as conditions accumulate, models struggle more to identify key information from the context. Comparison with Chatbot Arena reveals that RAD-Bench effectively distinguishes LLM performance in context-rich, augmented dialogues: models with similar performance in regular multi-turn conversations may differ in retrieval-augmented scenarios.

For detailed insights, check out our paper on arXiv: https://lnkd.in/gKT8ACze

Heartfelt thanks to my mentor Feng-Ting Liao for guidance and discussions, Mu-Wei Hsieh for additional experiments and result visualization, and Mark for paper revisions.
Your contributions were instrumental in bringing this work to fruition! #LLM #MultiTurn #RAG #Benchmark #Research #Internship
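For readers curious what per-turn judging with accumulating constraints might look like in practice, here is a minimal sketch. It is not the paper's actual pipeline: `JUDGE_PROMPT`, `score_dialogue`, and `judge_fn` are all hypothetical names, and `judge_fn` stands in for whatever LLM API call produces the judge's rating.

```python
# Hypothetical sketch of LLM-as-a-Judge scoring for one multi-turn,
# retrieval-augmented sample. Constraints accumulate across turns,
# mirroring how conditions stack up in the benchmark description.

JUDGE_PROMPT = """You are an impartial judge. Given the retrieved context,
the accumulated user constraints, and the assistant's answer, rate how well
the answer follows the instructions and uses the context (1-10).

Context: {context}
Constraints so far: {constraints}
Answer: {answer}

Rating:"""


def score_dialogue(turns, judge_fn):
    """Score each turn of a multi-turn sample.

    turns: list of dicts with 'context', 'constraint', and 'answer' keys.
    judge_fn: callable mapping a judge prompt to a rating string, e.g. "8".
    Returns the list of integer per-turn ratings.
    """
    ratings, constraints = [], []
    for turn in turns:
        # Each new turn adds a constraint on top of all earlier ones.
        constraints.append(turn["constraint"])
        prompt = JUDGE_PROMPT.format(
            context=turn["context"],
            constraints="; ".join(constraints),
            answer=turn["answer"],
        )
        ratings.append(int(judge_fn(prompt)))
    return ratings
```

Keeping the judge stateless and re-sending the full constraint list each turn makes it easy to see where a model first drops a condition, since each rating is conditioned on everything the user has asked for so far.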