Sayed Raheel Hussain’s Post

ML Engineer | AI Researcher | Generative AI & LLMs | Computer Vision & Data Science

🚀 NVIDIA launches Llama-3.1-Nemotron-70B! https://lnkd.in/efahSbKW

📊 As of Oct 1, 2024, it's topping the charts:
• Arena Hard: 85.0
• AlpacaEval 2 LC: 57.6
• MT-Bench: 8.98

🔑 Key points:
• Outperforms other models across multiple benchmarks
• Longer responses (avg. 2,199.8 characters)
• Consistent performance (narrow confidence interval)

𝗛𝗲𝗿𝗲 𝗶𝘀 𝘀𝗼𝗺𝗲 𝗶𝗻𝗳𝗼𝗿𝗺𝗮𝘁𝗶𝗼𝗻 𝗮𝗯𝗼𝘂𝘁 𝘁𝗵𝗲 𝗘𝘃𝗮𝗹 𝗺𝗲𝘁𝗿𝗶𝗰𝘀 𝘂𝘀𝗲𝗱

𝑨𝒓𝒆𝒏𝒂-𝑯𝒂𝒓𝒅 is a challenging AI benchmark built from real user queries on Chatbot Arena. It uses the BenchBuilder pipeline to select 500 high-quality, diverse prompts that test language models on complex, real-world tasks across various domains. The process involves:

* Question source: real user queries from Chatbot Arena (initially 200,000)
* Question selection:
   - The BenchBuilder pipeline filters and evaluates the queries
   - An AI judge (GPT-4-Turbo) scores each question on 7 key qualities: specificity, domain knowledge, complexity, problem-solving, creativity, technical accuracy, and real-world relevance (a toy sketch of this filtering step follows at the end of this post)
   - Topic modeling clusters similar queries
   - High-quality clusters are selected
   - 500 challenging prompts are sampled for the final benchmark

Evaluation uses an LLM-as-judge approach, comparing each model's outputs to a baseline model's answers on the same prompts (see the second sketch below). This method provides a more comprehensive, updatable, and cost-effective evaluation than previous benchmarks, better separating top models and aligning well with human judgments.

Arena-Hard paper: https://lnkd.in/eh_hK7NK
GitHub: https://lnkd.in/e9qBbd3E

#NVIDIA #AI #LLM #TechInnovation
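For the curious, here is a minimal Python sketch of the BenchBuilder-style filtering step described above. The prompt wording, the 6-of-7 threshold, and the use of the OpenAI SDK are my own assumptions for illustration; the actual pipeline lives in the Arena-Hard GitHub repo linked above.

```python
# Hypothetical sketch of the quality-scoring filter (NOT the real BenchBuilder code).
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

QUALITIES = [
    "specificity", "domain_knowledge", "complexity", "problem_solving",
    "creativity", "technical_accuracy", "real_world_relevance",
]

def score_query(query: str) -> dict:
    """Ask an LLM judge to mark each of the 7 qualities as present (1) or absent (0)."""
    prompt = (
        "For the user query below, output JSON mapping each quality to 1 if the "
        f"query exhibits it, else 0. Qualities: {', '.join(QUALITIES)}.\n\n"
        f"Query: {query}"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        response_format={"type": "json_object"},  # ask for strict JSON output
    )
    return json.loads(resp.choices[0].message.content)

def is_hard(query: str, min_qualities: int = 6) -> bool:
    """Keep a query only if it shows at least `min_qualities` of the 7 qualities."""
    scores = score_query(query)
    return sum(scores.get(q, 0) for q in QUALITIES) >= min_qualities
```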

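And a similarly hedged sketch of the LLM-as-judge evaluation: for each benchmark prompt, a judge model picks between the baseline's answer and the candidate's answer. The verdict labels and prompt template here are illustrative, not Arena-Hard's actual judge template.

```python
# Minimal LLM-as-judge sketch (illustrative prompt and labels).
from openai import OpenAI

client = OpenAI()

def judge_pair(prompt: str, baseline_answer: str, candidate_answer: str) -> str:
    """Return 'A' (baseline wins), 'B' (candidate wins), or 'tie'."""
    judge_prompt = (
        "Two assistants answered the same user prompt.\n\n"
        f"Prompt: {prompt}\n\n"
        f"Answer A (baseline): {baseline_answer}\n\n"
        f"Answer B (candidate): {candidate_answer}\n\n"
        "Which answer is better overall? Reply with exactly one of: A, B, tie."
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": judge_prompt}],
        temperature=0,
    )
    verdict = resp.choices[0].message.content.strip()
    return verdict if verdict in {"A", "B", "tie"} else "tie"

# The candidate's score is then the share of the 500 prompts where the judge
# prefers it over the baseline; judging both answer orders reduces position bias.
```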