We are at NeurIPS 2024 in Vancouver! Hit up Ari Heljakka or Oguzhan (Ouz) Gencoglu for a chat about LLM evaluations, LLM judges, and how we bring state-of-the-art AI research to production at scale.
About us
Root Signals helps developers create, optimize, and embed the LLM evaluators they need to continuously monitor the behavior of LLM automations in production. With the Root Signals End-to-End Evaluation Platform, development teams deliver reliable, measurable, and auditable LLM automations at scale.
- Website
-
https://rootsignals.ai
- Industry
- Software Development
- Company size
- 2-10 employees
- Headquarters
- Helsinki
- Type
- Privately Held
- Founded
- 2023
- Specialties
Locations
-
Primary
Helsinki, FI
-
Dover, US
Employees at Root Signals
-
Otso Kallinen
Co-founder & Head of Product Design @ Root Signals | Measure and Control Your GenAI
-
Juho Ylikylä
Software @ Root Signals
-
Oguzhan (Ouz) Gencoglu
Co-founder & Head of AI @ Root Signals | Measure and Control Your GenAI
-
Janne Alasaarela
Let's roll up our sleeves and get the job done.
Updates
-
🌎 Excited to be at AWS re:Invent in Las Vegas! Drop Ari Heljakka and Root Signals a line to chat about the latest in #EvalOps and #LLMevaluation.
-
🚨 Webinar Reminder: our webinar "Top 10 Misconceptions About LLM Judges in Production" is coming up soon. Join Ari Heljakka, Oguzhan (Ouz) Gencoglu, and Data Science Salon to explore: • Key misconceptions about #LLMjudges • Best practices for #EvalOps • How to build reliable, scalable evaluation systems Register now: https://lnkd.in/dZNMnWDh #LLMjudges #EvalOps
-
🌐 AWS re:Invent Guide for #GenAI Developers If you're an engineer working with LLMs, Amazon Web Services (AWS) re:Invent 2024 in Las Vegas has a lot to offer. We've put together a guide to help you get the most out of the event. Here's what we cover: 🆕 What's new in 2024 for developers building with LLMs 📋 How to prepare for the event to maximize your time 🔎 A quick guide to understanding session abbreviations and tracks 🎯 Our curated list of sessions, from keynotes to hands-on bootcamps, that are especially valuable for GenAI engineers Whether you're looking to learn about the latest tools, dive deep into technical sessions, or connect with the GenAI developer community, this guide has everything you need to navigate #AWSreInvent effectively. Check out the guide here: https://lnkd.in/dzrQ6fFz
-
#LLM judges in production can transform AI evaluation, but they come with challenges around reliability, explainability, cost unpredictability, and maintainability when not implemented properly. 👨💻 Join Ari Heljakka, Oguzhan (Ouz) Gencoglu, and Data Science Salon to explore: • Key misconceptions about #LLMjudges • Best practices for #EvalOps • How to build reliable, scalable evaluation systems 📆 Don't miss this webinar! Learn more at: https://lnkd.in/dEusmY-5
-
🚀 Root Signals is at Slush in Helsinki, Nov 20-21! Meet Ari Heljakka and Oguzhan (Ouz) Gencoglu to explore the future of #EvalOps and the best practices for #LLMevaluation. See you at #Slush2024?!
-
It was a pleasure to host 40+ LLM experts and developers at our offices for the LLM Developer event organized by Tuomas Lounamaa, Symposium AI & Root Signals. The evening featured insightful talks from Aapo Tanskanen, Rasmus Toivanen, Markus S. and a demo from our Head of AI, Oguzhan (Ouz) Gencoglu, showcasing our Control, Evaluation & Observability Platform for GenAI applications. A heartfelt thank you to everyone who attended and made this gathering so engaging. We’re excited to continue building and learning with this incredible community. Stay tuned for more events focused on advancing LLM development!
-
Root Signals reposted this
#EvalsTuesdays Week 5 - Confirmation Bias in LLM-Judges

LLM-as-a-Judge is the gift that keeps on giving (both joys and headaches). This week, we're tackling yet another bias that sneaks into our LLM evaluations: Confirmation Bias.

Confirmation Bias: the tendency of LLM-Judges to favor responses that confirm their existing beliefs or the information presented in the prompt, while ignoring evidence to the contrary. In simpler terms, they might be agreeing with themselves a bit too much.

Why does this matter?
🔍 It can lead to skewed evaluations, where certain types of responses are consistently overvalued or undervalued.
🧠 LLM-Judges may overlook errors or hallucinations if the response "sounds right" based on prior context.
🌐 This bias can be especially problematic in domains requiring critical analysis or when evaluating for factual accuracy.

So, what's causing this? LLMs are trained on vast amounts of data, and they're great at picking up patterns. However, they also tend to reinforce patterns they've seen before. When an LLM-Judge evaluates a response, it might be more inclined to agree with content that aligns with those patterns, even if it's not the most accurate or helpful.

How do we fight back?
✅ Diversify your prompts: introduce variability in your evaluation prompts to prevent the model from getting too cozy with any one perspective.
✅ Encourage critical thinking: incorporate instructions that nudge the LLM-Judge to consider alternative viewpoints or to critically assess the response.
✅ Meta-evaluation: regularly test your LLM-Judges with known examples where the correct evaluation is counterintuitive, ensuring they're not just coasting on confirmation bias (see the sketch after this post).

At Root Signals, we obsess over these nuances so you don't have to. Our LLM-Judges are fine-tuned to spot not just the obvious issues but the subtle ones that can slip through the cracks. And most importantly, we provide a systematic and easy way to meta-evaluate, i.e. measure and tune your LLM-Judges.

Remember, in the world of LLMs, vigilance is key. Don't let your judges get complacent: challenge them, test them, and keep them sharp.

What's next? Maybe we'll dive into the rabbit hole of Chain-of-Thought prompting in LLM-Judges, maybe something else. Stay tuned!
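To make the meta-evaluation idea concrete, here is a minimal Python sketch, not Root Signals' actual implementation: it assumes you already have some judge() callable that returns a score for a (prompt, response) pair, and it measures how often that judge agrees with human-verified verdicts on a small golden set that deliberately includes counterintuitive cases.

```python
# Minimal meta-evaluation sketch (illustrative only, not Root Signals' implementation).
# Assumes you already have a judge() callable that scores a (prompt, response) pair.

from dataclasses import dataclass
from typing import Callable


@dataclass
class GoldenCase:
    prompt: str          # input shown to the application LLM
    response: str        # candidate answer being judged
    expected_pass: bool  # human-verified verdict, including counterintuitive cases


def meta_evaluate(judge: Callable[[str, str], float],
                  cases: list[GoldenCase],
                  threshold: float = 0.5) -> float:
    """Return the judge's agreement rate with human-verified verdicts."""
    hits = 0
    for case in cases:
        verdict = judge(case.prompt, case.response) >= threshold
        hits += int(verdict == case.expected_pass)
    return hits / len(cases)


# Example golden case: a fluent, "sounds right" answer that is factually wrong
# should FAIL; a confirmation-biased judge is likely to pass it anyway.
golden_set = [
    GoldenCase(
        prompt="When did the Berlin Wall fall?",
        response="The Berlin Wall fell in 1991, marking the end of the Cold War.",
        expected_pass=False,
    ),
]
```

Running meta_evaluate against a set like this on a schedule is one way to notice when a judge starts "coasting" on plausible-sounding answers.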
-
Root Signals reposted this
#EvalsTuesdays Week 4 - Verbosity Bias in LLM-Judges

Creating reliable evaluation metrics for #LLMs by using LLMs, i.e. LLM-as-a-Judge, is more than simply writing an evaluation prompt and calling an API. One reason is that LLM-Judges are full of biases, and verbosity bias is one of them.

Verbosity bias: LLM judges favor longer responses, even if they are not as clear, high-quality, or accurate as shorter alternatives. They are like lazy teachers who give high grades to essays simply because they are long.

Here is a quick example from Google's #Gemini ⬇. The first answer is not only more to-the-point but also more precise. It is simply more helpful. Yet, Gemini scores the rambling management-consultant answer higher.

When it comes to measuring your LLM applications, the devil is in the details. If your metrics are not reliable in the first place, what's the point? Verbosity bias affects all sorts of use cases where LLM-Judges are utilized. Judge scores need to be calibrated and normalized with respect to the length of the text being evaluated (a simple sketch of this idea follows below). But this normalization is not universal either; it is model dependent.

We worry about all these things at Root Signals so that our users don't need to. I would love to hear how you evaluate your GenAI applications.
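As an illustration of the length-calibration idea mentioned above, here is a rough Python sketch, not Root Signals' actual method: it fits a simple linear trend of raw judge score versus response length on a small calibration set, then subtracts the length-predicted component from new scores. The numbers are made up, and the fitted slope is specific to the judge model, which is why the calibration has to be redone per judge.

```python
# One simple way to length-calibrate judge scores (an illustrative sketch,
# not Root Signals' actual method): estimate how much of the raw score is
# explained by response length alone on a calibration set, then remove it.

import numpy as np


def fit_length_bias(lengths: list[int], raw_scores: list[float]) -> tuple[float, float]:
    """Fit score ~= slope * length + intercept on calibration data.

    The slope captures how strongly this particular judge model rewards
    length; it must be re-fit per judge model, since the bias is model-dependent.
    """
    slope, intercept = np.polyfit(lengths, raw_scores, deg=1)
    return slope, intercept


def calibrated_score(raw_score: float, length: int, slope: float) -> float:
    """Remove the length-predicted component from a raw judge score."""
    return raw_score - slope * length


# Usage: scores from the same judge on answers of varying length (made-up numbers).
lengths = [40, 120, 300, 650]     # response lengths in tokens (calibration set)
raw = [0.62, 0.70, 0.81, 0.90]    # raw judge scores for those responses
slope, _ = fit_length_bias(lengths, raw)
print(calibrated_score(0.90, 650, slope))  # discounts the long, rambling answer
```

A linear fit is the crudest possible correction; the point is only that the adjustment depends on measured, model-specific behavior rather than a universal constant.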
-
Root Signals reposted this
Join us at #DSSSF for a session with Ari Heljakka, Founder and CEO of Root Signals, who will talk about EvalOps: Mastering the Game of LLM Judges. The session will dive into the innovative use of #LLMs as "judges" to oversee and refine the outputs of AI pipelines. With a focus on maintaining alignment with human norms and organizational policies, Ari's talk will explore the complexities and challenges of implementing these judge models effectively. This talk is essential for professionals involved in AI development and management who want to enhance the reliability and accountability of AI systems. In-person only on Nov 7 at Google HQ in SF. [Link in comments] #EvalOps #machinelearning #genAI