Lately I've seen some criticism of LLM-as-a-judge evals. When I talk to devs struggling with this technique, 80% are asking the judge LLM the same question as the original prompt.
Avoid being part of that 80% by giving your LLM-as-a-judge an "unfair advantage": some additional context or capability that makes the eval task easier than the original generation task.
https://lnkd.in/ezv2Y6Yp
With the rise of LLMs, the Webflow team set an ambitious goal: let users modify their websites using natural language. To set up evals, they chose Gentrace.
With Gentrace, the Webflow team:
• Evaluates multimodal outputs (like website screenshots) using human and LLM-as-a-judge evals
• Tests at scale, running thousands of evals per day
• Allows over 25 stakeholders, including PMs and leadership, to contribute to evals
Congrats to Bryant Chou and the Webflow team for the successful launch of their AI Assistant in October! We can't wait to see how Webflow continues to reinvent web development with AI.
With the generative AI market expected to grow to $38.7 billion by 2030, enterprise teams are looking for tools to democratize LLM testing. We believe the future of LLM testing is UI-first and connected to your app, so PMs, engineering, and domain experts can work together in the same tool.
Learn more about our vision in Unite.AI's story covering our Series A: https://lnkd.in/eUT_ZTAC
What if your LLM testing system could automatically optimize your prompts? That's where we're headed with Experiments, our new feature helping developers speed up last-mile tuning. Here's how it works:
Unlike prompt playgrounds, Experiments provides a testing environment connected to your real application. From there, anything in your app (prompts, top-K, parameters) becomes a knob you can control from Gentrace.
To get started (a rough sketch follows the steps below):
1. Register your environment
2. Define your test interactions
3. Expose parameters to Gentrace
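To make the "knobs" idea concrete, here's a minimal, hypothetical sketch in Python. The class and function names are stand-ins for illustration, not the Gentrace SDK; see the docs guide below for the real setup.

```python
# Hypothetical sketch of the "knobs" pattern behind Experiments (not Gentrace's
# actual SDK API). The app exposes its tunable values as named parameters with
# defaults, and a connected testing environment overrides them per run.
from dataclasses import dataclass

@dataclass
class ExposedParams:
    # Illustrative knobs a testing environment could control.
    prompt_template: str = "Summarize the following support ticket:\n{ticket}"
    top_k: int = 5
    temperature: float = 0.2

def run_interaction(ticket: str, params: ExposedParams) -> str:
    """One test interaction: an end-to-end call through the real app code."""
    prompt = params.prompt_template.format(ticket=ticket)
    # ... retrieve params.top_k documents, call the model at params.temperature ...
    return f"(model output for prompt starting: {prompt[:40]}...)"

# A registered environment would invoke the interaction with overrides chosen in the UI:
print(run_interaction("Customer cannot reset their password.", ExposedParams(temperature=0.0)))
```

The design point is that the parameters live in your app with sensible defaults, so production behavior is unchanged unless a test run overrides them.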
Experiments is currently in beta, with prompt auto-optimization on the roadmap in 2025. Get started with our docs guide:
https://lnkd.in/gNzkdC8H
Big news today: Gentrace raised an $8M Series A led by Kojo Osei at Matrix.
We’re celebrating by launching Experiments, the first collaborative testing environment for LLM product development.
This year, we’ve helped customers like Quizlet, Webflow, and Multiverse ship incredible LLM products.
The one thing slowing everyone down? Evals.
You can’t get to production without them, but they’re too hard to build and maintain.
We’re changing that. Our customers grew their testing volume 40x after adopting Gentrace, thanks to an approach that connects a collaborative testing environment to your actual application code.
Now PMs and engineers can work together to build evals that actually work. It’s a massive improvement to testing that ultimately makes generative AI products more reliable for everyone.
This journey wouldn’t have been possible without our awesome angel investors: Yuhki Yamashita, Garrett Lord, Bryant Chou, Tuomas Artman, Martin Mao, David Cramer, Ben Sigelman, Steve Bartel, Cai GoGwilt, Linda Tong, Cristina Cordova, and many more.
Thank you to our customers and team for believing in what we’re building! Every day, we’re inspired to make generative AI apps better because of your support.
Gentrace is expanding our team.
We're looking for senior software engineers in NYC and SF who want to build tools to help the most advanced companies in the world make their generative AI systems reliable and predictable.
Learn more at
https://gentrace.ai/eng
Most engineers approach LLM-as-a-judge all wrong. The usual high-level metrics like hallucination or safety rarely tell you if your app actually works as intended.
Even with human evaluators, general metrics won’t help them judge your app’s performance in a useful way. Good evals are built on something specific to your app—a unique “unfair advantage” that gives the LLM clear criteria.
LLM-as-a-judge is a widely used approach where one LLM grades another LLM’s output. But without an unfair advantage, it can fall short on quality. Common LLM-as-a-judge problems:
• Circular reasoning: How can an LLM grade what it itself generated?
• Poor initial results lead teams back to vibes and manual grading.
Imagine an LLM app is tasked with writing emails. A bad eval would ask the LLM to rate itself on how well it followed the prompt—a circular question that adds little signal and won’t help improve the model.
Instead, give your LLM unfair advantages. Here’s how:
Add tailored asserts: specific criteria your model's output should meet. For instance, in our email example (a sketch of such a judge follows this list):
• No footer
• Directly asks a question
• Includes recipient’s email
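As one illustration (not Gentrace-specific), a tailored-assert judge can grade each criterion separately and return a structured verdict. This sketch uses the OpenAI Python SDK; the model name, assert wording, and JSON shape are assumptions for the example.

```python
# Minimal LLM-as-a-judge sketch with tailored asserts (OpenAI Python SDK).
# The model name, assert wording, and JSON schema here are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

ASSERTS = [
    "The email contains no footer or signature block.",
    "The email directly asks the recipient a question.",
    "The email includes the recipient's email address.",
]

def judge_email(draft: str) -> list:
    """Grade the draft against each assert; return a list of pass/fail verdicts."""
    criteria = "\n".join(f"- {a}" for a in ASSERTS)
    prompt = (
        "You are grading an email against specific criteria.\n"
        "Return JSON of the form {\"results\": [{\"criterion\": str, \"pass\": bool, \"reason\": str}]}.\n\n"
        f"Criteria:\n{criteria}\n\nEmail:\n{draft}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)["results"]

print(judge_email("Hi Sam, could you confirm [email protected] is still the best address?"))
```

Because each assert is pass/fail, you can track them individually instead of relying on one opaque overall score.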
Another approach is comparison to a reference output (also sketched below). You provide a high-quality reference email, so the LLM can check for:
• Missing or extra information
• Conciseness vs. verbosity
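Here's a sketch of the reference-comparison variant under the same assumptions (OpenAI Python SDK, illustrative model and prompt):

```python
# Sketch of a reference-comparison judge (same assumptions: OpenAI SDK, illustrative model/prompt).
from openai import OpenAI

client = OpenAI()

def judge_against_reference(candidate: str, reference: str) -> str:
    """Ask the judge to diff the candidate against a known-good reference email."""
    prompt = (
        "Compare the CANDIDATE email to the REFERENCE email.\n"
        "1. List information that is missing from the candidate.\n"
        "2. List information the candidate adds that the reference does not have.\n"
        "3. Say whether the candidate is more verbose than the reference.\n\n"
        f"REFERENCE:\n{reference}\n\nCANDIDATE:\n{candidate}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content

print(judge_against_reference(
    "Hi Sam, quick check-in on the numbers.",
    "Hi Sam, could you send the Q3 revenue numbers by Friday?",
))
```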
To build reliable AI evaluations, always ask: how can I create an unfair advantage for the LLM? These targeted methods ensure your LLM-as-a-judge evals give you actionable insights into the quality of your app.
At Gentrace, we’re making it easy for AI engineers to build their own high-quality evals. My unfair advantage blog post shares more examples for using LLM-as-a-judge effectively: https://lnkd.in/gavtDhCY
New release, featuring production evaluator graphs and local evaluations / local test data.
Production evaluator graphs
Production evaluators now automatically create graphs to show how performance is trending over time.
For example, you can create a "Safety" evaluator which uses LLM-as-a-judge to score whether an output is compliant with your AI safety policy.
Then, you can see how the average output "Safety" trends over time.
Local evaluations / local test data
Gentrace now makes it easier to define local evaluations and to use completely local data and datasets.
This makes Gentrace work better with existing unit testing frameworks and patterns. It also makes Gentrace incrementally adoptable into homegrown testing stacks.
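As a rough illustration of that pattern (a stand-in, not the Gentrace SDK), a local evaluation over local data can look like an ordinary pytest test:

```python
# Hypothetical local-evaluation sketch written as a plain pytest test.
# The evaluator, pipeline, and dataset are stand-ins, not Gentrace SDK calls.
import pytest

def exact_match_evaluator(output: str, expected: str) -> float:
    """A local evaluator: 1.0 on an exact (whitespace-insensitive) match, else 0.0."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def my_llm_pipeline(question: str) -> str:
    # Placeholder for the real application call under test.
    return {"What is 2 + 2?": "4", "Capital of France?": "Paris"}.get(question, "")

# Completely local dataset; could also be loaded from a local JSON or CSV file.
LOCAL_CASES = [
    {"input": "What is 2 + 2?", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]

@pytest.mark.parametrize("case", LOCAL_CASES)
def test_local_eval(case):
    score = exact_match_evaluator(my_llm_pipeline(case["input"]), case["expected"])
    assert score == 1.0
```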
More:
• User-specific view settings can be saved and overridden from a URL
• Filter test runs by their input values
• Added explicit compare button
• Pinecone v3 (Node) support
• o1 support
• Fixed 68 bugs
Quizlet uses generative AI to turn unstructured text into flashcards, syllabi, and other learning tools for students. With Gentrace, they increased testing 40x and reduced test duration to 1 minute.
Learn more: https://lnkd.in/gF_Zi7Cd
Faire built an AI agent that reviews PRs.
They use an AI evaluator in Gentrace to review the reviewer.
Learn more about how Faire systematically develops their AI features in their blog post: https://lnkd.in/gf_arpB9