Gentrace

About us

Test generative AI across teams. Automate evals for reliable LLM products and agents.

Industry: Software Development
Company size: 2-10 employees
Type: Privately Held
Founded: 2020
Specialties: AI, Infrastructure, Generative AI, LLMops, LLM, Analytics, Testing, DevOps Infrastructure, Monitoring, and Evaluation

Updates

  • Gentrace reposted this

    Doug Safreno

    Co-founder, CEO at Gentrace

    Lately I've seen some criticism of LLM-as-a-judge evals. When I talk to devs struggling with this technique, 80% are asking the judge the same question as the original prompt. Avoid being part of that 80% by giving your LLM-as-a-judge an "unfair advantage": some additional context or capability that makes the eval task easier than the original generation task. https://lnkd.in/ezv2Y6Yp

    Unfair advantages - a framework for building LLM-as-a-judge evaluations that reliably work
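
    As an illustration of the post above, here is a minimal Python sketch of one way to give a judge an unfair advantage: hand it a trusted reference answer, so its grading task is easier than the original generation task. This is a generic pattern, not Gentrace's implementation; call_llm is a hypothetical placeholder for whatever model client you use.

    # Minimal sketch: an LLM-as-a-judge given an "unfair advantage" -- a trusted
    # reference answer -- instead of being re-asked the original question.
    # `call_llm(prompt) -> str` is a hypothetical placeholder, not a real SDK call.
    import json

    JUDGE_PROMPT = """You are grading a generated answer against a trusted reference answer.

    Reference answer:
    {reference}

    Generated answer:
    {candidate}

    List facts missing from or added to the generated answer relative to the reference,
    then give a score from 1 (very different) to 5 (equivalent).
    Respond as JSON: {{"missing": [...], "added": [...], "score": <int>}}"""

    def judge_against_reference(candidate: str, reference: str, call_llm) -> dict:
        """Grade `candidate` using `reference` as the judge's unfair advantage."""
        prompt = JUDGE_PROMPT.format(reference=reference, candidate=candidate)
        return json.loads(call_llm(prompt))  # assumes the judge returns valid JSON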

  • With the rise of LLMs, the Webflow team set an ambitious goal: use natural language to make modifications to websites. To set up evals, they chose Gentrace. With Gentrace, the Webflow team:
    • Evaluates multimodal outputs (like website screenshots) using human and LLM-as-a-judge evals
    • Tests at scale, running thousands of evals per day
    • Allows over 25 stakeholders, including PMs and leadership, to contribute to evals
    Congrats to Bryant Chou and the Webflow team on the successful launch of their AI Assistant in October! We can't wait to see how Webflow continues to reinvent web development with AI.

  • Gentrace

    With the generative AI market expected to grow to $38.7 billion by 2030, enterprise teams are looking for tools to democratize LLM testing. We believe the future of LLM testing is UI-first and connected to your app, so PMs, engineering, and domain experts can work together in the same tool. Learn more about our vision in Unite.AI's story covering our Series A: https://lnkd.in/eUT_ZTAC

    Gentrace Secures $8M Series A to Revolutionize Generative AI Testing

    https://www.unite.ai

  • What if your LLM testing system could automatically optimize your prompts? That's where we're headed with Experiments, our new feature helping developers speed up last-mile tuning. Here's how it works: unlike prompt playgrounds, Experiments provides a testing environment connected to your real application. From there, anything in your app (prompts, top-K, parameters) becomes a knob you can control from Gentrace. To get started:
    1. Register your environment
    2. Define your test interactions
    3. Expose parameters to Gentrace
    Experiments is currently in beta, with prompt auto-optimization on the roadmap in 2025. Get started with our docs guide: https://lnkd.in/gNzkdC8H

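    The post above describes the workflow but not the API, so the following is only a hypothetical Python sketch of the underlying idea: application parameters (prompt template, top-K, temperature) are exposed as named knobs that an external testing environment can override per run. None of these names come from the Gentrace SDK.

    # Hypothetical sketch of "exposing parameters as knobs" to a test harness.
    # This is not the Gentrace SDK; all names here are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class Knobs:
        prompt_template: str = "Summarize the following support ticket:\n{ticket}"
        top_k: int = 5
        temperature: float = 0.2

    def run_interaction(ticket: str, knobs: Knobs, call_llm) -> str:
        """A registered test interaction; the harness passes overridden knobs per run."""
        prompt = knobs.prompt_template.format(ticket=ticket)
        return call_llm(prompt, top_k=knobs.top_k, temperature=knobs.temperature)

    # A testing environment can then sweep knob values without code changes:
    # for t in (0.0, 0.2, 0.7):
    #     run_interaction(sample_ticket, Knobs(temperature=t), call_llm)
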
  • Gentrace reposted this

    Doug Safreno

    Co-founder, CEO at Gentrace

    Big news today: Gentrace raised our $8M Series A led by Kojo Osei at Matrix. We're celebrating by launching Experiments, the first collaborative testing environment for LLM product development.
    This year, we've helped customers like Quizlet, Webflow, and Multiverse ship incredible LLM products. The one thing slowing everyone down? Evals. You can't get to production without them, but they're too hard to build and maintain. We're changing that.
    Our customers grew testing 40x after adopting Gentrace because of our approach, which connects a collaborative testing environment to your actual application code. Now PMs and engineers can work together to build evals that actually work. It's a massive improvement to testing that ultimately makes generative AI products more reliable for everyone.
    This journey wouldn't have been possible without our awesome angel investors: Yuhki Yamashita, Garrett Lord, Bryant Chou, Tuomas Artman, Martin Mao, David Cramer, Ben Sigelman, Steve Bartel, Cai GoGwilt, Linda Tong, Cristina Cordova, and many more. Thank you to our customers and team for believing in what we're building! Every day, we're inspired to make generative AI apps better because of your support.

  • Gentrace reposted this

    Doug Safreno

    Co-founder, CEO at Gentrace

    Most engineers approach LLM-as-a-judge all wrong. The usual high-level metrics like hallucination or safety rarely tell you if your app actually works as intended. Even with human evaluators, general metrics won't help them judge your app's performance in a useful way. Good evals are built on something specific to your app: a unique "unfair advantage" that gives the LLM clear criteria.
    LLM-as-a-judge is a widely used approach where an LLM grades another LLM's output. But without an unfair advantage, it can fall short in quality. Common problems:
    • Circular reasoning: how can an LLM grade what it itself generated?
    • Poor initial results lead teams back to vibes and manual grading.
    Imagine an LLM app tasked with writing emails. A bad eval would ask the LLM to rate itself on how well it followed the prompt: a circular question that adds little signal and won't help improve the model. Instead, give your LLM unfair advantages. Here's how:
    Add tailored asserts: specific criteria your model should meet. For instance, in our email example:
    • No footer
    • Directly asks a question
    • Includes the recipient's email
    Another approach is comparison to a reference output. You provide a high-quality reference email for comparison, so the LLM can check for:
    • Missing or extra information
    • Conciseness vs. verbosity
    To build reliable AI evaluations, always ask: how can I create an unfair advantage for the LLM? These targeted methods ensure your LLM-as-a-judge evals give you actionable insights into the quality of your app.
    At Gentrace, we're making it easy for AI engineers to build their own high-quality evals. My unfair advantage blog post shares more examples for using LLM-as-a-judge effectively: https://lnkd.in/gavtDhCY

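    To make the "tailored asserts" idea above concrete, here is a small sketch that checks the three email criteria from the post with plain Python before any LLM judge is involved. The footer markers and function names are assumptions for illustration, not Gentrace functionality.

    # Sketch of "tailored asserts" for the email example above: specific,
    # checkable criteria instead of asking the LLM to re-grade its own output.
    FOOTER_MARKERS = ("unsubscribe", "sent from my", "best regards")  # assumed signals

    def eval_email(email_body: str, recipient_email: str) -> dict:
        """Return one boolean per tailored assert; each is app-specific by design."""
        lowered = email_body.lower()
        return {
            "no_footer": not any(marker in lowered for marker in FOOTER_MARKERS),
            "asks_a_question": "?" in email_body,
            "includes_recipient_email": recipient_email.lower() in lowered,
        }

    # Example:
    # eval_email("Hi Sam, is sam@example.com still the best address for you?", "sam@example.com")
    # -> {"no_footer": True, "asks_a_question": True, "includes_recipient_email": True}
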
  • Gentrace

    New release, featuring production evaluator graphs and local evaluations / local test data.
    Production evaluator graphs: production evaluators now automatically create graphs to show how performance is trending over time. For example, you can create a "Safety" evaluator which uses LLM-as-a-judge to score whether an output is compliant with your AI safety policy. Then, you can see how the average output "Safety" trends over time.
    Local evaluations / local test data: Gentrace now allows you to more easily define local evaluations and use completely local data / datasets. This makes Gentrace work better with existing unit testing frameworks and patterns. It also makes Gentrace incrementally adoptable into homegrown testing stacks.
    More:
    • User-specific view settings can be saved and overridden from a URL
    • Filter test runs by their input values
    • Added explicit compare button
    • Pinecone v3 (Node) support
    • o1 support
    • Fixed 68 bugs

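    As a rough illustration of the "local evaluations / local test data" point above, here is a generic pytest-style sketch: a small in-repo dataset plus a local evaluator function, runnable inside an existing unit testing setup. It does not use the Gentrace SDK, and generate_answer is a stand-in for your application code.

    # Generic sketch: local test data plus a local evaluator, run with pytest.
    # `generate_answer` stands in for your application; replace it with real code.
    import pytest

    LOCAL_DATASET = [
        {"question": "What plan includes SSO?", "must_mention": "Enterprise"},
        {"question": "How do I reset my password?", "must_mention": "reset link"},
    ]

    def generate_answer(question: str) -> str:
        # Placeholder application code; in practice this calls your LLM pipeline.
        canned = {
            "What plan includes SSO?": "SSO is available on our Enterprise plan.",
            "How do I reset my password?": "Use the reset link under Settings > Security.",
        }
        return canned[question]

    def mentions_expected(output: str, expected: str) -> bool:
        """Local evaluator: checks that the expected phrase appears in the output."""
        return expected.lower() in output.lower()

    @pytest.mark.parametrize("case", LOCAL_DATASET, ids=lambda c: c["question"])
    def test_local_eval(case):
        output = generate_answer(case["question"])
        assert mentions_expected(output, case["must_mention"])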

Funding

Gentrace: 3 total funding rounds

Last round: Series A, US$ 8.0M

See more info on Crunchbase