The Future of Evals - Ankur Goyal, Braintrust

Overview & Current State of Evals 00:00

  • Braintrust has spent the past two years working with leading companies to help them run AI product evaluations (evals).
  • Organizations on Braintrust run an average of almost 13 evals daily; some customers run over 3,000 evals per day.
  • The most advanced users spend over two hours each day working through their evals within the product.
  • Current evaluation workflows focus primarily on reviewing dashboard data, which informs manual changes to code or prompts for improved performance.
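The workflow described above boils down to a simple loop: run a task over a dataset, score the outputs, and use the aggregate score to guide manual changes. A minimal conceptual sketch (illustrative only, not the Braintrust SDK; all names are assumptions):

```python
# Conceptual sketch of an eval: a dataset of inputs and expected outputs,
# a task function (standing in for an LLM call), and a scorer.

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the output matches the expected answer exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(dataset, task, scorer):
    """Run the task over each example and return the mean score."""
    scores = []
    for example in dataset:
        output = task(example["input"])
        scores.append(scorer(output, example["expected"]))
    return sum(scores) / len(scores)

# Toy data and a toy task standing in for a model.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
task = lambda q: {"2+2": "4", "capital of France": "paris"}.get(q, "")
print(run_eval(dataset, task, exact_match))  # "paris" != "Paris" → 0.5
```

In practice the dashboard surfaces per-example scores as well as the aggregate, which is what informs the manual prompt and code changes.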

Introducing Loop: Automating Evals 01:57

  • “Loop” is a new agent integrated into Braintrust, aimed at automating significant parts of the eval process.
  • Loop leverages improvements in frontier AI models; the speaker cites Claude 4 as a major breakthrough, performing almost six times better than the previous top models.
  • The agent can automatically optimize prompts and handle complex agent setups, while also helping to create better datasets and scoring mechanisms.
  • Users can interact with Loop through the UI by previewing side-by-side edits to data, scorers, and prompts.
  • An optional setting allows Loop to fully automate optimizations for users who prefer less manual involvement.
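The kind of automation described above can be pictured as a greedy optimization loop: propose a prompt edit, re-run the eval, and keep the edit only if the score improves. A hedged sketch (illustrative only, not Braintrust's implementation; the helper names and toy scorer are assumptions):

```python
# Conceptual sketch of automated prompt optimization: accept a candidate
# edit only when it raises the aggregate eval score.

def optimize_prompt(prompt, propose_edit, evaluate, rounds=3):
    """Greedy loop: keep prompt edits that improve the eval score."""
    best_score = evaluate(prompt)
    for _ in range(rounds):
        candidate = propose_edit(prompt)  # e.g. an LLM suggesting a rewrite
        score = evaluate(candidate)       # re-run the eval suite
        if score > best_score:
            prompt, best_score = candidate, score
    return prompt, best_score

# Toy stand-ins: the scorer rewards longer, more specific prompts.
evaluate = lambda p: min(len(p) / 40, 1.0)
propose_edit = lambda p: p + " Answer concisely."
prompt, score = optimize_prompt("You are a helpful assistant.",
                                propose_edit, evaluate)
print(prompt, score)
```

The real agent goes beyond this greedy loop, since it also edits datasets and scorers, but the accept-if-score-improves structure is the core idea.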

Product Access & Future Outlook 03:04

  • Loop is currently available to all Braintrust users; it can be activated via a feature flag.
  • Users can select from various AI models, including Claude 4, OpenAI models, Gemini, or their own custom models.

Vision for the Future & Call to Action 04:05

  • Manual evaluation processes will soon be transformed by automated, model-driven tools like Loop.
  • Braintrust invites users to try Loop, provide feedback, and consider joining their team in UI, AI, and infrastructure roles.
  • Contact details and applications are available via a QR code presented during the talk.