The Future of Evals - Ankur Goyal, Braintrust

Overview & Current State of Evals 00:00

  • Braintrust has spent the past two years working with leading companies to help them run AI product evaluations (evals).
  • Organizations on Braintrust run an average of almost 13 evals daily; some customers run over 3,000 evals per day.
  • The most advanced users spend over two hours each day working through their evals within the product.
  • Current evaluation workflows focus primarily on reviewing dashboard data, which informs manual changes to code or prompts for improved performance.
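The workflow described above boils down to a simple loop: run a task over a dataset, score the outputs, and use the aggregate score to guide manual changes. A minimal conceptual sketch (illustrative only, not the Braintrust SDK; all names are assumptions):

```python
# Conceptual sketch of an eval: a dataset of inputs and expected outputs,
# a task function (standing in for an LLM call), and a scorer.

def exact_match(output: str, expected: str) -> float:
    """Score 1.0 if the output matches the expected answer exactly."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(dataset, task, scorer):
    """Run the task over each example and return the mean score."""
    scores = []
    for example in dataset:
        output = task(example["input"])
        scores.append(scorer(output, example["expected"]))
    return sum(scores) / len(scores)

# Toy data and a toy task standing in for a model.
dataset = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
task = lambda q: {"2+2": "4", "capital of France": "paris"}.get(q, "")
print(run_eval(dataset, task, exact_match))  # "paris" != "Paris" → 0.5
```

In practice the dashboard surfaces per-example scores as well as the aggregate, which is what informs the manual prompt and code changes.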

Introducing Loop: Automating Evals 01:57

  • “Loop” is a new agent integrated into Braintrust, aimed at automating significant parts of the eval process.
  • Loop leverages improvements in frontier AI models; the speaker cites Claude 4 as a major breakthrough, performing almost six times better than the previous top models.
  • The agent can automatically optimize prompts and handle complex agent setups, while also helping to create better datasets and scoring mechanisms.
  • Users can interact with Loop through the UI by previewing side-by-side edits to data, scorers, and prompts.
  • An optional setting allows Loop to fully automate optimizations for users who prefer less manual involvement.
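The kind of automation described above can be pictured as a greedy optimization loop: propose a prompt edit, re-run the eval, and keep the edit only if the score improves. A hedged sketch (illustrative only, not Braintrust's implementation; the helper names and toy scorer are assumptions):

```python
# Conceptual sketch of automated prompt optimization: accept a candidate
# edit only when it raises the aggregate eval score.

def optimize_prompt(prompt, propose_edit, evaluate, rounds=3):
    """Greedy loop: keep prompt edits that improve the eval score."""
    best_score = evaluate(prompt)
    for _ in range(rounds):
        candidate = propose_edit(prompt)  # e.g. an LLM suggesting a rewrite
        score = evaluate(candidate)       # re-run the eval suite
        if score > best_score:
            prompt, best_score = candidate, score
    return prompt, best_score

# Toy stand-ins: the scorer rewards longer, more specific prompts.
evaluate = lambda p: min(len(p) / 40, 1.0)
propose_edit = lambda p: p + " Answer concisely."
prompt, score = optimize_prompt("You are a helpful assistant.",
                                propose_edit, evaluate)
print(prompt, score)
```

The real agent goes beyond this greedy loop, since it also edits datasets and scorers, but the accept-if-score-improves structure is the core idea.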

Product Access & Future Outlook 03:04

  • Loop is currently available to all Braintrust users; it can be activated via a feature flag.
  • Users can select from various AI models, including Claude 4, OpenAI models, Gemini, or their own custom models.

Vision for the Future & Call to Action 04:05

  • Manual evaluation processes will soon be transformed by automated, model-driven tools like Loop.
  • Braintrust invites users to try Loop, provide feedback, and consider joining their team in UI, AI, and infrastructure roles.
  • Contact details and applications are available via a QR code presented during the talk.