Evals 101 — Doug Guthrie, Braintrust

Introduction to Braintrust and Evals 00:21

  • Doug Guthrie, a solutions engineer at Braintrust, introduced the platform as an end-to-end developer solution for building AI products, specifically focusing on evaluations (evals).
  • Braintrust's CEO, Ankur Goyal, developed the core idea behind the platform in previous roles, which led to its current form.
  • Many companies currently use Braintrust in production for evals and observability of their Generative AI applications.
  • The importance of evals is highlighted by tech luminaries in the AI space.

The Value Proposition of Evals 03:13

  • Evals help answer critical questions, such as whether an application is improving or degrading when the underlying model or prompt is changed.
  • They provide a rigorous process for building with large language models (LLMs), which often have non-deterministic outputs, making application development difficult without evals.
  • Evals can detect regressions, much like unit tests ("defense"), but they also let teams "play offense" by bringing rigor to iterative development.
  • Running both offline (pre-production) and online (production) evals creates a powerful "flywheel effect" or feedback loop.
  • This flywheel effect helps cut development time, enhances application quality, and connects real-life user feedback from production back into offline development.
  • Customers using Braintrust have reported outcomes such as faster development, increased deployment of AI features, and higher application quality.

Core Concepts and Components of Evals 05:56

  • The Braintrust platform facilitates a "flywheel effect" by integrating prompt engineering (using a playground as an IDE for LLM outputs) with observability (logging, human review, and user feedback).
  • An eval is defined as a structured test that checks how well an AI system performs, measuring quality, reliability, and correctness.
  • The essential ingredients for an eval (see the SDK sketch after this list) are:
    • Task: The code or prompt to be evaluated, ranging from a single prompt to a complex agentic workflow, which takes an input and produces an output.
    • Data set: Real-world examples against which the task is run to assess application performance.
    • Scores: The logic used to evaluate performance, including:
      • LLM as a judge: An LLM assesses the output based on defined criteria (e.g., excellent, fair, poor), mapping to numerical scores (e.g., 1, 0.5, 0).
      • Code-based scores: More heuristic or binary conditions written in code.
  • Evals operate in two modes:
    • Offline evals: Run pre-production, with defined tasks and scores, to iterate on and resolve issues before deployment.
    • Online evals: Real-time tracing of applications in production, logging model inputs, outputs, intermediate steps, and tool calls to diagnose performance, reliability, and latency issues, and track metrics like cost and tokens.
  • For getting started, it's recommended to begin small and iterate, establishing a baseline rather than waiting to create a "golden data set."
  • A simple matrix helps guide iteration: if the output looks good but the score is low, or the output is bad but the score is high, the mismatch signals that the scorers themselves need improvement.
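
For concreteness, here is a minimal sketch of how the three ingredients (data, task, scores) come together in the Braintrust Python SDK; the project name and toy rows are placeholders, not examples from the talk.

```python
# Minimal offline eval: dataset, task, and scores are the three ingredients.
# Project name and example rows are illustrative placeholders.
from braintrust import Eval
from autoevals import Levenshtein


def task(input):
    # The "task" under test: in a real application this would call a prompt,
    # an LLM, or a full agentic workflow.
    return "Hi " + input


Eval(
    "Evals-101-demo",  # project that the resulting experiment is logged under
    data=lambda: [
        {"input": "Alice", "expected": "Hi Alice"},
        {"input": "Bob", "expected": "Hi Bob"},
    ],
    task=task,
    scores=[Levenshtein],  # a code-based scorer from the autoevals package
)
```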

Deep Dive into Braintrust Platform for Evals 10:41

  • The Braintrust platform allows users to define a Task (prompt or agentic workflow) by specifying an underlying model, system prompt, tools, and mustache templating for variables like user questions or chat history.
  • The platform supports multi-turn chat scenarios and the ability to chain prompts, where the output of one prompt becomes the input for the next, facilitating agentic workflows.
  • Data Sets are test cases that can be sourced from real-world examples or added from production logs (spans/traces). Inputs are required; expected outputs are optional (needed for certain scores, such as Levenshtein), as is metadata for filtering.
  • Human review is crucial for filtering production logs and adding relevant test cases to data sets for offline evals.
  • Scores can be code-based (TypeScript or Python) defined either within the UI or pushed from a codebase for version control.
  • The "LLM as a judge" score type allows an LLM to evaluate output based on user-defined criteria and assign numerical scores.
  • Braintrust maintains a package called autoevals with out-of-the-box LLM-as-a-judge and code-based scorers (e.g., Levenshtein) for quick baselining.
  • Tips for effective scoring:
    • Use a higher-quality model for the judge, even if the application prompt uses a cheaper model.
    • Break scoring into focused areas (e.g., separate accuracy, formatting, and correctness scores).
    • Test score prompts in the playground.
    • Avoid overloading score prompts with excessive context.
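
As a rough sketch of the LLM-as-a-judge pattern, autoevals exposes an LLMClassifier that maps a judge's choice to a numerical score; the rubric, choices, and judge model below are illustrative, not from the talk.

```python
# Sketch of an LLM-as-a-judge scorer: the judge picks a grade, and each grade
# maps to a numerical score. The rubric and model choice are illustrative.
from autoevals import LLMClassifier

answer_quality = LLMClassifier(
    name="AnswerQuality",
    prompt_template=(
        "Question: {{input}}\n"
        "Answer: {{output}}\n\n"
        "Grade the answer: (A) excellent, (B) fair, (C) poor."
    ),
    choice_scores={"A": 1, "B": 0.5, "C": 0},
    use_cot=True,    # ask the judge to reason before choosing a grade
    model="gpt-4o",  # per the tips above, use a higher-quality judge model
)
```

A scorer defined this way can be passed in an eval's scores list alongside code-based scorers.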

Running Evals and Iteration 17:13

  • Playgrounds are the primary environment for rapid iteration, where users can load prompts, agents, data sets, and scores, then run them to understand task performance.
  • The platform makes it easy to compare different underlying models (e.g., GPT-4) and prompts, showing their impact on individual scores (e.g., completeness, accuracy, formatting); a sketch of running the same comparison via the SDK follows this list.
  • A new "loop" feature enables AI to optimize prompts by analyzing evaluation results, making changes, and re-running evals to determine if performance improved.
  • Experiments provide snapshots of evals over time, allowing users to track application performance trends (e.g., over a month or six months) and ensure continuous improvement.
  • Evals can be executed directly from the Braintrust platform UI or programmatically via its SDK.
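
The sketch below illustrates the comparison done programmatically: the same dataset and scorers run against two models as separate experiments. The model names, project name, and the experiment_name parameter are assumptions for illustration.

```python
# Illustrative sketch: evaluate two candidate models on the same dataset and
# scorers, producing two experiments to compare side by side in Braintrust.
# Model names, project name, and the experiment_name parameter are assumptions.
from braintrust import Eval, wrap_openai
from autoevals import Levenshtein
from openai import OpenAI

client = wrap_openai(OpenAI())  # wrapped client so LLM calls are traced


def make_task(model: str):
    def task(input):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": input}],
        )
        return resp.choices[0].message.content

    return task


for model in ["gpt-4o", "gpt-4o-mini"]:
    Eval(
        "Evals-101-demo",
        experiment_name=f"greeting-{model}",
        data=lambda: [{"input": "Say hi to Alice", "expected": "Hi Alice"}],
        task=make_task(model),
        scores=[Levenshtein],
    )
```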

Leveraging the SDK for Evals 23:48

  • Braintrust offers SDKs for Python, TypeScript, Go, Java, and Kotlin, with Python and TypeScript seeing the most use.
  • Users can define prompts, scores, and data sets within their codebase and push them into the Braintrust platform, enabling source control for these assets.
  • The SDK's Eval construct lets users define evals directly in code by specifying the data set, task, and scores, then run them with the results recorded in Braintrust.
  • This flexibility caters to different organizational workflows, allowing development to start either in the UI or directly in the codebase.
  • Evals can be integrated into CI/CD pipelines, enabling automated performance checks (e.g., ensuring scores don't drop below thresholds) as part of the development process; GitHub Action examples are available in the documentation, and a CI gate sketch follows this list.
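
A minimal sketch of such a gate, assuming the Python SDK's Eval call returns a result whose summary exposes per-scorer averages; the summary attribute names below are assumptions to verify against the SDK version in use.

```python
# Sketch of a CI gate: run an eval programmatically and fail the build if a
# score drops below a threshold. The shape of the returned summary object
# (result.summary.scores[...].score) is an assumption to verify against the SDK.
import sys

from braintrust import Eval
from autoevals import Levenshtein

result = Eval(
    "Evals-101-demo",
    data=lambda: [{"input": "Alice", "expected": "Hi Alice"}],
    task=lambda input: "Hi " + input,
    scores=[Levenshtein],
)

avg = result.summary.scores["Levenshtein"].score
if avg is not None and avg < 0.8:
    print(f"Levenshtein average {avg:.2f} fell below the 0.80 threshold")
    sys.exit(1)
```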

Online Evals and Production Monitoring 27:43

  • Moving to production involves setting up logging by instrumenting the application with Braintrust code, such as wrapping LLM clients or tracing specific functions (sketched after this list).
  • This enables measuring application quality on live traffic with defined scores and leveraging the "flywheel effect" by easily adding production logs back to data sets for offline evals.
  • Online scoring can be configured within the platform to run on incoming logs, with options to specify a sampling rate (e.g., 10%, 20%) to avoid scoring every log.
  • Scores can be applied to individual spans within a complex application (e.g., tool calls, RAG workflows) for granular insights into performance at each step.
  • Braintrust supports early regression alerts, allowing users to set up automations that trigger alerts if scores drop below a predefined threshold.
  • Custom views can be created to filter rich log information, making it easier for humans to review specific logs (e.g., those with zero user feedback).
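
A minimal sketch of that instrumentation step with the Python SDK, assuming an OpenAI-based code path; the project name, model, and handler are placeholders.

```python
# Sketch of instrumenting a production code path so requests are logged to
# Braintrust. Project name, model, and handler are illustrative placeholders.
from braintrust import init_logger, traced, wrap_openai
from openai import OpenAI

logger = init_logger(project="Evals-101-demo")  # spans are logged to this project
client = wrap_openai(OpenAI())                  # LLM calls are traced automatically


@traced  # creates a span for this function, capturing inputs, outputs, and timing
def answer_question(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content
```

Online scorers, sampling rates, and alerts are then configured in the platform against these logs rather than in application code.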

Human-in-the-Loop and Feedback 38:55

  • Human-in-the-loop interaction is critical for application quality and reliability, providing ground truth.
  • Two main types of human-in-the-loop interactions:
    • Human review: A dedicated interface within Braintrust allows users to easily parse through logs and add configured human review scores (e.g., free text comments, specific ratings).
    • User feedback: Direct feedback from end-users within the application (e.g., thumbs up/down, comments) can be logged to Braintrust (see the sketch after this list).
  • This user feedback can then be used to create filtered views for human review and fed back into data sets for offline evals, completing the flywheel.
  • Different roles may be involved in human review; larger organizations might have specialized "product manager mixed with LLM specialist" roles, while in smaller organizations, engineers might perform this task.
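
A hedged sketch of wiring end-user feedback back onto the logged request, assuming the Python SDK's log_feedback helper; the way the span id is threaded through the application is illustrative.

```python
# Sketch of attaching end-user feedback (thumbs up/down plus a comment) to the
# span that was logged for the original request. The log_feedback call and the
# way the span id reaches the frontend are assumptions for illustration.
from braintrust import current_span, init_logger, traced

logger = init_logger(project="Evals-101-demo")


@traced
def answer_question(question: str) -> dict:
    answer = "..."  # call the model or agent here
    # Return the span id so the frontend can reference this row in feedback.
    return {"answer": answer, "request_id": current_span().id}


def record_feedback(request_id: str, thumbs_up: bool, comment: str | None = None) -> None:
    # Attach the end user's reaction to the original logged span.
    logger.log_feedback(
        id=request_id,
        scores={"user_feedback": 1 if thumbs_up else 0},
        comment=comment,
    )
```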

Audience Questions 41:16

  • It is possible to have multiple models in production and compare their behavior, facilitating A/B testing of different models.
  • When multiple humans act as scorers, it's recommended to establish a rubric or guideline to ensure consistency in scoring criteria. The platform can show who scored what, but direct comparison of human scorer behavior is not a primary feature.
  • For LLM-as-a-judge scores used in online evals, customers often run "evals of evals" to assess the rationale and quality of the LLM's judgments.
  • The Braintrust tool supports pre-launch evaluations (offline evals) where subject matter experts can use large datasets to establish an accuracy baseline and build trust before an application is deployed.
  • All monitoring dashboard data is available via the SDK, allowing customers to pull data and build custom UIs or integrate with unified dashboards elsewhere.
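
As a loosely held sketch of pulling that data into a custom UI, the REST API can be queried directly with an API key; the endpoint path, query parameters, and response shape below are assumptions to check against Braintrust's API reference rather than details stated in the talk.

```python
# Heavily hedged sketch: fetch recent production logs for a project over the
# REST API to feed a custom dashboard. The endpoint path, query parameters,
# and response shape are assumptions to verify against the API reference.
import os

import requests

API_KEY = os.environ["BRAINTRUST_API_KEY"]
PROJECT_ID = os.environ["BRAINTRUST_PROJECT_ID"]

resp = requests.get(
    f"https://api.braintrust.dev/v1/project_logs/{PROJECT_ID}/fetch",
    headers={"Authorization": f"Bearer {API_KEY}"},
    params={"limit": 100},
    timeout=30,
)
resp.raise_for_status()
for event in resp.json().get("events", []):
    print(event.get("id"), event.get("scores"))
```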