[Evals Workshop] Mastering AI Evaluation: From Playground to Production

Introduction & Agenda 00:00

  • The workshop introduces Braintrust and focuses on mastering AI evaluation (evals) from initial experimentation to production deployment.
  • Attendees are encouraged to join the Slack channel for materials and real-time assistance.
  • The session alternates between lectures (slides) and hands-on activities, covering offline and online evaluation, SDK usage, production logging, and human-in-the-loop processes.

Why Evaluate AI? 03:25

  • Evals are needed to answer key questions: which model to choose, whether the cost is justified, how edge cases are handled, whether outputs stay on brand, where bugs appear, and whether each iteration is actually an improvement.
  • Even top-performing LLMs can have inconsistent results, hallucinations, or regressions from prompt/model changes.
  • Evals enable faster development, reduced costs (by replacing manual review), optimized model selection, improved release cycles, quality assurance, and scalable team collaboration.

Braintrust Core Concepts 06:31

  • Focus areas: prompt engineering, evaluating prompt/model changes, and AI system observability.
  • Evals are structured tests measuring quality, reliability, and correctness across scenarios.
  • Three key components: a task (the function or prompt that turns an input into the output being evaluated), a data set (examples/test cases), and a score (evaluation logic, either LLM-as-a-judge or code-based, producing values from 0 to 1); see the sketch below.
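
A minimal sketch of how these three components map onto the SDK (Python shown); the project name and toy task are placeholders, not examples from the workshop:

```python
# Requires BRAINTRUST_API_KEY in the environment.
from braintrust import Eval
from autoevals import Levenshtein  # code-based string-similarity scorer

Eval(
    "Support-Bot",  # hypothetical project name
    data=lambda: [  # data set: example inputs with expected outputs
        {"input": "Hi", "expected": "Hello! How can I help you today?"},
    ],
    task=lambda input: "Hello! How can I help you today?",  # stand-in for the real LLM call
    scores=[Levenshtein],  # score: produces a 0-1 value per example
)
```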

Data Sets and Scoring 08:47

  • Data sets can start with synthetic examples (possibly AI-generated) and iteratively expand with real user logs.
  • Keep data sets small initially; continuous augmentation with real interactions is recommended.
  • Scoring types:
    • LLM-as-a-judge: for qualitative, subjective criteria (use a stronger model as the judge, focus each judge on a single criterion, and validate judge prompts against human judgment).
    • Code-based: deterministic, for binary/objective checks.
    • Using both scoring types together provides a fuller picture; a sketch of each follows this list.
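
A sketch of what the two scorer types look like in code, assuming the autoevals library for the judge; the refund-policy check and example strings are illustrative, and the judge call needs an LLM provider key:

```python
from autoevals import Factuality  # LLM-as-a-judge scorer

# Code-based scorer: deterministic, binary/objective check (illustrative criterion).
def mentions_refund(input, output, expected=None):
    return 1.0 if "refund" in output.lower() else 0.0

# LLM-as-a-judge scorer: qualitative comparison against the expected answer.
judge = Factuality()
result = judge(
    output="Paris is the capital of France.",
    expected="The capital of France is Paris.",
    input="What is the capital of France?",
)
print(result.score)  # value between 0 and 1
```

Both kinds of scorers can be passed together in an eval's scores list.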

Offline vs Online Evals 09:20

  • Offline evals: used during development to catch issues before release, typically run in the Braintrust Playground or via the SDK.
  • Online evals: Measure production traffic in real time, allowing diagnosis, monitoring, and feedback integration for live systems.

Setting Up and Running Evals (Playground & SDK) 20:32

  • Project setup involves importing code, prompts, scores, and data sets into Braintrust, and typically requires an API key for OpenAI or another AI provider.
  • The Braintrust Playground enables quick iteration and A/B testing of prompts and models, with the ability to track historical results via experiments.
  • The SDK (JavaScript/TypeScript and Python) lets you create, source-control, and version prompts, datasets, and scores in code, as sketched below.
  • Evals pushed or run via the SDK can be tied to CI workflows, ensuring changes are continuously measured.
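
A sketch of scripted data set creation with the Python SDK; the project and dataset names are placeholders:

```python
import braintrust

# Create (or attach to) a dataset in the project; records are versioned as they change.
dataset = braintrust.init_dataset(project="Support-Bot", name="edge-cases")

dataset.insert(
    input="My order arrived damaged, what should I do?",
    expected="Apologize, offer a replacement or refund, and link the claims form.",
    metadata={"source": "synthetic"},
)

print(dataset.summarize())
```

Eval scripts defined with the SDK can then be run in CI with the braintrust eval command (npx braintrust eval for TypeScript), which is how changes get continuously measured.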

Experiments and Best Practices 31:23

  • Playgrounds are for fast, ephemeral iterations; experiments track longer-term, historical results and performance trends.
  • You can compare different models, prompts, or configurations, and analyze how changes affect performance over time (see the sketch below).
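
One way to record a run as a named experiment so it appears in the historical comparison views; the project, experiment name, scores, and metadata below are placeholders:

```python
import braintrust

experiment = braintrust.init(project="Support-Bot", experiment="gpt-4o-mini-prompt-v2")

experiment.log(
    input="Where is my order?",
    output="Your order shipped yesterday and should arrive Friday.",
    expected="Look up the order and share tracking details.",
    scores={"helpfulness": 0.9},        # any 0-1 scores you compute
    metadata={"model": "gpt-4o-mini"},  # tag the configuration being compared
)

print(experiment.summarize())
```

Re-running the same script with a different model or prompt under a new experiment name lets the two runs be compared side by side.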

Logging, Observability, and Online Scoring 55:26

  • Production logging through SDK wrappers enables observability: input/output tracking, prompt usage, latency, token usage, errors, and cost (see the sketch below).
  • Logs from live user traffic can be sampled and scored online; notably good or poor outputs can be flagged for further review or improvement.
  • Custom views in Braintrust let teams filter, sort, and save important log perspectives for collaboration and rapid issue detection.
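
A sketch of the wrapper-based logging path in Python; the project name, model, and prompt are placeholders:

```python
import os
from braintrust import init_logger, wrap_openai
from openai import OpenAI

# Send production traces (inputs, outputs, latency, token usage) to Braintrust.
logger = init_logger(project="Support-Bot")

# Wrapping the client logs every completion call automatically.
client = wrap_openai(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Where is my order?"}],
)
print(response.choices[0].message.content)
```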

Human-in-the-Loop & Feedback Loops 70:06

  • Human review remains vital for catching nuanced errors, hallucinations, or validating LLM-as-a-judge outputs.
  • Two primary modes:
    • Manual review/annotation via the Braintrust interface (by SMEs, PMs, etc.).
    • In-app user feedback (e.g., thumbs up/down and comments) logged directly back to Braintrust, as sketched below.
  • Both feedback channels feed into data sets, strengthening future evals and ground-truth alignment.
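
A sketch of wiring in-app feedback back to the original log entry, assuming the SDK's log_feedback helper and a span_id captured when the response was first logged; names are placeholders:

```python
from braintrust import init_logger

logger = init_logger(project="Support-Bot")  # placeholder project name

def record_thumbs(span_id, thumbs_up, comment=None):
    # Attach the user's rating and optional comment to the logged response.
    logger.log_feedback(
        id=span_id,
        scores={"user_feedback": 1 if thumbs_up else 0},
        comment=comment,
    )
```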

Iteration and Continuous Improvement 76:05

  • Recommended approach: start with minimal data (even 10 examples and 1-2 scores), and iterate rather than waiting for a large data set or "perfect" scoring mechanism.
  • As application logic evolves (e.g., more steps or turns), ensure evals reference the updated tasks; maintenance stays minimal if evals are integrated via the SDK.
  • Continuous feedback from online logs, human reviews, and updated data ensures evals and the system evolve in tandem.

Q&A and Advanced Topics 46:44, 80:43

  • Addressed non-determinism concerns with LLM-as-a-judge scorers (the suggestion: run multiple trials and average the scores; a sketch follows this list).
  • Discussed balancing code-based, ML-based, and LLM-based scorers for different scenarios.
  • Clarified integration with other frameworks (e.g., LangSmith), ease of wrapping models/prompts for logging/evaluation, and creation of data sets from logs.
  • Explored the role of few-shot prompting using collected data/examples.
  • Discussed how to handle evolving application logic and how to keep evals relevant as code and workflows change.
  • Reviewed human annotation roles and practices, noting that complex domains (e.g., healthcare) may call for external annotators.
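
A sketch of the trial-and-average suggestion for judge non-determinism, using autoevals' Factuality as an illustrative judge:

```python
from autoevals import Factuality

def averaged_factuality(input, output, expected, trials=3):
    # Run the LLM judge several times and average to smooth out run-to-run variance.
    judge = Factuality()
    scores = [
        judge(output=output, expected=expected, input=input).score
        for _ in range(trials)
    ]
    return sum(scores) / len(scores)
```

A scorer like this can be dropped into an eval's scores list in place of a single judge call.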

Conclusion 84:46

  • The workshop reaffirms the value of early, iterative evals and of combining automated and human-driven evaluation.
  • Participants are encouraged to visit the booth or reach out with further questions.