[Evals Workshop] Mastering AI Evaluation: From Playground to Production

Introduction & Agenda 00:00

  • The workshop introduces Braintrust and focuses on mastering AI evaluation (evals) from initial experimentation to production deployment.
  • Attendees are encouraged to join the Slack channel for materials and real-time assistance.
  • The session alternates between lectures (slides) and hands-on activities, covering offline and online evaluation, SDK usage, production logging, and human-in-the-loop processes.

Why Evaluate AI? 03:25

  • Evals are needed to answer key questions: which model to choose, whether the cost is justified, how edge cases are handled, whether outputs stay on brand, where bugs appear, and whether each iteration is actually an improvement.
  • Even top-performing LLMs can have inconsistent results, hallucinations, or regressions from prompt/model changes.
  • Evals enable faster development, reduced costs (by replacing manual review), optimized model selection, improved release cycles, quality assurance, and scalable team collaboration.

Braintrust Core Concepts 06:31

  • Focus areas: prompt engineering, evaluating prompt/model changes, and AI system observability.
  • Evals are structured tests measuring quality, reliability, and correctness across scenarios.
  • Three key components: a task (the function or prompt that turns an input into the output being evaluated), a data set (examples/test cases), and a score (evaluation logic, either LLM-as-a-judge or code-based, producing values from 0 to 1); see the sketch below.
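
A minimal sketch of how these three components map onto the SDK (Python shown); the project name and toy task are placeholders, not examples from the workshop:

```python
# Requires BRAINTRUST_API_KEY in the environment.
from braintrust import Eval
from autoevals import Levenshtein  # code-based string-similarity scorer

Eval(
    "Support-Bot",  # hypothetical project name
    data=lambda: [  # data set: example inputs with expected outputs
        {"input": "Hi", "expected": "Hello! How can I help you today?"},
    ],
    task=lambda input: "Hello! How can I help you today?",  # stand-in for the real LLM call
    scores=[Levenshtein],  # score: produces a 0-1 value per example
)
```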

Data Sets and Scoring 08:47

  • Data sets can start with synthetic examples (possibly AI-generated) and iteratively expand with real user logs.
  • Keep data sets small initially; continuous augmentation with real interactions is recommended.
  • Scoring types:
    • LLM-as-a-judge: for qualitative, subjective criteria (use a stronger model as the judge, focus each judge on a single criterion, and validate judge prompts against human judgment).
    • Code-based: deterministic, for binary/objective checks.
    • Using both scoring types together provides a fuller picture; a sketch of each follows this list.
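
A sketch of what the two scorer types look like in code, assuming the autoevals library for the judge; the refund-policy check and example strings are illustrative, and the judge call needs an LLM provider key:

```python
from autoevals import Factuality  # LLM-as-a-judge scorer

# Code-based scorer: deterministic, binary/objective check (illustrative criterion).
def mentions_refund(input, output, expected=None):
    return 1.0 if "refund" in output.lower() else 0.0

# LLM-as-a-judge scorer: qualitative comparison against the expected answer.
judge = Factuality()
result = judge(
    output="Paris is the capital of France.",
    expected="The capital of France is Paris.",
    input="What is the capital of France?",
)
print(result.score)  # value between 0 and 1
```

Both kinds of scorers can be passed together in an eval's scores list.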

Offline vs Online Evals 09:20

  • Offline evals: used during development to catch issues before release, typically run in the Braintrust Playground or via the SDK.
  • Online evals: Measure production traffic in real time, allowing diagnosis, monitoring, and feedback integration for live systems.

Setting Up and Running Evals (Playground & SDK) 20:32

  • Project setup involves importing code, prompts, scores, and data sets into Braintrust, and typically requires an API key for OpenAI or another AI provider.
  • The Braintrust Playground enables quick iteration and A/B testing of prompts and models, with the ability to track historical results via experiments.
  • The SDK (JavaScript/TypeScript and Python) lets you create, source-control, and version prompts, datasets, and scores in code, as sketched below.
  • Evals pushed or run via the SDK can be tied to CI workflows, ensuring changes are continuously measured.
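
A sketch of scripted data set creation with the Python SDK; the project and dataset names are placeholders:

```python
import braintrust

# Create (or attach to) a dataset in the project; records are versioned as they change.
dataset = braintrust.init_dataset(project="Support-Bot", name="edge-cases")

dataset.insert(
    input="My order arrived damaged, what should I do?",
    expected="Apologize, offer a replacement or refund, and link the claims form.",
    metadata={"source": "synthetic"},
)

print(dataset.summarize())
```

Eval scripts defined with the SDK can then be run in CI with the braintrust eval command (npx braintrust eval for TypeScript), which is how changes get continuously measured.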

Experiments and Best Practices 31:23

  • Playgrounds are for fast, ephemeral iterations; experiments track longer-term, historical results and performance trends.
  • You can compare different models, prompts, or configurations, and analyze how changes affect performance over time (see the sketch below).
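
One way to record a run as a named experiment so it appears in the historical comparison views; the project, experiment name, scores, and metadata below are placeholders:

```python
import braintrust

experiment = braintrust.init(project="Support-Bot", experiment="gpt-4o-mini-prompt-v2")

experiment.log(
    input="Where is my order?",
    output="Your order shipped yesterday and should arrive Friday.",
    expected="Look up the order and share tracking details.",
    scores={"helpfulness": 0.9},        # any 0-1 scores you compute
    metadata={"model": "gpt-4o-mini"},  # tag the configuration being compared
)

print(experiment.summarize())
```

Re-running the same script with a different model or prompt under a new experiment name lets the two runs be compared side by side.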

Logging, Observability, and Online Scoring 55:26

  • Production logging through SDK wrappers enables observability: input/output tracking, prompt usage, latency, token usage, errors, and cost (see the sketch below).
  • Logs from live user traffic can be sampled and scored online; notably good or poor outputs can be flagged for further review or improvement.
  • Custom views in Braintrust let teams filter, sort, and save important log perspectives for collaboration and rapid issue detection.
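
A sketch of the wrapper-based logging path in Python; the project name, model, and prompt are placeholders:

```python
import os
from braintrust import init_logger, wrap_openai
from openai import OpenAI

# Send production traces (inputs, outputs, latency, token usage) to Braintrust.
logger = init_logger(project="Support-Bot")

# Wrapping the client logs every completion call automatically.
client = wrap_openai(OpenAI(api_key=os.environ["OPENAI_API_KEY"]))

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Where is my order?"}],
)
print(response.choices[0].message.content)
```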

Human-in-the-Loop & Feedback Loops 70:06

  • Human review remains vital for catching nuanced errors, hallucinations, or validating LLM-as-a-judge outputs.
  • Two primary modes:
    • Manual review/annotation via the Braintrust interface (by SMEs, PMs, etc.).
    • In-app user feedback (e.g., thumbs up/down and comments) logged directly back to Braintrust, as sketched below.
  • Both feedback channels feed into data sets, strengthening future evals and ground-truth alignment.
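
A sketch of wiring in-app feedback back to the original log entry, assuming the SDK's log_feedback helper and a span_id captured when the response was first logged; names are placeholders:

```python
from braintrust import init_logger

logger = init_logger(project="Support-Bot")  # placeholder project name

def record_thumbs(span_id, thumbs_up, comment=None):
    # Attach the user's rating and optional comment to the logged response.
    logger.log_feedback(
        id=span_id,
        scores={"user_feedback": 1 if thumbs_up else 0},
        comment=comment,
    )
```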

Iteration and Continuous Improvement 76:05

  • Recommended approach: start with minimal data (even 10 examples and 1-2 scores), and iterate rather than waiting for a large data set or "perfect" scoring mechanism.
  • As application logic evolves (e.g., more steps or turns), ensure evals reference the updated tasks; maintenance stays minimal if evals are integrated via the SDK.
  • Continuous feedback from online logs, human reviews, and updated data ensures evals and the system evolve in tandem.

Q&A and Advanced Topics 46:44, 80:43

  • Addressed non-determinism concerns with LLM-as-a-judge scorers (the suggestion: run multiple trials and average the scores; a sketch follows this list).
  • Discussed balancing code-based, ML-based, and LLM-based scorers for different scenarios.
  • Clarified integration with other frameworks (e.g., LangSmith), ease of wrapping models/prompts for logging/evaluation, and creation of data sets from logs.
  • Explored the role of few-shot prompting using collected data/examples.
  • Discussed how to handle evolving application logic and how to keep evals relevant as code and workflows change.
  • Reviewed human annotation roles and practices, noting that complex domains (e.g., healthcare) may call for external annotators.
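
A sketch of the trial-and-average suggestion for judge non-determinism, using autoevals' Factuality as an illustrative judge:

```python
from autoevals import Factuality

def averaged_factuality(input, output, expected, trials=3):
    # Run the LLM judge several times and average to smooth out run-to-run variance.
    judge = Factuality()
    scores = [
        judge(output=output, expected=expected, input=input).score
        for _ in range(trials)
    ]
    return sum(scores) / len(scores)
```

A scorer like this can be dropped into an eval's scores list in place of a single judge call.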

Conclusion 84:46

  • The workshop reaffirms the value of early, iterative evals and of combining automated and human-driven evaluation.
  • Participants are encouraged to visit the booth or reach out with further questions.