The workshop introduces Braintrust and focuses on mastering AI evaluations (evals), from initial experimentation to production deployment.
Attendees are encouraged to join the Slack channel for materials and real-time assistance.
The session alternates between lectures (slides) and hands-on activities, covering offline and online evaluation, SDK usage, production logging, and human-in-the-loop processes.
Evals are necessary to address key concerns: model selection, cost effectiveness, edge case handling, brand consistency, bug detection, and iterative improvement.
Even top-performing LLMs can have inconsistent results, hallucinations, or regressions from prompt/model changes.
Evals enable faster development, reduced costs (by replacing manual review), optimized model selection, improved release cycles, quality assurance, and scalable team collaboration.
Focus areas: prompt engineering, evaluating prompt/model changes, and AI system observability.
Evals are structured tests measuring quality, reliability, and correctness across scenarios.
Three key components: a task (the input/output being evaluated), a data set (examples/test cases), and a scorer (the evaluation logic, either LLM-as-a-judge or code-based, producing values from 0 to 1).
Recommended approach: start with minimal data (even 10 examples and one or two scorers) and iterate, rather than waiting for a large data set or a "perfect" scoring mechanism.
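A minimal sketch of these three pieces, assuming the Braintrust Python SDK and the autoevals package (the project name, task function, and test cases here are placeholders):

```python
from braintrust import Eval
from autoevals import Levenshtein


def answer_question(question: str) -> str:
    # Stand-in for the real application logic (e.g., an LLM call).
    return "Paris" if "France" in question else "unknown"


Eval(
    "workshop-demo",  # hypothetical project name
    data=lambda: [  # a handful of hand-written cases is enough to start
        {"input": "What is the capital of France?", "expected": "Paris"},
        {"input": "What is the capital of Japan?", "expected": "Tokyo"},
    ],
    task=answer_question,  # the task under evaluation
    scores=[Levenshtein],  # code-based scorer producing a 0-1 value
)
```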
As application logic evolves (e.g., more steps or turns), ensure evals reference the updated task; when the task is wired in via the SDK, this requires minimal maintenance.
Continuous feedback from online logs, human reviews, and updated data ensures evals and the system evolve in tandem.
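One way to keep the two in sync is to reuse the production task function for both online logging and offline evals. A minimal sketch, assuming the Braintrust SDK's init_logger, wrap_openai, and traced helpers (project and model names are placeholders):

```python
import braintrust
from openai import OpenAI

# Online logging: every call through the wrapped client is traced to Braintrust.
braintrust.init_logger(project="workshop-demo")  # hypothetical project name
client = braintrust.wrap_openai(OpenAI())


@braintrust.traced
def answer_question(question: str) -> str:
    # Production code path; passing this same function as the eval `task`
    # means offline evals exercise whatever logic is currently deployed.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content
```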
Addressed non-determinism concerns with LLM-as-a-judge scorers (suggested running multiple trials and averaging the scores).
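A generic sketch of that suggestion, where judge stands in for any LLM-as-a-judge callable returning a 0-1 float (a hypothetical interface):

```python
import statistics
from typing import Callable, Optional


def averaged_judge(judge: Callable[..., float], trials: int = 5):
    """Wrap a non-deterministic LLM-as-a-judge scorer and return the mean
    score over several trials to smooth out run-to-run variance."""

    def score(input: str, output: str, expected: Optional[str] = None) -> float:
        runs = [judge(input=input, output=output, expected=expected) for _ in range(trials)]
        return statistics.mean(runs)

    return score
```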
Discussed balancing code-based, ML-based, and LLM-based scorers for different scenarios.
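For illustration, a single eval can mix all three scorer types; this sketch assumes the autoevals package's Levenshtein (code-based), EmbeddingSimilarity (ML/embedding-based), and Factuality (LLM-as-a-judge) scorers:

```python
from autoevals import EmbeddingSimilarity, Factuality, Levenshtein

# Cheap code-based checks run on every case, an embedding-based scorer covers
# semantic similarity, and the LLM judge handles criteria hard to express in code.
# Passed as `scores=scores` to an Eval (as in the earlier sketch).
scores = [Levenshtein, EmbeddingSimilarity, Factuality]
```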
Clarified integration with other frameworks (e.g., LangSmith), ease of wrapping models/prompts for logging/evaluation, and creation of data sets from logs.
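A minimal sketch of turning logged production cases into a reusable data set, assuming the Braintrust SDK's init_dataset/insert interface (names and fields are placeholders):

```python
import braintrust

# Curate interesting production cases (e.g., logs flagged during review)
# into a data set that future evals can run against.
dataset = braintrust.init_dataset(project="workshop-demo", name="curated-from-logs")
dataset.insert(
    input="What is the capital of France?",
    expected="Paris",
    metadata={"source": "production-log"},
)
# The dataset object can then be passed as `data=dataset` in an Eval.
```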
Explored the role of few-shot prompting using collected data/examples.
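A generic sketch of reusing collected examples as few-shot demonstrations (the field names mirror the eval cases above and are assumptions):

```python
def build_few_shot_prompt(examples: list[dict], question: str) -> str:
    """Turn a handful of reviewed input/expected pairs (e.g., pulled from a
    curated data set) into demonstrations prepended to the prompt."""
    shots = "\n\n".join(f"Q: {ex['input']}\nA: {ex['expected']}" for ex in examples)
    return f"{shots}\n\nQ: {question}\nA:"


prompt = build_few_shot_prompt(
    [{"input": "What is the capital of France?", "expected": "Paris"}],
    "What is the capital of Japan?",
)
```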
Discussed how to handle evolving application logic and keep evals relevant as code and workflows change.
Human annotation roles and practices were discussed, with a note on using external annotators for complex domains (e.g., healthcare).