[Full Workshop] Building Metrics that actually work — David Karam, Pi Labs (fmr Google Search)

Introduction & Workshop Objectives 00:00

  • The session begins with participant introductions and a discussion about motivations for attending, particularly around challenges with evaluation (eval) in AI/ML projects.
  • Attendees mention difficulties such as defining correct answers, running pragmatic improvement cycles, the labor-intensive setup required, and the uniqueness of each evaluation case.
  • Participants express interest in synthetic data for evaluation, especially in scenarios lacking real-world data, and the limitations of current automated evaluation solutions.
  • Evaluation is described as essential in machine learning, analogous to experimentation in science.

The Importance of Evaluation & Methodology 06:01

  • Former Google engineers leading the workshop emphasize that thorough evaluation (eval) methodology is foundational to AI development, reflecting practices from Google's search quality efforts.
  • The process centers on setting up robust benchmarks, calibrating metrics with both human and user data, and iterating over a wide set of metrics (Google Search used around 300 signals).
  • Evaluations are not a one-time task but an ongoing feedback loop embedded into development and production systems.
  • The workshop promises hands-on practice and provides resources through a Slack channel and a detailed Google Doc.

Evaluation Approaches: From Vibe Testing to Scoring Systems 09:27

  • Most teams begin with simple "vibe testing"—manual trial and error, tweaking and observing results.
  • Human evaluations are expensive and manual, while code-based automated evaluations are more common.
  • Large Language Models (LLMs) as judges are attractive but often inaccurate; agent-based systems introduce more complexity.
  • The suggested approach is incremental: start with correlated signals (simple metrics), then iteratively add more as issues are discovered.
  • Emphasis is placed on scoring systems that gather multiple objective metrics and combine them into a single overall score (a minimal sketch follows below).
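
The combined-score idea can be made concrete in a few lines of code. The snippet below is a minimal sketch under stated assumptions, not the workshop's actual spec: the signal functions, the weights, and the meeting-summarizer checks are all illustrative placeholders.

```python
# Minimal sketch: combine a few simple, correlated signals into one overall score.
# The signal functions and weights are illustrative, not the workshop's actual spec.

def signal_has_action_items(summary: str) -> float:
    return 1.0 if "action item" in summary.lower() else 0.0

def signal_title_length_ok(title: str, max_words: int = 20) -> float:
    return 1.0 if len(title.split()) <= max_words else 0.0

def overall_score(title: str, summary: str) -> float:
    # Each signal contributes (value, weight); the overall score is the weighted mean.
    signals = {
        "has_action_items": (signal_has_action_items(summary), 2.0),
        "title_length_ok": (signal_title_length_ok(title), 1.0),
    }
    total_weight = sum(w for _, w in signals.values())
    return sum(v * w for v, w in signals.values()) / total_weight

print(overall_score("Weekly sync on Q3 roadmap", "Action items: ship the eval doc."))
```

The point of starting this simply is that new signals can be appended as issues are discovered, without changing the aggregation.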

Building & Iterating Scoring Systems 15:13

  • The evaluation process is iterative: start simple, identify what’s broken, improve the metrics, and repeat.
  • Workshop participants will set up scoring systems for a meeting summarizer application, identifying relevant signals (criteria/dimensions) and adjusting them via a co-pilot assistant.
  • The co-pilot automates the breakdown of subjective goals into measurable signals, generates both good and bad synthetic examples, and allows for dynamic updates (such as adding a dimension for "title less than 20 words").
  • Weighting of signals (critical, major, minor) is managed mathematically; calibration is performed over time with sample data.
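
As a rough illustration of how severity tiers might map to weights, here is a sketch of a scoring spec for the meeting-summarizer example. The `Dimension` structure, the tier-to-weight mapping, and the example checks are assumptions made for illustration; in the workshop the co-pilot generates and manages the real spec.

```python
from dataclasses import dataclass
from typing import Callable

# Assumed mapping from severity tier to numeric weight.
WEIGHTS = {"critical": 4.0, "major": 2.0, "minor": 1.0}

@dataclass
class Dimension:
    name: str
    severity: str                   # "critical" | "major" | "minor"
    check: Callable[[str], float]   # returns a score in [0, 1]

def score(dimensions: list[Dimension], output: str) -> float:
    # Weighted mean of per-dimension scores.
    weighted = [(d.check(output), WEIGHTS[d.severity]) for d in dimensions]
    return sum(v * w for v, w in weighted) / sum(w for _, w in weighted)

dims = [
    Dimension("covers_key_decisions", "critical",
              lambda s: 1.0 if "decision" in s.lower() else 0.0),
    Dimension("title_under_20_words", "minor",
              lambda s: 1.0 if len(s.splitlines()[0].split()) < 20 else 0.0),
]
print(score(dims, "Roadmap review\nDecision: ship v2 next week."))
```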

Integrating Scoring Systems with Workflows 23:43

  • Participants will export their scoring criteria to Google Sheets, apply them to real examples, and assess metric alignment with user feedback (thumbs up/down) using a confusion matrix.
  • The spreadsheet integration allows interactive experimentation—changing metrics on the fly and seeing impact on alignment.
  • The system is designed for high computational efficiency, running 20+ dimensions per input in under 50 milliseconds.
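
One way to compute that alignment check is sketched below, assuming the exported sheet yields a column of overall scores and a column of thumbs up/down labels; the sample values and the 0.7 threshold are hypothetical.

```python
from collections import Counter

# Hypothetical columns pulled back from the exported sheet.
scores = [0.91, 0.42, 0.78, 0.30, 0.88]    # overall scores from the scoring system
thumbs = [True, False, True, True, False]  # user feedback: thumbs up (True) / down (False)
THRESHOLD = 0.7                            # score above which the metric predicts "good"

matrix = Counter()
for score, thumb_up in zip(scores, thumbs):
    predicted_good = score >= THRESHOLD
    matrix[(predicted_good, thumb_up)] += 1

# (True, True) and (False, False) are agreements; (True, False) and (False, True)
# are the disagreements worth investigating when tuning metrics or the threshold.
print(dict(matrix))
```

Changing a metric or the threshold and recomputing the matrix is the interactive loop the spreadsheet integration is meant to support.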

Evaluation at Scale and Best Practices 30:29

  • For production scenarios with tens/hundreds of thousands of examples, best practice is to use batch processing or online evaluation workflows.
  • Unlike generative LLM judges, the scorers use bidirectional attention architectures with a regression head, which makes them fast, precise, and stable enough for large-scale deployment.
  • Calibration methods (using thumbs up/down data) fine-tune the importance of various signals automatically.
  • Current support is for English and a few other languages, with plans for expansion; multimodal capabilities are on the roadmap.
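
The calibration step is described only at a high level, so the sketch below illustrates the general idea rather than the product's actual method: fit weights over per-dimension scores so that the combined score predicts thumbs up/down, here using scikit-learn's logistic regression on made-up data.

```python
# Illustrative calibration sketch (not the product's actual method): learn how much
# each signal should count by fitting it against user thumbs up/down labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows = examples, columns = per-dimension scores in [0, 1]; labels = thumbs up (1) / down (0).
X = np.array([[0.9, 0.8, 0.7],
              [0.2, 0.9, 0.4],
              [0.8, 0.3, 0.9],
              [0.1, 0.2, 0.3]])
y = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print("learned signal weights:", model.coef_[0])       # larger weight = more important signal
print("calibrated P(good):", model.predict_proba(X)[:, 1])
```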

Hands-On: Colab and Model Comparison 36:00

  • The workshop’s second phase focuses on using a Colab notebook to load data, apply the scoring spec, and generate confusion matrices—all with sample datasets from Hugging Face.
  • Users compare different model versions (e.g., v1.5 vs v2.5) and prompts using their custom metrics, observing score differences across model outputs.
  • The system also supports online response optimization: generating multiple outputs per input, scoring each, and selecting the highest quality response—demonstrating improved overall scores as more samples are considered.
  • The Colab environment is set up for easy experimentation and supports batch processes, model selection, and metric development.
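
The online response-optimization loop (best-of-N selection) can be sketched as follows; `generate` and `score` are stand-ins for the model call and the scoring system, and the random scoring is purely illustrative.

```python
# Sketch of best-of-N response selection: generate several candidates, score each,
# and return the best one. Both helper functions are placeholders.
import random

def generate(prompt: str) -> str:      # stand-in for an LLM call
    return f"{prompt} -> candidate {random.randint(0, 999)}"

def score(response: str) -> float:     # stand-in for the scoring system
    return random.random()

def best_of_n(prompt: str, n: int = 8) -> tuple[str, float]:
    candidates = [generate(prompt) for _ in range(n)]
    best_score, best_response = max((score(c), c) for c in candidates)
    return best_response, best_score

response, s = best_of_n("Summarize the weekly sync", n=8)
print(f"best of 8 scored {s:.2f}: {response}")
```

As the workshop notes, the expected quality of the selected response rises as more samples are considered, at the cost of extra generation and scoring calls.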

Closing and Resources 40:18

  • Workshop documents and resources are shared via Google Doc and Slack for continued learning and collaboration.
  • Participants are encouraged to experiment further in their own time using the provided Colab notebook, datasets, and scoring system tools.