The session begins with participant introductions and a discussion about motivations for attending, particularly around challenges with evaluation (eval) in AI/ML projects.
Attendees mention difficulties such as defining correct answers, building pragmatic improvement cycles, the labor-intensive setup, and the uniqueness of each evaluation case.
Participants express interest in synthetic data for evaluation, especially in scenarios lacking real-world data, and the limitations of current automated evaluation solutions.
Evaluation is described as essential in machine learning, analogous to experimentation in science.
Former Google engineers leading the workshop emphasize that a thorough evaluation methodology is foundational to AI development, reflecting practices from Google's search quality efforts.
The process centers on setting up robust benchmarks, calibrating metrics with both human and user data, and iterating over a wide set of metrics (Google Search used around 300 signals).
Evaluations are not a one-time task but an ongoing feedback loop embedded into development and production systems.
The workshop promises hands-on practice and provides resources through a Slack channel and a detailed Google Doc.
Evaluation Approaches: From Vibe Testing to Scoring Systems 09:27
Most teams begin with simple "vibe testing"—manual trial and error, tweaking and observing results.
Human evaluations are expensive and manual, while code-based automated evaluations are more common.
Large Language Models (LLMs) as judges are attractive but often inaccurate; agent-based systems introduce more complexity.
The suggested approach is incremental: start with correlated signals (simple metrics), then iteratively add more as issues are discovered.
Emphasis is placed on scoring systems that gather multiple objective metrics and combine them into an overall score.
The evaluation process is iterative: start simple, identify what’s broken, improve the metrics, and repeat.
Workshop participants will set up scoring systems for a meeting summarizer application, identifying relevant signals (criteria/dimensions) and adjusting them via a co-pilot assistant.
The co-pilot automates the breakdown of subjective goals into measurable signals, generates both good and bad synthetic examples, and allows for dynamic updates (such as adding a dimension for "title less than 20 words").
Weighting of signals (critical, major, minor) is managed mathematically; calibration is performed over time with sample data.
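To make the weighting idea concrete, here is a minimal Python sketch of how critical/major/minor tiers could be combined into an overall score; the dimension names, tier weights, and weighting scheme are illustrative assumptions, not the workshop tool's actual implementation.

```python
# Sketch of combining per-dimension signal scores into an overall score.
# Dimension names, weights, and the weighting scheme are assumed for illustration.

SEVERITY_WEIGHTS = {"critical": 3.0, "major": 2.0, "minor": 1.0}  # assumed values

# Each dimension: (severity tier, score in [0, 1] for one summarizer output)
dimensions = {
    "captures_key_decisions": ("critical", 0.9),
    "no_hallucinated_attendees": ("critical", 1.0),
    "title_under_20_words": ("minor", 1.0),
    "action_items_listed": ("major", 0.6),
}

def overall_score(dims: dict) -> float:
    """Severity-weighted average of dimension scores."""
    total_weight = sum(SEVERITY_WEIGHTS[sev] for sev, _ in dims.values())
    weighted = sum(SEVERITY_WEIGHTS[sev] * score for sev, score in dims.values())
    return weighted / total_weight

print(f"Overall score: {overall_score(dimensions):.2f}")
```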
Participants will export their scoring criteria to Google Sheets, apply them to real examples, and assess metric alignment with user feedback (thumbs up/down) using a confusion matrix.
The spreadsheet integration allows interactive experimentation—changing metrics on the fly and seeing impact on alignment.
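As a rough illustration of the alignment check, the sketch below builds a confusion matrix from metric scores and thumbs up/down feedback; the threshold and the example data are assumptions for demonstration, not the spreadsheet's real contents.

```python
# Sketch of checking metric alignment with user feedback via a confusion matrix.
from collections import Counter

# (overall metric score, user feedback) pairs; data is made up for illustration
examples = [(0.92, "up"), (0.41, "down"), (0.78, "up"),
            (0.66, "down"), (0.85, "up"), (0.30, "down")]

THRESHOLD = 0.7  # assumed cutoff: score >= threshold predicts a thumbs-up

cells = Counter()
for score, feedback in examples:
    predicted = "up" if score >= THRESHOLD else "down"
    cells[(predicted, feedback)] += 1

for (pred, actual), n in sorted(cells.items()):
    print(f"predicted={pred:>4}  actual={actual:>4}  count={n}")
```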
The system is designed for high computational efficiency, running 20+ dimensions per input in under 50 milliseconds.
For production scenarios with tens/hundreds of thousands of examples, best practice is to use batch processing or online evaluation workflows.
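A minimal sketch of the batch pattern, with a hypothetical score_batch() standing in for whatever batch API the scoring service actually exposes:

```python
# Sketch of scoring a large dataset in batches rather than one call per example.

def score_batch(batch):
    # placeholder: in practice this would be a single call to the scoring service
    return [0.5 for _ in batch]

def score_all(examples, batch_size=256):
    scores = []
    for start in range(0, len(examples), batch_size):
        scores.extend(score_batch(examples[start:start + batch_size]))
    return scores

print(len(score_all(["example input"] * 1000)))  # 1000 scores from 4 batch calls
```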
The scorers are built for speed and large-scale deployment, leveraging bidirectional attention architectures and a regression head for high precision and stability, unlike generative LLM judges.
Calibration methods (using thumbs up/down data) fine-tune the importance of various signals automatically.
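One plausible way to realize this kind of calibration is to fit a simple model that maps per-dimension scores to thumbs up/down outcomes and read the learned coefficients as signal importances; the sketch below uses scikit-learn logistic regression with fabricated example data and is not necessarily the method used by the workshop's tooling.

```python
# Sketch of calibrating signal importances from thumbs up/down feedback.
# Dimension names and data are fabricated for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows: per-example scores for three hypothetical dimensions in [0, 1]
X = np.array([
    [0.9, 0.8, 1.0],
    [0.4, 0.3, 0.9],
    [0.8, 0.9, 0.2],
    [0.2, 0.5, 0.4],
])
y = np.array([1, 0, 1, 0])  # 1 = thumbs up, 0 = thumbs down

model = LogisticRegression().fit(X, y)
# Learned coefficients suggest how strongly each dimension drives user satisfaction
print(dict(zip(["relevance", "faithfulness", "brevity"], model.coef_[0].round(2))))
```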
Current support is for English and a few other languages, with plans for expansion; multimodal capabilities are on the roadmap.
The workshop’s second phase focuses on using a Colab notebook to load data, apply the scoring spec, and generate confusion matrices—all with sample datasets from Hugging Face.
Users compare different model versions (e.g., v1.5 vs v2.5) and prompts using their custom metrics, observing score differences across model outputs.
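A sketch of that comparison loop, with score() standing in for the custom scoring spec and the model outputs reduced to placeholder strings:

```python
# Sketch of comparing two model versions under the same scoring spec.
from statistics import mean

def score(output: str) -> float:
    # placeholder for applying the custom scoring spec to one output
    return min(1.0, len(output.split()) / 50)

# placeholder outputs; in the workshop these come from the two model versions
outputs_v1 = ["short summary from the older model ..."]
outputs_v2 = ["a somewhat longer, more detailed summary from the newer model ..."]

print("v1.5 mean score:", round(mean(score(o) for o in outputs_v1), 3))
print("v2.5 mean score:", round(mean(score(o) for o in outputs_v2), 3))
```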
The system also supports online response optimization: generating multiple outputs per input, scoring each, and selecting the highest-quality response, with overall scores improving as more samples are considered.
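The best-of-N pattern can be sketched as follows; generate() and score() are hypothetical stand-ins for the actual model call and scoring spec.

```python
# Sketch of online response optimization via best-of-N sampling: generate several
# candidates per input, score each, and keep the best one.
import random

def generate(prompt: str) -> str:
    # placeholder for a stochastic model call
    return f"candidate summary {random.randint(0, 999)} for: {prompt[:30]}"

def score(output: str) -> float:
    # placeholder for the scoring spec
    return random.random()

def best_of_n(prompt: str, n: int = 4) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

print(best_of_n("Summarize the weekly planning meeting transcript ..."))
```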
The Colab environment is set up for easy experimentation and supports batch processes, model selection, and metric development.