How to run Evals at Scale: Thinking beyond Accuracy or Similarity — Muktesh Mishra, Adobe

Introduction to Evals and Their Importance 00:21

  • Evals are a key component in AI application development, necessary for crafting and improving systems.
  • A core challenge is testing AI applications whose outputs are nondeterministic and whose quality judgments are often subjective.
  • Evals provide the required tools and metrics to measure AI model performance and accuracy.
  • Measurement is crucial for business impact, aligning applications with goals, continuous improvement, trust, and accountability.

Data-Centric Evaluation Strategies 03:18

  • The process starts with sourcing data, often beginning small with synthetic (artificially generated) data.
  • Evaluation is a continuous improvement process, requiring ongoing observation and refinement of data sets.
  • Proper data labeling is essential to cover multiple application flows and aspects.
  • Multiple specialized data sets are recommended for comprehensive coverage of different flows or goals; a minimal sketch of per-flow datasets follows this list.
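
A minimal sketch of what small, labeled, per-flow eval data could look like, assuming a simple Python harness. The `EvalCase` fields, flow names, and synthesis helper are illustrative assumptions, not artifacts from the talk:

```python
# Minimal sketch: start small with synthetic cases, keep one dataset per flow.
# All names here (EvalCase, flow identifiers) are illustrative.
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    input: str                                      # prompt or request sent to the system
    expected: str                                   # reference answer or label
    flow: str                                       # which application flow this case exercises
    tags: list[str] = field(default_factory=list)   # e.g. ["negative", "edge-case"]


def synthesize_seed_cases(per_flow: int = 20) -> list[EvalCase]:
    """Start small: generate a handful of synthetic cases per flow."""
    flows = ["qa", "summarization", "tool_use"]
    return [
        EvalCase(
            input=f"synthetic prompt {i} for {flow}",
            expected=f"reference output {i}",
            flow=flow,
        )
        for flow in flows
        for i in range(per_flow)
    ]


def split_by_flow(cases: list[EvalCase]) -> dict[str, list[EvalCase]]:
    """Keep specialized datasets per flow rather than one monolithic set."""
    datasets: dict[str, list[EvalCase]] = {}
    for case in cases:
        datasets.setdefault(case.flow, []).append(case)
    return datasets


if __name__ == "__main__":
    for flow, cases in split_by_flow(synthesize_seed_cases()).items():
        print(flow, len(cases))
```

Keeping each flow's cases in its own set makes it easier to label, grow, and refine them independently as the continuous-improvement loop runs.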

Designing and Adapting Evals 04:54

  • Clearly define evaluation goals and objectives for each system component.
  • Modularize components and optimize data handling for distinct flows.
  • Each application flow or output path must be evaluated systematically.
  • There is no universal eval; methods must be adapted to the specific application type (e.g., RAG, code generation, Q&A).
  • Evaluation targets can differ: accuracy, similarity, usefulness, functional correctness, robustness, etc.
  • For agents, trajectory and multi-turn simulation evaluations are necessary to capture complexity and the range of conversation paths.
  • Tool-use cases require correctness checks and thorough test suites (see the sketch after this list).
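
To make the bullets above concrete, here is a rough sketch of flow-specific evaluators plus a simple agent-trajectory check. The metric choices (exact match, textual similarity, a stubbed functional-correctness check) and the in-order subsequence scoring are assumptions for illustration, not the evaluators described in the talk:

```python
# Rough sketch: each flow gets its own evaluator, plus a simple agent-trajectory
# check. Metric choices and names are illustrative, not from the talk.
from difflib import SequenceMatcher
from typing import Callable


def exact_match(output: str, expected: str) -> float:
    """Strict accuracy, e.g. for closed-form Q&A."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def similarity(output: str, expected: str) -> float:
    """Loose textual similarity, e.g. for free-form RAG answers."""
    return SequenceMatcher(None, output, expected).ratio()


def functional_correctness(output: str, expected: str) -> float:
    """Placeholder for code generation: a real harness would run the candidate
    against tests in a sandbox; here it only checks a marker string."""
    return 1.0 if expected in output else 0.0


# There is no universal eval: each flow maps to its own target.
EVALUATORS: dict[str, Callable[[str, str], float]] = {
    "qa": exact_match,
    "rag": similarity,
    "codegen": functional_correctness,
}


def evaluate_trajectory(actual_steps: list[str], expected_steps: list[str]) -> float:
    """Agent check: fraction of the expected tool-call sequence that appears,
    in order, within the observed trajectory."""
    matched = 0
    for step in actual_steps:
        if matched < len(expected_steps) and step == expected_steps[matched]:
            matched += 1
    return matched / len(expected_steps) if expected_steps else 1.0


if __name__ == "__main__":
    print(EVALUATORS["rag"]("Paris is the capital of France.",
                            "The capital of France is Paris."))
    print(evaluate_trajectory(["search", "read", "answer"], ["search", "answer"]))
```

The point of the registry is that each flow declares its own eval target; swapping in an LLM judge or a real code sandbox changes the function, not the structure.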

Scaling Evals and Process Strategies 07:03

  • Employ orchestration and parallelism to manage and run evals at scale, with frequent and regular execution (a parallel-execution sketch follows this list).
  • Aggregate intermediate and regression results for broader analysis.
  • Measurement should follow a cyclical process: measure, monitor, analyze, and repeat.
  • Strategy selection depends on use case; there is no fixed eval method.
  • Trade-offs exist between human-in-the-loop fidelity and automated speed; a balanced approach is suggested.
  • Because automation has limits, the evaluation process itself should take precedence over reliance on any specific tool.
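
As a rough illustration of orchestration, parallelism, and aggregation, the sketch below fans cases out across a thread pool and rolls scores up per flow. `run_case`, the stubbed model call, and the case format are hypothetical stand-ins for whatever your harness and scheduler provide:

```python
# Rough sketch: fan eval cases out in parallel, then aggregate per flow.
# run_case, the stubbed model call, and the case format are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_case(case: dict) -> dict:
    """Call the system under test and score one case (stubbed here)."""
    output = case["input"].upper()          # stand-in for a real model call
    score = 1.0 if case["expected"] in output else 0.0
    return {"flow": case["flow"], "score": score}


def run_suite(cases: list[dict], max_workers: int = 8) -> dict[str, float]:
    """Run cases concurrently and aggregate scores per flow for broader analysis."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_case, cases))
    by_flow: dict[str, list[float]] = {}
    for result in results:
        by_flow.setdefault(result["flow"], []).append(result["score"])
    return {flow: mean(scores) for flow, scores in by_flow.items()}


if __name__ == "__main__":
    cases = [
        {"input": "hello world", "expected": "HELLO", "flow": "qa"},
        {"input": "tool call demo", "expected": "TOOL", "flow": "tool_use"},
    ]
    # In practice this would run on a schedule (per commit, nightly) and the
    # aggregates compared against previous runs to catch regressions.
    print(run_suite(cases))
```

Running this on a regular cadence and comparing per-flow aggregates against earlier runs is one way to close the measure, monitor, analyze loop described above.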

Key Takeaways and Best Practices 08:27

  • Evals are fundamental to robust AI application development.
  • The "eval development" paradigm emerges, akin to test-driven development, but focused on crafting evals for specific use cases.
  • Both positive and negative scenarios must be included in evaluations.
  • Data quality and coverage remain the central priority.
  • Teams must continuously monitor, analyze, and iterate on eval results for ongoing improvement.
  • Balancing fidelity and speed, and adapting the methodology to your needs, is critical.
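
To illustrate the "eval development" idea, here is a minimal, test-style sketch with one positive and one negative scenario. The `system_under_test` stub, the assertions, and the use of `pytest` are assumptions for the example, not the tooling from the talk:

```python
# Minimal sketch of "write the eval first", in the spirit of test-driven
# development. The system_under_test stub and assertions are invented for
# the example; pytest is assumed to be available.
import pytest


def system_under_test(prompt: str) -> str:
    """Stand-in for the real application."""
    if "refund" in prompt.lower():
        return "I can help you start a refund request."
    return "I'm not able to help with that."


# Positive scenario: the flow we expect to succeed.
def test_refund_flow_positive():
    answer = system_under_test("How do I get a refund?")
    assert "refund" in answer.lower()


# Negative scenario: the system should decline out-of-scope requests
# rather than improvise an answer.
def test_out_of_scope_negative():
    answer = system_under_test("Write me a poem about the ocean.")
    assert "not able" in answer.lower()


if __name__ == "__main__":
    raise SystemExit(pytest.main([__file__, "-q"]))
```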