AI Engineer World's Fair 2025 - Evals

Introduction to Evals 00:00

  • The session focuses on the importance of evaluations (evals) in AI systems, especially in the context of emerging generative AI technologies.
  • The speaker emphasizes the need for companies to rapidly adapt their products to new AI models and incorporate user feedback effectively.

The Evolution of AI Eval Practices 02:00

  • Prior to the launch of ChatGPT, machine learning monitoring was often disconnected from business needs.
  • The introduction of generative AI technologies has shifted the conversation, leading to increased interest from CEOs and CFOs in AI evaluation.

Key Signs of Effective Evals 06:45

  • Successful organizations can quickly incorporate new AI models into their products, ideally within 24 hours.
  • Companies should have a clear process for converting user complaints into actionable evals that continuously improve their systems (a minimal sketch follows this list).
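
The session doesn't show code for this pipeline; as one way it could look, here is a minimal sketch of turning a single complaint into a permanent regression eval. EvalCase, run_eval, and the ticket details are hypothetical names, not from the talk.

```python
# A minimal sketch of converting a user complaint into an eval case,
# assuming a simple in-house harness; all names here are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    input: str                       # the prompt that triggered the complaint
    check: Callable[[str], bool]     # pass/fail check derived from the complaint
    source: str                      # provenance, e.g. a support-ticket ID

# Complaint "ticket-4821: bot never mentions the refund window"
# becomes a test case that guards against regression.
case = EvalCase(
    input="Can I return my order after two weeks?",
    check=lambda out: "refund" in out.lower() and "30-day" in out,
    source="ticket-4821",
)

def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Run every case through the model and return the pass rate."""
    passed = sum(c.check(model(c.input)) for c in cases)
    return passed / len(cases)
```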

Engineering Great Evals 10:30

  • Evals must be purposefully engineered rather than relying on synthetic data or generic scoring systems.
  • Eval datasets should be aligned with real user experiences; the speaker advocates continuously reconciling test data with the scenarios users actually encounter (a reconciliation sketch follows this list).
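
One hedged illustration of that reconciliation: periodically sample real production interactions, especially ones users flagged, into the eval dataset so it tracks what users actually ask rather than synthetic prompts. The log format and field names below are assumptions for the sketch, not from the talk.

```python
# A minimal sketch of reconciling an eval dataset with production traffic,
# assuming interactions are logged as JSON lines; fields are illustrative.
import json
import random

def sample_production_cases(log_path: str, k: int = 20) -> list[dict]:
    """Sample real user interactions, preferring flagged (thumbs-down) ones."""
    with open(log_path) as f:
        logs = [json.loads(line) for line in f]
    flagged = [r for r in logs if r.get("thumbs_down")]
    pool = flagged or logs
    return random.sample(pool, min(k, len(pool)))

def merge_into_dataset(dataset: list[dict], new_cases: list[dict]) -> list[dict]:
    """Deduplicate on input text so the dataset grows without drifting."""
    seen = {c["input"] for c in dataset}
    return dataset + [c for c in new_cases if c["input"] not in seen]
```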

Importance of Context in AI 14:00

  • How tools within an AI system are defined, and how their outputs are formatted, both materially affect performance and should be designed deliberately rather than left to defaults.
  • Effective scoring systems combine deterministic, code-based checks with LLM-as-judge grading, tailored to the specific application (see the sketch after this list).
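
As a sketch of the two scorer families the talk names, the snippet below pairs a deterministic code-based check with an LLM-as-judge grader. Here judge_model stands in for whatever model client the team already uses; it is an assumption, not a specific vendor API.

```python
# A minimal sketch pairing a code-based scorer with an LLM-as-judge scorer.
import re
from typing import Callable

def code_scorer(output: str) -> float:
    """Deterministic check: the answer must contain a valid ISO date."""
    return 1.0 if re.search(r"\d{4}-\d{2}-\d{2}", output) else 0.0

JUDGE_PROMPT = """Rate the following answer for factual accuracy on a
scale of 0 to 1. Reply with only the number.

Answer: {answer}"""

def llm_judge_scorer(output: str, judge_model: Callable[[str], str]) -> float:
    """Subjective check: delegate graded judgment to a judge model."""
    reply = judge_model(JUDGE_PROMPT.format(answer=output))
    try:
        return max(0.0, min(1.0, float(reply.strip())))
    except ValueError:
        return 0.0  # treat an unparseable judge reply as a failure
```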

Preparing for New AI Models 18:30

  • Organizations should be ready to pivot when new AI models are released, ensuring their systems can quickly adapt to leverage the latest advancements (a model-comparison sketch follows).
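
A minimal sketch of what "ready to pivot" can mean in practice: score the incumbent and the candidate model on the same eval suite before switching. call_model and the model names are placeholders, not from the session.

```python
# A minimal sketch of a model-swap check: when a new model ships, run the
# existing eval suite against it before adopting it.
from typing import Callable

def compare_models(
    cases: list[dict],
    scorer: Callable[[str], float],
    call_model: Callable[[str, str], str],  # (model_name, input) -> output
    baseline: str = "current-model",
    candidate: str = "new-model",
) -> dict[str, float]:
    """Score both models on the same eval set to decide whether to pivot."""
    scores = {}
    for name in (baseline, candidate):
        results = [scorer(call_model(name, c["input"])) for c in cases]
        scores[name] = sum(results) / len(results)
    return scores

# Usage idea: adopt the candidate only if it does not regress.
#   scores = compare_models(cases, code_scorer, call_model)
#   if scores["new-model"] >= scores["current-model"]: switch over.
```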

System Optimization for Evals 22:00

  • Enhancing the entire AI system, including tasks and scoring functions, is crucial for improving eval performance.
  • The speaker introduces a feature called "Loop," which uses AI to auto-optimize eval tasks, simplifying the process of improving AI applications (a generic illustration of the idea follows this list).
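
Loop itself is a product feature whose internals the session doesn't detail. Purely as a generic illustration of eval-driven optimization, the sketch below picks the best of several candidate prompts by mean eval score; all names are hypothetical, and this is not Loop's API.

```python
# A generic sketch of eval-driven optimization: try candidate prompts
# against the eval set and keep the one with the highest mean score.
from typing import Callable

def optimize_prompt(
    candidates: list[str],
    cases: list[dict],
    run_task: Callable[[str, str], str],   # (prompt, input) -> output
    scorer: Callable[[str], float],
) -> tuple[str, float]:
    """Return the prompt variant that maximizes the mean eval score."""
    best_prompt, best_score = candidates[0], -1.0
    for prompt in candidates:
        scores = [scorer(run_task(prompt, c["input"])) for c in cases]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_prompt, best_score = prompt, mean
    return best_prompt, best_score
```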

Conclusion 25:45

  • The closing message emphasizes using evals to drive product development decisions grounded in user data and feedback.
  • The session concludes with an invitation for further questions and discussion on the topic.

Q&A Session 27:30

  • Attendees engage with the speaker, asking questions about specific applications of evals, the integration of human feedback, and the future direction of AI evaluation practices.