2025 is the Year of Evals! Just like 2024, and 2023, and … — John Dickerson, CEO Mozilla AI

Introduction and Background 00:00

  • The speaker, now CEO of Mozilla AI, was previously at Arthur AI, a company focused on observability, evaluation, and security in AI and ML.
  • Mozilla AI is committed to supporting open source AI tooling and empowering the open source community.

Why 2025 is the "Year of Evaluations" 00:43

  • Three converging forces are driving the importance of evaluations (evals): growing system autonomy, increased enterprise awareness of AI, and previously frozen IT budgets now being redirected toward AI projects.
  • AI/ML monitoring and evaluation are intrinsically linked: both depend on measuring system performance.
  • AI became broadly understood across the enterprise after ChatGPT's launch, coinciding with budget freezes that left only pet projects like GenAI funded.
  • The shift from ML models providing outputs to autonomous/agentic systems taking action increases the stakes for proper evaluation.

Evolution of Enterprise AI Investment 04:19

  • Prior to late 2022, ML monitoring was established but usually limited to technical teams; impact on business KPIs was often acknowledged but not deeply prioritized.
  • Selling AI/ML solutions often struggled to gain attention beyond the CIO's office; budget and attention more often went to issues like security or latency.
  • Despite the narrative that a CEO could get fired over an ML blunder, that scenario never materialized, and overall AI investment at large organizations (e.g., JPMC's $100M over several years) was relatively small.
  • Economic fears in late 2022 led to widespread IT budget freezes for 2023, which might have stalled innovation if not for the arrival of ChatGPT.

The ChatGPT Effect and Budget Shifts 07:38

  • ChatGPT's launch at the end of 2022 captured the attention of senior executives, who experienced AI firsthand and became advocates for AI-driven projects.
  • Discretionary budget was unlocked for GenAI projects, with most new project funding in 2023 directed to GenAI pilots and science projects.

Transition to Production and Scaling Concerns 09:33

  • By 2024, GenAI applications were moving into production in the form of internal chat apps, hiring tools, and more.
  • As deployments scaled, business leaders began focusing on ROI, governance, risk, compliance, and brand impact, necessitating strong evaluation frameworks.
  • CFOs now demand quantitative risk and performance estimates, making evaluation an even higher priority.

2025: Maturity, Adoption, and Opportunities for Evaluation 10:25

  • AI budgets are increasing, science projects are moving to production, and 2025 is expected to see scaling and broad enterprise adoption.
  • The pace of AI model and product development has accelerated, with open source, venture capital, and big tech all contributing.
  • AI systems are increasingly autonomous, making robust evaluation essential for risk mitigation and business alignment.

The New Enterprise Stakeholder Landscape 12:08

  • Evaluation is now a key topic at all levels of enterprise leadership: CEOs, CFOs, CISOs, CIOs, CTOs.
  • CEOs are more knowledgeable and engaged with AI, influencing CFOs and others to prioritize investment and discussion.
  • CISOs have already started adopting tools for security-specific concerns (e.g., hallucination detection, prompt injection mitigation).
  • CIOs and CTOs require standardized metrics for decision-making, promoting evaluation as industry best practice.

Monitoring Multi-Agent Systems and Industry Trends 14:48

  • Companies in monitoring, observability, and security have shifted focus to encompass multi-agent system evaluation.
  • Monitoring the entire system, rather than individual models or agents in isolation, is increasingly being adopted by both industry and government (see the sketch after this list).
  • Recent reports on startup revenue in the evaluation space are outdated; the field is growing faster than previous data suggests.
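
The talk doesn't prescribe an implementation, but a minimal sketch of what system-level monitoring could look like follows: a two-agent pipeline where every step, plus the end-to-end run, is recorded in one trace, so evaluation can target the whole system rather than a single model. The agent functions, field names, and metrics are illustrative assumptions, not details from the talk.

```python
import time
import uuid
from dataclasses import dataclass, field

# Hypothetical two-step pipeline: a "research" agent followed by a "writer" agent.
# The point is that each step AND the end-to-end run get their own trace record,
# so evaluation can cover the whole system, not one model in isolation.

@dataclass
class StepRecord:
    agent: str
    input_text: str
    output_text: str
    latency_s: float

@dataclass
class RunTrace:
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    steps: list[StepRecord] = field(default_factory=list)

def traced(trace: RunTrace, agent_name: str, fn, text: str) -> str:
    """Run one agent step and append a record to the system-level trace."""
    start = time.perf_counter()
    output = fn(text)
    trace.steps.append(StepRecord(agent_name, text, output, time.perf_counter() - start))
    return output

# Stand-in agents; in a real system these would call models or tools.
research_agent = lambda q: f"notes about: {q}"
writer_agent = lambda notes: f"summary based on: {notes}"

trace = RunTrace()
answer = traced(trace, "writer", writer_agent,
                traced(trace, "research", research_agent, "Q3 revenue drivers"))

# End-to-end evaluation happens on the full trace, not a single step.
print(trace.run_id, len(trace.steps), answer)
```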

Challenges: Domain Expertise and Human-in-the-loop Evaluation 16:18

  • Evaluations in generative AI, especially in domains like finance, require human experts because the tasks are complex and unstructured (e.g., discounted cash flow analysis).
  • Expensive expert validation is currently necessary, especially for high-stakes use cases; a minimal triage sketch follows this list.
  • The importance of high-quality datasets and evaluation environments is highlighted, with significant investment in these competitive assets.
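
As a rough illustration of how costly expert validation tends to be rationed, here is a minimal triage sketch: cheap automated checks run on every item, and only high-stakes or low-confidence outputs are escalated to a human expert. The thresholds, field names, and example tasks are assumptions for illustration, not details from the talk.

```python
from dataclasses import dataclass

@dataclass
class EvalItem:
    task: str                # e.g. "discounted cash flow analysis"
    model_output: str
    automated_score: float   # 0.0-1.0 from a cheap automated check
    high_stakes: bool

def needs_expert_review(item: EvalItem, score_threshold: float = 0.8) -> bool:
    """Escalate when the task is high stakes or the automated check is not confident."""
    return item.high_stakes or item.automated_score < score_threshold

items = [
    EvalItem("DCF analysis for an acquisition target", "NPV = ...", 0.92, True),
    EvalItem("summarize an internal memo", "The memo covers ...", 0.95, False),
    EvalItem("classify a support ticket", "billing issue", 0.55, False),
]

expert_queue = [i for i in items if needs_expert_review(i)]
print(f"{len(expert_queue)} of {len(items)} items routed to expert review")
```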

LLMs as Evaluators and Future Trends 18:17

  • Using LLMs as judges for evaluation (the "LLM-as-a-judge" paradigm) is common practice, despite known biases relative to human evaluators; a minimal sketch follows this list.
  • LLM-based evaluation helps with dataset creation but requires careful validation to avoid propagating flaws or biases.
  • The field is actively researching and addressing these limitations, with human-in-the-loop validation currently essential.
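
A minimal sketch of the LLM-as-a-judge pattern described above, assuming an OpenAI-compatible chat client; the judge model, rubric, and 1-to-5 scale are illustrative choices, not details from the talk. In practice, judge scores should themselves be spot-checked against human ratings to catch the biases mentioned above.

```python
from openai import OpenAI  # any OpenAI-compatible client; the judge model below is an assumption

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (unusable) to 5 (excellent) for factual accuracy
and relevance. Reply with only the integer score."""

def judge(question: str, answer: str, model: str = "gpt-4o-mini") -> int:
    """Ask a judge LLM to score an answer; scores should be validated against human ratings."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())

score = judge("What year did ChatGPT launch?", "ChatGPT launched at the end of 2022.")
print(score)
```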