Engineering Better Evals: Scalable LLM Evaluation Pipelines That Work — Dat Ngo, Aman Khan, Arize

Introduction to Arize AI & LLM Evals 00:18

  • Dat Ngo, AI Architect at Arize AI, introduces the company as a leading player in AI observability and evaluations.
  • Arize AI works with major AI teams such as Reddit and Duolingo, which gives it visibility into how they build, where the pain points are, and how they productionize.
  • Duolingo, for example, runs approximately 20 evaluations per trace, which makes understanding and optimizing those evals a significant cost concern.

Core Concepts: Observability and Evaluations 02:16

  • Observability answers "what is the thing I built actually doing?" and can involve traces, spans, conversations, or analytics.
  • LLM teams are typically split into two niches: platform teams (owning infrastructure, caring about cost and latency) and business-side LLM teams (building applications, focusing on evaluations).
  • Evaluations (Evals) are crucial because manually inspecting every trace is not scalable; they serve as a "clever word for signal" to understand what's working and what's not.
  • LLM as a Judge involves using an LLM to provide feedback on any process, including an LLM process (e.g., evaluating RAG relevance by comparing retrieved context to a query; see the sketch after this list).
  • Beyond LLM as a judge, other tools for evals include encoder-only BERT-type models (roughly 10x cheaper and 1-2 orders of magnitude faster) and human feedback (user feedback, golden datasets); an encoder-based sketch also follows this list.
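
A minimal sketch of an LLM-as-judge relevance check along the lines described above, assuming an OpenAI-style chat completions client; the prompt template, model name, and label set are illustrative, not Arize's actual templates:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_TEMPLATE = """You are grading a RAG retrieval step.
Query: {query}
Retrieved context: {context}
Is the retrieved context relevant to the query?
Answer with exactly one word: relevant or irrelevant."""

def relevance_eval(query: str, context: str) -> str:
    """LLM as a judge: label whether the retrieved context is relevant to the query."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # any reasonably capable judge model
        messages=[{"role": "user",
                   "content": JUDGE_TEMPLATE.format(query=query, context=context)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(relevance_eval("What does Arize AI do?",
                     "Arize AI is an AI observability and evaluation platform."))
```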
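
And a sketch of the cheaper encoder-only alternative mentioned above, using a public cross-encoder from sentence-transformers; the specific checkpoint and decision threshold are illustrative assumptions:

```python
from sentence_transformers import CrossEncoder

# Small encoder-only reranker; no generation step, so it is far cheaper and faster
# than an LLM judge (the exact speed/cost gap depends on the models compared).
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def relevance_score(query: str, context: str) -> float:
    """Return a relevance score for (query, context); higher means more relevant."""
    return float(model.predict([(query, context)])[0])

score = relevance_score("What does Arize AI do?",
                        "Arize AI is an AI observability and evaluation platform.")
print("relevant" if score > 0.0 else "irrelevant", score)  # threshold is an assumption
```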

Advanced Eval Strategies and Virtuous Cycles 06:17

  • A "pro tip" for using golden datasets is to run an LLM as a judge on them to quantify and tune the LLM's ability to approximate trusted human grades.
  • Heuristics or code-based logic can also be used for evals, being infinitely cheaper and faster for tasks like checking for keywords, regex patterns, or parsable JSON.
  • The process of building a quality AI product involves a "left-hand cycle": collecting data (observability), running evals to discern signal, identifying failure areas (e.g., hallucination), annotating data, and updating prompts or models.
  • A "right-hand cycle" focuses on tuning the evals themselves: collecting instances where the eval was wrong and improving the eval prompt template for better specificity.
  • The speaker emphasizes that faster iteration through these cycles (e.g., 4 iterations a month instead of 2) leads to an exponentially better AI product.
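
To make the golden-dataset "pro tip" concrete: run the judge over a set of human-graded examples and measure agreement, then tune the judge's prompt template until the number is acceptable. A minimal sketch; the dataset rows and the stand-in judge are hypothetical:

```python
# Golden dataset: examples with labels a trusted human has already assigned.
golden = [
    {"query": "How do I reset my password?",
     "context": "To reset your password, go to Settings > Security...",
     "label": "relevant"},
    {"query": "How do I reset my password?",
     "context": "Our Q3 earnings grew 12% year over year.",
     "label": "irrelevant"},
]

def judge_agreement(dataset, judge) -> float:
    """Fraction of golden examples where the judge's label matches the human grade.

    `judge` is any callable (query, context) -> label, e.g. the LLM-as-judge
    sketch earlier or an encoder-based scorer mapped to labels.
    """
    hits = sum(judge(row["query"], row["context"]) == row["label"] for row in dataset)
    return hits / len(dataset)

# Trivial stand-in judge for illustration; swap in a real one for a meaningful number.
print(f"{judge_agreement(golden, judge=lambda q, c: 'relevant'):.0%}")
```

If agreement is low, that is a signal to iterate on the judge's prompt template (the "right-hand cycle" above) before trusting its scores at scale.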
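
Code-based evals like the ones mentioned above need no model call at all; a minimal sketch of keyword, regex, and JSON-parsability checks (the specific patterns are illustrative):

```python
import json
import re

def contains_keyword(output: str, keyword: str) -> bool:
    """Cheapest possible eval: does the response mention a required term?"""
    return keyword.lower() in output.lower()

def matches_pattern(output: str, pattern: str = r"ORD-\d{6}") -> bool:
    """Regex eval, e.g. the response must cite an order ID in the expected format."""
    return re.search(pattern, output) is not None

def is_parsable_json(output: str) -> bool:
    """Structural eval: can the response be loaded as JSON at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

print(is_parsable_json('{"status": "ok"}'))                # True
print(matches_pattern("Your order ORD-123456 shipped."))   # True
```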

Eval Architectures and Agent Evals 10:33

  • Evals can be applied at different levels of complexity: individual components within a trace, larger workflows (input/output of combined LLM/API calls), or control flow within agents using conditional evals.
  • Conditional evals save cost by skipping further evaluation when the control flow is already incorrect (a minimal sketch follows this list).
  • Evals can also be applied at the highest level, such as evaluating an entire session (a series of traces) to understand overall customer experience (e.g., frustration).
  • A key hot take is to "not use out-of-the-box evals" as they lead to "out-of-the-box results"; heavy customization is recommended.
  • Arize AI has developed an AI co-pilot whose purpose is to troubleshoot, observe, and build evals for AI systems, anticipating a future where AI evaluates AI.
  • Agent evaluations are particularly complex due to longer calls and traces, shifting the focus to identifying specific failure modes of an agent.
  • An "agent graph" provides a framework-agnostic way to visualize and analyze an agent's pathing across aggregate traces, helping to understand how different component sequences affect eval performance.
  • "Trajectory evals" involve using golden datasets for expected agent paths (e.g., hitting specific components), which can then be graded by an LLM or checked for explicit matches.

Q&A and Future Outlook 18:06

  • Inline evals (guardrails): Can be run "in orchestration" (inline) or "out of orchestration." While they mitigate risk, they can introduce latency and complexity, and often the root cause of an issue lies in the system itself (the prompt/orchestration) rather than in the guardrail. Guardrails are for "known knowns," unlike observability + evals, which handle unknown distributions.
  • Asynchronous processes: OpenTelemetry (OTel) is crucial for enterprise customers to track traces across distributed services and asynchronous processes, allowing a holistic view of the system's work (see the context-propagation sketch after this list).
  • Confidence scores on evals: For auto-regressive models, the log probability of the generated eval label can serve as a pseudo-confidence score; for encoder-only models, it is the classification probability (see the logprob sketch after this list).
  • Automating feedback loops: Arize AI is working on "metaprompting," where an LLM is fed data (inputs, outputs, evals, failures, the original prompt) and automatically generates a new, optimized prompt (a rough sketch follows).
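
A rough sketch of the OpenTelemetry point: the producer injects the current trace context into the message it hands to an asynchronous worker, and the worker extracts it so both spans land on the same trace. This uses the standard opentelemetry-api propagation helpers; the queue is simulated with a plain dict carrier:

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("async-demo")

def enqueue_work() -> dict:
    """Producer side: start a span and inject its context into the message headers."""
    with tracer.start_as_current_span("enqueue_job"):
        carrier: dict = {}
        inject(carrier)  # writes W3C traceparent headers into the carrier
        return {"headers": carrier, "payload": "do the thing"}

def process_work(message: dict) -> None:
    """Consumer side: extract the upstream context so this span joins the same trace."""
    ctx = extract(message["headers"])
    with tracer.start_as_current_span("process_job", context=ctx):
        pass  # the asynchronous work happens here

process_work(enqueue_work())
```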
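
A sketch of the logprob-based pseudo-confidence idea, using the `logprobs` option of an OpenAI-style chat completions call; constraining the judge to a one-token yes/no label keeps the arithmetic simple. The prompt and model are illustrative:

```python
import math
from openai import OpenAI

client = OpenAI()

def judge_with_confidence(prompt: str) -> tuple[str, float]:
    """Return the judge's one-token label and exp(logprob) as a pseudo-confidence score."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,      # force a single-token label ("yes" / "no")
        logprobs=True,
        temperature=0,
    )
    token_info = resp.choices[0].logprobs.content[0]  # logprob of the generated label token
    return token_info.token.strip().lower(), math.exp(token_info.logprob)

label, confidence = judge_with_confidence(
    "Context: 'Arize AI is an observability platform.' "
    "Query: 'What does Arize do?' Is the context relevant? Answer yes or no."
)
print(label, f"{confidence:.2f}")
```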
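
And a rough sketch of the metaprompting loop described above, where failing examples and the original prompt are fed back to an LLM that proposes a revised prompt; the packaging of the data and the meta-prompt wording are assumptions, not Arize's implementation:

```python
from openai import OpenAI

client = OpenAI()

META_PROMPT = """You are improving a prompt for an LLM application.
Original prompt:
{original_prompt}

Examples where the application failed its evals (input, output, eval verdict):
{failures}

Rewrite the prompt to address these failures. Return only the new prompt."""

def metaprompt(original_prompt: str, failures: list[dict]) -> str:
    """Ask an LLM to propose an improved prompt from observed eval failures."""
    failure_text = "\n".join(
        f"- input: {f['input']}\n  output: {f['output']}\n  eval: {f['eval']}"
        for f in failures
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": META_PROMPT.format(original_prompt=original_prompt,
                                                 failures=failure_text)}],
    )
    return resp.choices[0].message.content
```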