AI is widely used but not trusted, as it often "makes stuff up" or "hallucinates."
Examples include the Chicago Sun-Times publishing an AI-generated summer reading list that recommended a non-existent book ("The Last Algorithm" by Andy Weir).
Lawyers have cited false case law generated by AI; for example, attorneys at the firm Butler Snow did so while defending the Alabama prison system.
Air Canada's chatbot made a promise that the company was legally obliged to honor, demonstrating AI's potential to create liabilities.
Detecting problems with AI is difficult because it is non-deterministic: unlike traditional code, where a unit test can assert an exact expected output for a given input, the same prompt can produce different outputs on every run.
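As a minimal sketch of that contrast (the `summarize_with_llm` wrapper is hypothetical), a deterministic function can be pinned to an exact expected output, while an LLM call can only be checked for properties of its output:

```python
# Hypothetical wrapper around whatever model API the app uses.
from myapp.llm import summarize_with_llm


def add_tax(amount: float, rate: float = 0.1) -> float:
    return round(amount * (1 + rate), 2)


def test_add_tax():
    # Deterministic: the same input always produces the same output.
    assert add_tax(100.0) == 110.0


def test_summary_properties():
    # Non-deterministic: assert properties of the output, never an exact
    # string, because each run may phrase the summary differently.
    summary = summarize_with_llm("Quarterly revenue rose 12% to $4.2M.")
    assert isinstance(summary, str) and summary
    assert len(summary) < 200                      # rough "is it actually a summary" check
    assert "12" in summary or "4.2" in summary     # weak factuality check
```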
Evaluating complex agentic workflows, where an AI calls multiple LLMs, agents, and tools, is particularly challenging and makes it hard to define what "worked."
A solution to evaluating AI's non-deterministic behavior is to "set an AI to verify an AI," as AIs are about as good as humans at determining if another AI worked.
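A minimal sketch of the LLM-as-judge idea, assuming a hypothetical `call_llm` helper that sends a prompt to a model and returns its text:

```python
import json

# The prompt wording is illustrative; real judge prompts need iteration.
JUDGE_PROMPT = """You are an evaluator. Given a user request and an AI response,
answer with JSON only: {{"worked": true or false, "reason": "..."}}.

User request: {request}
AI response: {response}"""


def judge_response(call_llm, request: str, response: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(request=request, response=response))
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # The judge is an LLM too: handle malformed output rather than trusting it blindly.
        return {"worked": False, "reason": "judge returned unparseable output"}
```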
A chatbot demo illustrated the challenge: asking for an account balance took three steps (initial query, clarifying question, providing account name) to get the desired result, leading to debate on whether it "worked."
Comprehensive evaluation requires looking at all steps in a workflow and defining various metrics, such as successful tool calls, correct information retrieval from RAG systems, sensible answers, and hallucination detection.
Granularity is essential, breaking down evaluations by individual steps and components in multi-agent applications to pinpoint where failures occur.
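A minimal sketch of per-step scoring, assuming an illustrative trace schema and a hypothetical `judge` callable (an LLM wrapped to return a score); the metric names mirror the list above but are not any particular product's definitions:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str                      # "llm", "tool_call", "rag_retrieval", ...
    input: str
    output: str
    scores: dict = field(default_factory=dict)


def evaluate_step(judge, step: Step) -> Step:
    if step.kind == "tool_call":
        step.scores["tool_call_success"] = judge(
            f"Did this tool call succeed with sensible arguments?\n{step.output}")
    elif step.kind == "rag_retrieval":
        step.scores["context_relevance"] = judge(
            f"Is the retrieved context relevant to the query '{step.input}'?\n{step.output}")
    elif step.kind == "llm":
        step.scores["hallucination_free"] = judge(
            f"Are all claims in this answer supported by the input?\n"
            f"Input: {step.input}\nAnswer: {step.output}")
    return step


def evaluate_trace(judge, steps: list[Step]) -> list[Step]:
    # Granular: every step gets its own scores, so failures can be pinpointed.
    return [evaluate_step(judge, s) for s in steps]
```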
Metrics are evaluated by an LLM (often via multiple calls), ideally a stronger and more expensive model than the one used in the main application.
Galileo offers a custom-trained small language model specifically designed for evaluations.
Evaluations should be integrated from day one, during prompt engineering and model selection, and continued through the dev cycle, CI/CD pipelines, and production.
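A minimal sketch of evaluations acting as a CI gate, assuming hypothetical project helpers `run_agent` (the application entry point) and `judge_behaviour` (an LLM judge similar to the sketch above):

```python
import pytest

from myapp.agent import run_agent          # hypothetical: the app's entry point
from myapp.evals import judge_behaviour    # hypothetical: LLM-as-judge helper

GOLDEN_CASES = [
    ("What's my checking account balance?", "returns a balance or asks which account"),
    ("Transfer $50 to savings", "confirms the transfer or asks for confirmation"),
]


@pytest.mark.parametrize("question,expected_behaviour", GOLDEN_CASES)
def test_agent_behaviour(question, expected_behaviour):
    answer = run_agent(question)
    verdict = judge_behaviour(question, answer, expected_behaviour)
    # Fail the pipeline on regression, just as a failing unit test would.
    assert verdict["worked"], verdict["reason"]
```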
Understanding Evaluation Metrics and Insights 09:00
Key metrics include "Action completion" (did the task complete from input to output) and "Action advancement" (did it move forward towards the end goal).
In the chatbot demo, the initial query showed no completion or advancement, the follow-up showed no completion but advancement, and the final clarifying input led to both completion and advancement.
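A minimal sketch of how those two metrics could be judged per turn, again assuming a hypothetical `call_llm` helper; the prompts are illustrative rather than a vendor's exact metric definitions:

```python
def judge_turn(call_llm, conversation_so_far: str, latest_reply: str) -> dict:
    completion = call_llm(
        "Did the assistant's latest reply fully complete the user's original task? "
        "Answer yes or no.\n"
        f"Conversation:\n{conversation_so_far}\nLatest reply:\n{latest_reply}")
    advancement = call_llm(
        "Did the assistant's latest reply move the conversation closer to the user's "
        "end goal (e.g., by asking a necessary clarifying question)? Answer yes or no.\n"
        f"Conversation:\n{conversation_so_far}\nLatest reply:\n{latest_reply}")
    return {
        "action_completion": completion.strip().lower().startswith("yes"),
        "action_advancement": advancement.strip().lower().startswith("yes"),
    }
```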
Individual traces provide granular breakdowns, showing metrics at each step (e.g., LLM calling a tool, processing data).
AI can provide smart insights by analyzing evaluation data and suggesting improvements when it spots failure patterns (e.g., an LLM occasionally failing to call a tool when asked about account balances).
Suggested actions, like adding explicit instructions to a system message, are provided by AI, but human oversight is crucial ("human in the loop").
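For illustration only, the kind of change such a suggestion might amount to (the `get_account_balance` tool name is hypothetical):

```python
SYSTEM_MESSAGE_BEFORE = "You are a helpful banking assistant."

SYSTEM_MESSAGE_AFTER = (
    "You are a helpful banking assistant. "
    "When the user asks about an account balance, always call the "
    "`get_account_balance` tool instead of answering from memory. "
    "If the account is ambiguous, ask which account they mean first."
)
```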
Continuous learning with human feedback (CLHF) is necessary because evaluation metrics are not accurate out of the box; they require ongoing training and refinement by humans.
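A minimal sketch of one way that feedback loop can work, where human corrections become few-shot examples in the judge prompt (the schema is illustrative):

```python
FEEDBACK_EXAMPLES: list[dict] = []   # appended to by human reviewers


def record_feedback(request: str, response: str, judge_said: bool, human_says: bool):
    # Only disagreements between the judge and the human are worth keeping.
    if judge_said != human_says:
        FEEDBACK_EXAMPLES.append(
            {"request": request, "response": response, "correct_label": human_says})


def build_judge_prompt(request: str, response: str) -> str:
    examples = "\n\n".join(
        f"Request: {ex['request']}\nResponse: {ex['response']}\nWorked: {ex['correct_label']}"
        for ex in FEEDBACK_EXAMPLES[-5:])   # most recent corrections as few-shot examples
    return (
        "Decide whether the AI response satisfied the request. "
        "Learn from these human-labelled examples:\n\n"
        f"{examples}\n\nRequest: {request}\nResponse: {response}\nWorked:")
```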
Implement evaluations from the start, or immediately if the application already exists, to keep AI agents from generating false content or "making stuff up."
Precisely define custom measurements specific to the application's needs, such as toxicity, hallucinations, incomprehensible output, RAG performance, or other unique use cases.
These measurements should be defined at the design stage, considering prompts, app structure, and agents.
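A minimal sketch of declaring such measurements up front at design time (names and thresholds are illustrative):

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str
    description: str     # becomes part of the judge's instructions
    threshold: float     # minimum acceptable average score, 0-1


CUSTOM_METRICS = [
    Metric("toxicity_free", "Response contains no insulting or abusive language.", 0.99),
    Metric("hallucination_free", "Every factual claim is supported by the provided context.", 0.95),
    Metric("comprehensibility", "Response is coherent and understandable.", 0.90),
    Metric("rag_relevance", "Retrieved documents are relevant to the user query.", 0.90),
]
```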
Evaluations must continue throughout production, as user interactions can reveal unexpected issues.
Implement real-time prevention and alerting systems to notify when an AI agent goes rogue.
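A minimal sketch of such a guardrail, assuming hypothetical hooks into the evaluation and alerting systems (`judge_hallucination`, `send_alert`) and an illustrative threshold:

```python
from myapp.evals import judge_hallucination   # hypothetical: 0.0 (fabricated) .. 1.0 (grounded)
from myapp.ops import send_alert              # hypothetical: pages the on-call channel


def guarded_reply(user_msg: str, draft_reply: str) -> str:
    score = judge_hallucination(user_msg, draft_reply)
    if score < 0.8:   # illustrative threshold
        send_alert(f"Possible hallucination blocked (score={score:.2f}): {draft_reply[:200]}")
        return "I'm not confident in that answer, so I'm connecting you with a human agent."
    return draft_reply
```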