Agentic Excellence: Mastering AI Agent Evals w/ Azure AI Evaluation SDK — Cedric Vidal, Microsoft

Introduction to AI Agent Evaluation 00:21

  • Cedric Vidal, Principal AI Advocate at Microsoft, discusses evaluating AI agents, contrasting evaluation with red teaming, which focuses on deliberately generating bad or adversarial data to probe a system.
  • While AI agents have made significant progress, greater agency brings a higher risk of harmful outcomes.
  • A methodical approach to evaluation is necessary, as simply testing with a couple of prompts is insufficient.

Initial Evaluation Steps 03:14

  • Evaluation should commence at the very beginning of an AI development project.
  • There are four distinct layers for agent evaluation and mitigation: the model, the safety system (platform-level, built into Azure), the system message and grounding, and the user experience.
  • The foundation model is only one part of safety; true safety is achieved by layering smart mitigations at the application level.

Manual Model Evaluation with VS Code AI Toolkit 04:16

  • The first step is manual model evaluation to understand how different models respond to specific prompts, which can reveal nuances missed by automatic metrics.
  • The VS Code AI Toolkit plugin allows developers to compare model responses side by side within their development environment (a programmatic sketch of the same idea follows this list).
  • For example, GPT-4.1 offered significant throughput improvements over GPT-4o, delivering answers much faster, and its response was also preferred for quality in a recipe example.
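
The side-by-side comparison itself happens in the AI Toolkit UI; purely to illustrate the same idea in code, the sketch below sends one prompt to two Azure OpenAI deployments and prints both answers for manual review. The endpoint, API key, and deployment names are placeholder assumptions, not values from the talk.

```python
# Minimal sketch (not the AI Toolkit itself): send one prompt to two Azure OpenAI
# deployments and compare the answers manually. The endpoint, API key, and
# deployment names are placeholders for your own resources.
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

prompt = "Give me a quick recipe for a chocolate mug cake."

for deployment in ["gpt-4o", "gpt-4.1"]:  # assumed deployment names
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {deployment} ---")
    print(response.choices[0].message.content)
```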

Evaluating the Full AI Agent System 07:12

  • Once a foundation model is selected, the next step is to evaluate the entire AI agent system end-to-end.
  • The AI Toolkit in VS Code facilitates rapid building and evaluation of AI agents.
  • The demo customized a sample web-scraper agent to use GPT-4.1 and extract agenda and event information from web pages.
  • This agent used Playwright to navigate web pages, extracting and combining information (e.g., event name, date, location, and attendee count from a Luma page linked from a Reactor page).
  • The evaluation tab in the AI Toolkit allows running the agent on multiple inputs, reviewing responses, and exporting results to a JSON file for further analysis.
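
The exact schema of the AI Toolkit export is not specified in the talk; assuming it is a JSON array of records with query and response fields (the file name and field names are assumptions), a minimal sketch for turning it into a JSONL dataset ready for bulk evaluation could look like this:

```python
# Minimal sketch for post-processing an exported run file. The export schema is
# assumed: a JSON array of records with "query" and "response" fields; adjust
# the field names to match your actual export.
import json

with open("agent_run_export.json", encoding="utf-8") as f:  # assumed file name
    records = json.load(f)

# The Azure AI Evaluation SDK expects JSONL: one JSON object per line. Add extra
# columns such as "context" or "ground_truth" here if your export contains them,
# since some evaluators require them.
with open("agent_eval_data.jsonl", "w", encoding="utf-8") as out:
    for record in records:
        row = {
            "query": record.get("query", ""),
            "response": record.get("response", ""),
        }
        out.write(json.dumps(row, ensure_ascii=False) + "\n")
```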

Scaling Automated Evaluations with Azure AI Foundry 13:21

  • To scale beyond local manual checks, automated evaluation is essential for thorough, wide-ranging coverage.
  • Azure AI Foundry provides a wide range of built-in evaluators, including AI-assisted quality checks (groundedness, fluency, coherence), classic NLP metrics (F1, BLEU, ROUGE), and AI-assisted risk and safety evaluators.
  • Evaluations can be run from the Azure AI Foundry portal or programmatically in Python, as demonstrated in a notebook.
  • The evaluate function takes a set of evaluators and a dataset and bulk-evaluates an AI agent across metrics, producing scores typically on a 1-to-5 scale (see the sketch after this list).
  • Thresholds for these scores can be configured per application, for example stricter thresholds for children's applications versus gaming applications that may tolerate more violent content.
  • Azure AI Foundry also supports evaluating multimodal models that mix text and images for safety.
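
As a minimal sketch of that programmatic path with the Azure AI Evaluation SDK (pip install azure-ai-evaluation): the endpoint, deployment name, and file paths below are placeholders, and each evaluator expects particular columns in the JSONL (noted in the comments), so adjust them to your own dataset.

```python
# Minimal sketch of a bulk evaluation run with the Azure AI Evaluation SDK.
# Placeholder endpoint, deployment, and file paths; not the talk's exact notebook.
import os

from azure.ai.evaluation import (
    CoherenceEvaluator,
    F1ScoreEvaluator,
    FluencyEvaluator,
    GroundednessEvaluator,
    evaluate,
)

# Judge model used by the AI-assisted quality evaluators.
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": "gpt-4.1",  # assumed deployment name
}

result = evaluate(
    data="agent_eval_data.jsonl",  # JSONL with query/response (+ context/ground_truth) columns
    evaluators={
        "groundedness": GroundednessEvaluator(model_config),  # needs a "context" column
        "fluency": FluencyEvaluator(model_config),
        "coherence": CoherenceEvaluator(model_config),
        "f1": F1ScoreEvaluator(),  # classic NLP metric; needs a "ground_truth" column
    },
    output_path="./eval_results.json",
)

# AI-assisted quality scores are typically on a 1-5 scale; compare them against
# the thresholds that make sense for your application.
print(result["metrics"])
```

The AI-assisted risk and safety evaluators (for example ViolenceEvaluator) follow the same pattern but are constructed with an Azure AI project reference and a credential instead of a judge-model config.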

Resources 18:49

  • More information and discussion are available in the Azure AI Foundry GitHub discussions and on the Azure AI Foundry Discord server.
  • The presentation slides will be shared on the Discord server.