Evaluation should commence at the very beginning of an AI development project.
There are four layers to evaluate in an agent: the model, the safety system (platform-level protections built into Azure), the system message and grounding, and the user experience.
The foundation model is only one part of safety; true safety is achieved by layering smart mitigations at the application level.
Manual Model Evaluation with VS Code AI Toolkit 04:16
The first step is manual model evaluation to understand how different models respond to specific prompts, which can reveal nuances missed by automatic metrics.
The VS Code AI Toolkit plugin allows developers to compare model responses side-by-side within their development environment.
For example, in a side-by-side comparison on a recipe prompt, GPT-4.1 delivered answers noticeably faster than GPT-4o, and its response was also preferred for quality.
Once a foundation model is selected, the next step is to evaluate the entire AI agent system end-to-end.
The AI Toolkit in VS Code facilitates rapid building and evaluation of AI agents.
The demo customized a sample web scraper agent to use GPT-4.1, producing an agent that extracts agenda and event information from web pages.
This agent used Playwright to navigate web pages, extracting and combining information (e.g., event name, date, location, and attendee count from a Luma page linked from a Reactor page).
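To make the mechanics concrete, a hand-written sketch of that kind of Playwright extraction step is shown below; the URL, selectors, and field names are placeholders, and the demo agent drives the browser through its own tool integration rather than fixed selectors like these.

```python
# Minimal sketch of the kind of page extraction the demo agent performs.
# Assumes `pip install playwright` and `playwright install chromium`;
# the URL and selectors below are placeholders, not the ones from the demo.
from playwright.sync_api import sync_playwright


def extract_event_details(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        details = {
            "event_name": page.title(),
            # Hypothetical selectors -- a real page needs its own.
            "date": page.locator("time").first.inner_text(),
            "location": page.locator("[class*=location]").first.inner_text(),
        }
        browser.close()
        return details


if __name__ == "__main__":
    print(extract_event_details("https://example.com/event"))
```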
The evaluation tab in the AI Toolkit allows running the agent on multiple inputs, reviewing responses, and exporting results to a JSON file for further analysis.
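The exported file can then be inspected with a few lines of Python; the file name and record fields below are assumptions about the export, not a documented schema.

```python
# Quick look at an exported AI Toolkit evaluation run.
# "agent_eval_results.json" and the record fields are assumed, not a fixed schema.
import json

with open("agent_eval_results.json", encoding="utf-8") as f:
    results = json.load(f)

for i, record in enumerate(results, start=1):
    prompt = record.get("query", "")[:60]
    response = record.get("response", "")[:60]
    print(f"{i:>3}. {prompt!r} -> {response!r}")
```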
Scaling Automated Evaluations with Azure AI Foundry 13:21
To scale beyond local manual checks, automated evaluation is essential for thorough, repeatable coverage across many inputs.
Azure AI Foundry provides a wide range of built-in evaluators, including AI-assisted quality checks (groundedness, fluency, coherence), classic NLP metrics (F1, BLEU, ROUGE), and AI-assisted risk and safety evaluators.
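A rough sketch of wiring up a few of these built-in evaluators with the azure-ai-evaluation Python package is shown below; the endpoint, deployment, and project values are placeholders, and exact class names or constructor arguments may differ across SDK versions.

```python
# Sketch of instantiating built-in evaluators from the azure-ai-evaluation package.
# Endpoint, deployment, key, and project values are placeholders.
from azure.ai.evaluation import (
    GroundednessEvaluator,   # AI-assisted quality
    FluencyEvaluator,
    CoherenceEvaluator,
    F1ScoreEvaluator,        # classic NLP metrics
    BleuScoreEvaluator,
    RougeScoreEvaluator,
    RougeType,
    ViolenceEvaluator,       # AI-assisted risk and safety
)
from azure.identity import DefaultAzureCredential

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "gpt-4.1",
    "api_key": "<your-api-key>",
}
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<foundry-project>",
}

quality_evaluators = {
    "groundedness": GroundednessEvaluator(model_config),
    "fluency": FluencyEvaluator(model_config),
    "coherence": CoherenceEvaluator(model_config),
}
nlp_evaluators = {
    "f1": F1ScoreEvaluator(),
    "bleu": BleuScoreEvaluator(),
    "rouge": RougeScoreEvaluator(rouge_type=RougeType.ROUGE_L),
}
safety_evaluators = {
    "violence": ViolenceEvaluator(
        credential=DefaultAzureCredential(), azure_ai_project=azure_ai_project
    ),
}
```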
Evaluations can be performed via the Azure AI Foundry portal or programmatically using Python code, as demonstrated in a notebook.
The evaluate function takes evaluators and a dataset to bulk evaluate an AI agent on various metrics, providing scores typically between 1 and 5.
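Continuing the sketch above, a bulk run over a JSONL dataset might look like the following; the data file name and its columns (query, response, context, ground_truth) are assumptions about how the dataset is laid out.

```python
# Sketch of a bulk evaluation run; "agent_eval_data.jsonl" is a placeholder
# dataset with one JSON object per line (e.g. query, response, context, ground_truth).
from azure.ai.evaluation import evaluate

result = evaluate(
    data="agent_eval_data.jsonl",
    evaluators={**quality_evaluators, **nlp_evaluators, **safety_evaluators},
    # Optionally associate the run with the Foundry project so it appears in the portal.
    azure_ai_project=azure_ai_project,
    output_path="./evaluation_results.json",
)

# AI-assisted quality scores typically land on a 1-5 scale.
print(result["metrics"])
```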
Thresholds for these scores can be configured based on application needs, for example stricter thresholds for children's applications versus gaming applications that may tolerate more violent content.
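One simple way to enforce such thresholds is to check the aggregated metrics after the run, as in the illustrative sketch below; the metric keys and threshold values are examples, not recommendations.

```python
# Illustrative post-run threshold check. Metric keys follow the
# "<evaluator_name>.<metric>" convention seen in evaluate() output, and the
# threshold values are examples only. Safety metrics such as defect rates
# would be checked the other way (must stay below a maximum).
THRESHOLDS = {
    "groundedness.groundedness": 4.0,  # stricter for a kids' app
    "fluency.fluency": 3.0,
}

failures = {
    metric: (score, minimum)
    for metric, minimum in THRESHOLDS.items()
    if (score := result["metrics"].get(metric)) is not None and score < minimum
}

if failures:
    raise SystemExit(f"Evaluation below threshold: {failures}")
```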
Azure AI Foundry also supports evaluating multimodal models that mix text and images for safety.