Break It 'Til You Make It: Building the Self-Improving Stack for AI Agents - Aparna Dhinakaran

Introduction to Agent Evaluation 00:02

  • Aparna Dhinakaran introduces herself and Arize, a company building development tools for AI agents.
  • The discussion will cover agent evaluation, observability, and monitoring in applications.

Challenges in Building Agents 00:15

  • Building AI agents requires extensive iteration on prompts, models, and tool call definitions.
  • Many teams rely on informal methods for evaluating agents, leading to challenges in systematic tracking of improvements.
  • Identifying bottlenecks in production is difficult, making it hard to enhance agent performance.

Evaluating Tool Calls 02:11

  • Evaluating the tool calls an agent makes is important, covering both whether the correct tool was selected and whether the right arguments were passed.
  • The evaluation process involves analyzing traces of interactions to see where the agent performs well or poorly; a minimal sketch of such a check follows below.
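
To make this concrete, here is a minimal LLM-as-judge sketch for checking a single tool call, using the OpenAI Python SDK as the judge. The tool names, judge prompt, and example trace are illustrative assumptions, not details from the talk.

```python
# Minimal LLM-as-judge sketch for tool-call evaluation.
# Tool names, judge prompt, and the example trace are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a single tool call made by an AI agent.
User question: {question}
Available tools: {tools}
Tool the agent called: {tool_called}
Arguments passed: {arguments}

Did the agent pick the correct tool and pass correct arguments for this question?
Answer with exactly one word: "correct" or "incorrect"."""


def evaluate_tool_call(question, tools, tool_called, arguments):
    """Ask an LLM judge whether the agent chose the right tool with the right arguments."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                tools=", ".join(tools),
                tool_called=tool_called,
                arguments=arguments,
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()


# Example: one hypothetical trace pulled from the application's tracing layer.
label = evaluate_tool_call(
    question="What was our Q3 revenue?",
    tools=["search_docs", "run_sql_query", "send_email"],
    tool_called="search_docs",
    arguments={"query": "Q3 revenue"},
)
print(label)  # "correct" or "incorrect"
```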

High-Level Evaluation of Agent Performance 04:20

  • Need to assess the overall performance of agents across multiple interaction paths, rather than just individual responses.
  • Dhinakaran walks through the agent's architecture and how tool calls are orchestrated; a sketch of rolling per-trace evaluations up to path-level metrics follows below.
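
As a rough illustration of path-level analysis, the sketch below rolls per-trace evaluation labels up into accuracy per interaction path with pandas. The column names and paths are assumptions for illustration, not the talk's actual data.

```python
# Aggregate per-trace eval labels into path-level and overall accuracy.
# The columns ("path", "tool_call_correct") are assumed for illustration;
# real rows would come from the application's tracing/observability layer.
import pandas as pd

trace_evals = pd.DataFrame([
    {"trace_id": "t1", "path": "router -> search_docs",   "tool_call_correct": 1},
    {"trace_id": "t2", "path": "router -> search_docs",   "tool_call_correct": 0},
    {"trace_id": "t3", "path": "router -> run_sql_query", "tool_call_correct": 1},
    {"trace_id": "t4", "path": "router -> run_sql_query", "tool_call_correct": 1},
])

# Accuracy per interaction path, plus overall accuracy across all traces.
per_path = trace_evals.groupby("path")["tool_call_correct"].mean()
overall = trace_evals["tool_call_correct"].mean()

print(per_path)
print(f"overall tool-call accuracy: {overall:.0%}")
```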

Analyzing Specific Interactions 07:00

  • Focus on the agent's performance on search questions, where accuracy was poor and improvement was needed.
  • Evaluating specific traces helps pinpoint failures in tool selection and argument passing; a sketch of slicing traces down to failing search interactions follows below.
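
A minimal sketch of drilling into one slice, assuming the traces have already been labeled by an evaluator; the tool names, columns, and rows are illustrative, not taken from the talk.

```python
# Filter labeled traces down to failing search interactions and inspect
# the arguments the agent actually passed. Data is made up for illustration.
import pandas as pd

traces = pd.DataFrame([
    {"trace_id": "t1", "tool": "search_docs",   "arguments": {"query": "Q3 revenue"}, "eval": "correct"},
    {"trace_id": "t2", "tool": "search_docs",   "arguments": {"query": ""},           "eval": "incorrect"},
    {"trace_id": "t3", "tool": "run_sql_query", "arguments": {"sql": "SELECT 1"},     "eval": "correct"},
])

failing_search = traces[(traces["tool"] == "search_docs") & (traces["eval"] == "incorrect")]
for _, row in failing_search.iterrows():
    print(row["trace_id"], row["arguments"])  # e.g. an empty query explains the miss
```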

Trajectory Evaluation 09:05

  • Evaluating the order of tool calls is crucial to ensure agents complete tasks effectively.
  • Consistency in tool-calling order affects both task completion and response quality; a sketch of a simple trajectory check follows below.
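
One simple way to score a trajectory is to check that the expected tools appear in the expected relative order. The sketch below shows that check; the reference trajectory is an assumption for illustration.

```python
# Check whether an agent's actual tool-call sequence preserves the order
# of an expected reference trajectory (extra or repeated calls are allowed).
def in_order(expected: list[str], actual: list[str]) -> bool:
    """True if every expected step appears in `actual` in the same relative order."""
    it = iter(actual)
    return all(step in it for step in expected)


expected_trajectory = ["lookup_customer", "search_docs", "draft_response"]

print(in_order(expected_trajectory,
               ["lookup_customer", "search_docs", "search_docs", "draft_response"]))  # True
print(in_order(expected_trajectory,
               ["search_docs", "draft_response"]))  # False: a step was skipped
```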

Multi-Turn Conversation Evaluation 10:21

  • Multi-turn interactions require tracking context and consistency across turns.
  • Evaluating how well an agent maintains context across turns helps assess its conversational effectiveness; a sketch of a conversation-level check follows below.
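
A minimal sketch of a conversation-level judge, again using the OpenAI Python SDK: the full transcript is handed to the judge, which labels whether context from earlier turns is carried into later ones. The prompt wording and example conversation are assumptions for illustration.

```python
# Multi-turn eval sketch: judge the whole conversation for context consistency.
# Judge prompt, model choice, and the sample conversation are assumptions.
from openai import OpenAI

client = OpenAI()


def evaluate_context_consistency(conversation):
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in conversation)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Here is a multi-turn conversation between a user and an AI agent:\n\n"
                f"{transcript}\n\n"
                "Does the agent correctly carry context from earlier turns into later ones "
                '(names, constraints, prior answers)? Answer "consistent" or "inconsistent".'
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()


conversation = [
    {"role": "user", "content": "I'm flying to Tokyo on May 3rd. Any visa requirements?"},
    {"role": "assistant", "content": "For short stays in Japan, many nationalities are visa-exempt..."},
    {"role": "user", "content": "Great, and what's the weather like when I land?"},
    {"role": "assistant", "content": "Early May in Tokyo is typically mild..."},
]
print(evaluate_context_consistency(conversation))
```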

Iterative Improvement Processes 12:00

  • Emphasis on the importance of refining both agent evaluation prompts and application prompts.
  • Continuous feedback loops for improving evaluation methods and surfacing failure cases enhance overall product quality; a sketch of one such loop follows below.
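
One way to close the loop on the evaluator itself is to compare its labels against a small set of human-labeled traces and revise the judge prompt when agreement drops. The sketch below shows that comparison; the labels and the agreement threshold are made up for illustration.

```python
# One iteration of eval refinement: measure judge/human agreement,
# then surface the disagreements as the failure cases to study next.
human_labels = {"t1": "correct", "t2": "incorrect", "t3": "incorrect"}
judge_labels = {"t1": "correct", "t2": "correct",   "t3": "incorrect"}

agreement = sum(judge_labels[t] == human_labels[t] for t in human_labels) / len(human_labels)
print(f"judge/human agreement: {agreement:.0%}")

if agreement < 0.9:  # assumed threshold
    # Disagreements point at where the judge prompt (or the application
    # prompt) needs another pass before the next run of the loop.
    disagreements = [t for t in human_labels if judge_labels[t] != human_labels[t]]
    print("review these traces:", disagreements)
```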

Conclusion and Resources 13:47

  • Dhinakaran highlights the significance of proper evaluations for creating effective AI agents.
  • Encourages viewers to explore Arize Phoenix, an open-source tool for tracing and evaluating agents in their applications.