Break It 'Til You Make It: Building the Self-Improving Stack for AI Agents - Aparna Dhinakaran

Introduction to Agent Evaluation 00:02

  • Aparna Dhinakaran introduces herself and Arize, a company building development tools for AI agents.
  • The discussion will cover agent evaluation, observability, and monitoring in applications.

Challenges in Building Agents 00:15

  • Building AI agents requires extensive iteration on prompts, models, and tool call definitions.
  • Many teams rely on informal methods for evaluating agents, leading to challenges in systematic tracking of improvements.
  • Identifying bottlenecks in production is difficult, making it hard to enhance agent performance.

Evaluating Tool Calls 02:11

  • Evaluating the tool calls an agent makes is important, covering both whether the correct tool was selected and whether the right arguments were passed.
  • The evaluation process involves analyzing traces of interactions to see where the agent performs well or poorly; a minimal sketch of such a check follows below.
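
To make this concrete, here is a minimal LLM-as-judge sketch for checking a single tool call, using the OpenAI Python SDK as the judge. The tool names, judge prompt, and example trace are illustrative assumptions, not details from the talk.

```python
# Minimal LLM-as-judge sketch for tool-call evaluation.
# Tool names, judge prompt, and the example trace are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating a single tool call made by an AI agent.
User question: {question}
Available tools: {tools}
Tool the agent called: {tool_called}
Arguments passed: {arguments}

Did the agent pick the correct tool and pass correct arguments for this question?
Answer with exactly one word: "correct" or "incorrect"."""


def evaluate_tool_call(question, tools, tool_called, arguments):
    """Ask an LLM judge whether the agent chose the right tool with the right arguments."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question,
                tools=", ".join(tools),
                tool_called=tool_called,
                arguments=arguments,
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()


# Example: one hypothetical trace pulled from the application's tracing layer.
label = evaluate_tool_call(
    question="What was our Q3 revenue?",
    tools=["search_docs", "run_sql_query", "send_email"],
    tool_called="search_docs",
    arguments={"query": "Q3 revenue"},
)
print(label)  # "correct" or "incorrect"
```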

High-Level Evaluation of Agent Performance 04:20

  • Need to assess the overall performance of agents across multiple interaction paths, rather than just individual responses.
  • Dhinakaran walks through the agent's architecture and how tool calls are orchestrated; a sketch of rolling per-trace evaluations up to path-level metrics follows below.
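
As a rough illustration of path-level analysis, the sketch below rolls per-trace evaluation labels up into accuracy per interaction path with pandas. The column names and paths are assumptions for illustration, not the talk's actual data.

```python
# Aggregate per-trace eval labels into path-level and overall accuracy.
# The columns ("path", "tool_call_correct") are assumed for illustration;
# real rows would come from the application's tracing/observability layer.
import pandas as pd

trace_evals = pd.DataFrame([
    {"trace_id": "t1", "path": "router -> search_docs",   "tool_call_correct": 1},
    {"trace_id": "t2", "path": "router -> search_docs",   "tool_call_correct": 0},
    {"trace_id": "t3", "path": "router -> run_sql_query", "tool_call_correct": 1},
    {"trace_id": "t4", "path": "router -> run_sql_query", "tool_call_correct": 1},
])

# Accuracy per interaction path, plus overall accuracy across all traces.
per_path = trace_evals.groupby("path")["tool_call_correct"].mean()
overall = trace_evals["tool_call_correct"].mean()

print(per_path)
print(f"overall tool-call accuracy: {overall:.0%}")
```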

Analyzing Specific Interactions 07:00

  • Focus on the agent's performance on search questions, where accuracy was poor and improvement was needed.
  • Evaluating specific traces helps pinpoint failures in tool selection and argument passing; a sketch of slicing traces down to failing search interactions follows below.
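
A minimal sketch of drilling into one slice, assuming the traces have already been labeled by an evaluator; the tool names, columns, and rows are illustrative, not taken from the talk.

```python
# Filter labeled traces down to failing search interactions and inspect
# the arguments the agent actually passed. Data is made up for illustration.
import pandas as pd

traces = pd.DataFrame([
    {"trace_id": "t1", "tool": "search_docs",   "arguments": {"query": "Q3 revenue"}, "eval": "correct"},
    {"trace_id": "t2", "tool": "search_docs",   "arguments": {"query": ""},           "eval": "incorrect"},
    {"trace_id": "t3", "tool": "run_sql_query", "arguments": {"sql": "SELECT 1"},     "eval": "correct"},
])

failing_search = traces[(traces["tool"] == "search_docs") & (traces["eval"] == "incorrect")]
for _, row in failing_search.iterrows():
    print(row["trace_id"], row["arguments"])  # e.g. an empty query explains the miss
```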

Trajectory Evaluation 09:05

  • Evaluating the order of tool calls is crucial to ensure agents complete tasks effectively.
  • Consistency in tool-calling order affects both task completion and response quality; a sketch of a simple trajectory check follows below.
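
One simple way to score a trajectory is to check that the expected tools appear in the expected relative order. The sketch below shows that check; the reference trajectory is an assumption for illustration.

```python
# Check whether an agent's actual tool-call sequence preserves the order
# of an expected reference trajectory (extra or repeated calls are allowed).
def in_order(expected: list[str], actual: list[str]) -> bool:
    """True if every expected step appears in `actual` in the same relative order."""
    it = iter(actual)
    return all(step in it for step in expected)


expected_trajectory = ["lookup_customer", "search_docs", "draft_response"]

print(in_order(expected_trajectory,
               ["lookup_customer", "search_docs", "search_docs", "draft_response"]))  # True
print(in_order(expected_trajectory,
               ["search_docs", "draft_response"]))  # False: a step was skipped
```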

Multi-Turn Conversation Evaluation 10:21

  • Multi-turn interactions require tracking context and consistency across turns.
  • Evaluating how well an agent maintains context across turns helps assess its conversational effectiveness; a sketch of a conversation-level check follows below.
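
A minimal sketch of a conversation-level judge, again using the OpenAI Python SDK: the full transcript is handed to the judge, which labels whether context from earlier turns is carried into later ones. The prompt wording and example conversation are assumptions for illustration.

```python
# Multi-turn eval sketch: judge the whole conversation for context consistency.
# Judge prompt, model choice, and the sample conversation are assumptions.
from openai import OpenAI

client = OpenAI()


def evaluate_context_consistency(conversation):
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in conversation)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed judge model
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Here is a multi-turn conversation between a user and an AI agent:\n\n"
                f"{transcript}\n\n"
                "Does the agent correctly carry context from earlier turns into later ones "
                '(names, constraints, prior answers)? Answer "consistent" or "inconsistent".'
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()


conversation = [
    {"role": "user", "content": "I'm flying to Tokyo on May 3rd. Any visa requirements?"},
    {"role": "assistant", "content": "For short stays in Japan, many nationalities are visa-exempt..."},
    {"role": "user", "content": "Great, and what's the weather like when I land?"},
    {"role": "assistant", "content": "Early May in Tokyo is typically mild..."},
]
print(evaluate_context_consistency(conversation))
```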

Iterative Improvement Processes 12:00

  • Emphasis on the importance of refining both agent evaluation prompts and application prompts.
  • Continuous feedback loops for improving evaluation methods and surfacing failure cases enhance overall product quality; a sketch of one such loop follows below.
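
One way to close the loop on the evaluator itself is to compare its labels against a small set of human-labeled traces and revise the judge prompt when agreement drops. The sketch below shows that comparison; the labels and the agreement threshold are made up for illustration.

```python
# One iteration of eval refinement: measure judge/human agreement,
# then surface the disagreements as the failure cases to study next.
human_labels = {"t1": "correct", "t2": "incorrect", "t3": "incorrect"}
judge_labels = {"t1": "correct", "t2": "correct",   "t3": "incorrect"}

agreement = sum(judge_labels[t] == human_labels[t] for t in human_labels) / len(human_labels)
print(f"judge/human agreement: {agreement:.0%}")

if agreement < 0.9:  # assumed threshold
    # Disagreements point at where the judge prompt (or the application
    # prompt) needs another pass before the next run of the loop.
    disagreements = [t for t in human_labels if judge_labels[t] != human_labels[t]]
    print("review these traces:", disagreements)
```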

Conclusion and Resources 13:47

  • Dhinakaran highlights the significance of proper evaluations for creating effective AI agents.
  • Encourages viewers to explore Arize Phoenix, an open-source tool for tracing and evaluating agents in their applications.