Building Applications with AI Agents — Michael Albada, Microsoft

Introduction and Context 00:00

  • Michael Albada introduces himself as a principal applied scientist at Microsoft, mentioning his work on Security Copilot and background at Uber and startups
  • The talk distills a forthcoming 300-page O’Reilly book on building applications with AI agents; sample code accompanies the book
  • The presentation will cover the promise and challenges of AI agents, core components for building effective systems, and common pitfalls

The Promise and Challenges of Agentic Systems 01:12

  • There is a significant increase (254% in three years) in startups labeling themselves as agentic or focused on building agents
  • While there is excitement and investment, agentic systems are challenging, often requiring multiple tool calls and operating in complex environments
  • Current benchmarks show substantial progress (50-70% on complex tasks), but perfection is unrealistic, especially for edge cases
  • It’s easy to achieve initial prototypes with 70% accuracy, but addressing the "long tail" of complex scenarios is much harder

Defining Agency in AI Systems 02:38

  • An agent is defined as an entity that can reason, act, communicate, and adapt to solve tasks, building on foundation models
  • Agency in systems is a spectrum, not a binary — it's more about utility than achieving maximum agency
  • System effectiveness must not be sacrificed for additional agency; previous generations of automation, such as Robotic Process Automation (RPA), were effective but brittle
  • Agentic systems promise adaptability to changing inputs, but must maintain a high level of performance

Tool Use and Orchestration Patterns 04:44

  • Foundation models can leverage exposed tools (via APIs) for expanded functionality, but this adds risk and requires careful selection of which actions to expose
  • Tools are invoked in a loop: model generates output, tool is called, observation is fed back, repeating until the final output is produced
  • Avoid a one-to-one mapping from all APIs to agent tools; too many tools reduce overall accuracy and cause confusion
  • Tools should be logically grouped, clearly described, and feel like single human-facing actions
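The invocation loop described above can be sketched as follows; `fake_model` and the `TOOLS` registry are hypothetical stand-ins for a real foundation-model client and tool set:

```python
# Minimal sketch of the tool-invocation loop: model output -> tool call ->
# observation fed back, until a final answer. `fake_model` simulates a model
# that requests one tool call and then finishes.

TOOLS = {
    "lookup_user": lambda args: {"name": "alice", "tier": "pro"},
}

def fake_model(history):
    # Hypothetical model: ask for a tool call first, then answer.
    if not any(m["role"] == "tool" for m in history):
        return {"type": "tool_call", "tool": "lookup_user", "args": {"id": 1}}
    return {"type": "final", "text": "User alice is on the pro tier."}

def run_agent(task, model=fake_model, max_steps=5):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):            # bound the loop; never trust it to halt
        action = model(history)
        if action["type"] == "final":     # model produced its final output
            return action["text"]
        observation = TOOLS[action["tool"]](action["args"])  # invoke the tool
        history.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("agent exceeded step budget")

print(run_agent("What tier is user 1 on?"))
```

The explicit step budget is the important design choice: a real loop must terminate even when the model never emits a final answer.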

Orchestration and Workflow Patterns 06:36

  • Simple, standard workflow or chain patterns are encouraged for clarity, reliability, and cost-effectiveness
  • Use branching logic or decision trees for conditional flows; models can select pathways as needed
  • In domains like cybersecurity, this helps classify incidents and perform reasoning over multiple steps
  • If chains and trees grow too complex, consider transitioning to more agentic (autonomous) patterns or fine-tuning models to handle the reasoning
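The branching pattern above can be sketched as a classifier step that selects a fixed chain per pathway; `classify` and the branch steps are illustrative assumptions, not the talk's actual example:

```python
# Sketch of a branching workflow: a classification step picks a pathway,
# then a deterministic chain handles each branch. In production the
# classifier would be a model call; a keyword check stands in here.

def classify(incident: str) -> str:
    return "phishing" if "email" in incident else "malware"

def handle_phishing(incident: str) -> list[str]:
    return ["extract_sender", "check_reputation", "quarantine_message"]

def handle_malware(incident: str) -> list[str]:
    return ["isolate_host", "collect_sample", "run_detonation"]

BRANCHES = {"phishing": handle_phishing, "malware": handle_malware}

def run_workflow(incident: str):
    label = classify(incident)               # model selects the pathway
    return label, BRANCHES[label](incident)  # fixed chain per branch

label, steps = run_workflow("suspicious email with attachment")
```

Keeping each branch a plain chain preserves the clarity and cost benefits the talk recommends; only the routing decision is delegated to the model.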

Business Logic and State Management 08:01

  • Instead of relying on LLMs for fixed business logic, keep these rules external, exposing only what’s needed through tools
  • State should be managed outside the model with proper validation, ensuring deterministic outcomes

Multi-Agent Systems 08:40

  • Multi-agent systems are useful when a single agent becomes overwhelmed by too many tools; tools are grouped and delegated to specialized agents with a coordinator
  • Agent-to-agent protocols aim for cooperation among agents built by different teams, but this is still an emerging area with technical and security challenges
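The delegation pattern can be sketched as a coordinator routing tasks to specialist agents that each own a small tool group; the specialist names and keyword routing are illustrative assumptions:

```python
# Sketch of coordinator/specialist delegation. Each specialist would own
# its own tool group; lambdas stand in for full agents, and a keyword
# check stands in for a model-driven routing decision.

SPECIALISTS = {
    "billing": lambda task: f"billing agent handled: {task}",
    "search":  lambda task: f"search agent handled: {task}",
}

def coordinator(task: str) -> str:
    agent = "billing" if "invoice" in task else "search"  # routing decision
    return SPECIALISTS[agent](task)

print(coordinator("find my latest invoice"))
```

The payoff is that each specialist sees only a handful of tools, avoiding the accuracy loss from exposing every tool to one agent.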

Evaluation and Testing 09:45

  • Strong emphasis on investing in rigorous evaluation; it is difficult to choose the right settings without high-quality evaluation sets
  • Albada advocates a test-driven development style for agents, focusing on input-output definitions
  • AI architects and engineers must take responsibility for labeling and defining desired agent behavior
  • A continuous evaluation loop involves user input, human review, adding new examples, and iterative improvement
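The test-driven style above reduces to pinning desired behavior down as labeled input-output pairs and scoring the agent against them; the tiny `agent` stub and examples here are hypothetical:

```python
# Sketch of test-driven evaluation: desired behavior is defined as
# input-output examples, and the system under test is scored against
# them. `agent` is a stand-in for the real agent.

EVAL_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def agent(query: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris"}.get(query, "unknown")

def evaluate(agent, examples) -> float:
    passed = sum(agent(ex["input"]) == ex["expected"] for ex in examples)
    return passed / len(examples)        # accuracy over the labeled set

print(evaluate(agent, EVAL_SET))
```

New failure cases found in human review get appended to `EVAL_SET`, which is the continuous loop the talk describes.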

Tools for Evaluation and Optimization 11:01

  • Synthetic data generation (e.g., IntellAgent) is useful where user data isn't available
  • Microsoft’s PyRIT is open-sourced for red-teaming agents (e.g., jailbreak testing)
  • Label Studio can help build evaluation sets; other tools like Trace, TextGrad, and DSPy support prompt optimization and automated analysis
  • Foundation models as evaluators can analyze failures and suggest improvements, reducing reliance on anecdotal manual debugging
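Using a model as an evaluator can be sketched as a judge that labels each transcript pass/fail with a suggested fix; `judge` is a hypothetical stand-in for a foundation-model call:

```python
# Sketch of model-as-evaluator: run a judge over transcripts, collect
# failures with suggested fixes instead of debugging anecdotes one by
# one. The keyword-based `judge` stands in for a real model call.

def judge(transcript: str) -> dict:
    ok = "error" not in transcript
    return {"pass": ok,
            "note": "" if ok else "tool returned an error; consider a retry"}

transcripts = [
    "user asked X; tool ok; answered",
    "user asked Y; tool error; gave up",
]

failures = [v for v in (judge(t) for t in transcripts) if not v["pass"]]
```

The collected `failures` can then be clustered and prioritized, replacing ad-hoc manual debugging with systematic analysis.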

Observability and Monitoring 12:53

  • Generative models make it hard to fully understand system behavior post-deployment
  • Use logging, tracing, and clustering/summarization to identify failure patterns and optimize
  • Tools like OpenTelemetry can support observability in agentic systems
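The logging/tracing point can be sketched with a per-step trace; a production system would emit these records as OpenTelemetry spans, but a plain in-memory trace (an assumption for this sketch) shows the shape of the data:

```python
# Sketch of per-step tracing for an agent run. Each decorated step
# records its name and latency; in production these would be
# OpenTelemetry spans exported to a backend.

import time

TRACE: list[dict] = []

def traced(step_name: str):
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({"step": step_name,
                          "ms": (time.perf_counter() - start) * 1000})
            return result
        return inner
    return wrap

@traced("plan")
def plan(task):
    return ["search", "answer"]

@traced("search")
def search(query):
    return "docs"

plan("fix bug")
search("bug")
# TRACE now holds one record per step, ready for clustering/summarization.
```

Aggregating such traces across many runs is what makes the failure-pattern clustering mentioned above possible.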

Common Pitfalls and Lessons Learned 13:40

  • Insufficient evaluation is the most frequent and serious limitation
  • Tool design issues: too few/poor tools, unclear descriptions, or excessive overlap causing confusion
  • Excessive system complexity should be avoided; only add complexity with proven user value
  • Weak learning loops make it difficult to improve; focus on root causes and actionable improvements

Security and Safety Considerations 14:44

  • Agentic systems introduce new potential vulnerabilities; design for safety at every layer
  • Tools like PyRIT can help with security, but foundational engineering principles and human oversight are essential
  • Implement safeguards to enable fallbacks to human review in critical cases
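The fallback safeguard can be sketched as a gate that escalates low-confidence or destructive actions to a human review queue; the threshold and `confidence`/`destructive` fields are illustrative assumptions:

```python
# Sketch of a confidence-gated fallback: risky or uncertain actions are
# queued for human review instead of executing automatically.

REVIEW_QUEUE: list[dict] = []

def execute_or_escalate(action: dict, threshold: float = 0.8) -> str:
    if action["confidence"] < threshold or action.get("destructive"):
        REVIEW_QUEUE.append(action)      # a human approves before it runs
        return "escalated"
    return "executed"

print(execute_or_escalate({"name": "close_ticket", "confidence": 0.95}))
print(execute_or_escalate({"name": "delete_account", "confidence": 0.9,
                           "destructive": True}))
```

Marking whole categories of actions (here, anything destructive) as always-escalate is usually safer than relying on a confidence score alone.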

Closing Thoughts and Outlook 15:16

  • The talk concludes with Paul Krugman’s line that productivity isn’t everything, but in the long run it is almost everything
  • Albada expresses optimism that AI agents will drive a significant increase in individual productivity and capabilities