Building Applications with AI Agents — Michael Albada, Microsoft

Introduction and Context 00:00

  • Michael Albada introduces himself as a principal applied scientist at Microsoft, mentioning his work on Security Copilot and background at Uber and startups
  • The talk distills a forthcoming 300-page O’Reilly book on building applications with AI agents; sample code accompanies the book
  • The presentation will cover the promise and challenges of AI agents, core components for building effective systems, and common pitfalls

The Promise and Challenges of Agentic Systems 01:12

  • There is a significant increase (254% in three years) in startups labeling themselves as agentic or focused on building agents
  • While there is excitement and investment, agentic systems are challenging, often requiring multiple tool calls and operating in complex environments
  • Current benchmarks show substantial progress (50-70% on complex tasks), but perfection is unrealistic, especially for edge cases
  • It’s easy to achieve initial prototypes with 70% accuracy, but addressing the "long tail" of complex scenarios is much harder

Defining Agency in AI Systems 02:38

  • An agent is defined as an entity that can reason, act, communicate, and adapt to solve tasks, building on foundation models
  • Agency in systems is a spectrum, not a binary — it's more about utility than achieving maximum agency
  • System effectiveness must not be sacrificed for additional agency; previous generations of automation, such as Robotic Process Automation (RPA), were effective but brittle
  • Agentic systems promise adaptability to changing inputs, but must maintain a high level of performance

Tool Use and Orchestration Patterns 04:44

  • Foundation models can leverage exposed tools (via APIs) for expanded functionality, but this adds risk and requires careful selection of which actions to expose
  • Tools are invoked in a loop: model generates output, tool is called, observation is fed back, repeating until the final output is produced
  • Avoid a one-to-one mapping from all APIs to agent tools; too many tools reduce overall accuracy and cause confusion
  • Tools should be logically grouped, clearly described, and feel like single human-facing actions
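The invocation loop described above can be sketched as follows; `fake_model` and the `TOOLS` registry are hypothetical stand-ins for a real foundation-model client and tool set:

```python
# Minimal sketch of the tool-invocation loop: model output -> tool call ->
# observation fed back, until a final answer. `fake_model` simulates a model
# that requests one tool call and then finishes.

TOOLS = {
    "lookup_user": lambda args: {"name": "alice", "tier": "pro"},
}

def fake_model(history):
    # Hypothetical model: ask for a tool call first, then answer.
    if not any(m["role"] == "tool" for m in history):
        return {"type": "tool_call", "tool": "lookup_user", "args": {"id": 1}}
    return {"type": "final", "text": "User alice is on the pro tier."}

def run_agent(task, model=fake_model, max_steps=5):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):            # bound the loop; never trust it to halt
        action = model(history)
        if action["type"] == "final":     # model produced its final output
            return action["text"]
        observation = TOOLS[action["tool"]](action["args"])  # invoke the tool
        history.append({"role": "tool", "content": str(observation)})
    raise RuntimeError("agent exceeded step budget")

print(run_agent("What tier is user 1 on?"))
```

The explicit step budget is the important design choice: a real loop must terminate even when the model never emits a final answer.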

Orchestration and Workflow Patterns 06:36

  • Simple, standard workflow or chain patterns are encouraged for clarity, reliability, and cost-effectiveness
  • Use branching logic or decision trees for conditional flows; models can select pathways as needed
  • In domains like cybersecurity, this helps classify incidents and perform reasoning over multiple steps
  • If chains and trees grow too complex, consider transitioning to more agentic (autonomous) patterns or fine-tuning models to handle the reasoning
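The branching pattern above can be sketched as a classifier step that selects a fixed chain per pathway; `classify` and the branch steps are illustrative assumptions, not the talk's actual example:

```python
# Sketch of a branching workflow: a classification step picks a pathway,
# then a deterministic chain handles each branch. In production the
# classifier would be a model call; a keyword check stands in here.

def classify(incident: str) -> str:
    return "phishing" if "email" in incident else "malware"

def handle_phishing(incident: str) -> list[str]:
    return ["extract_sender", "check_reputation", "quarantine_message"]

def handle_malware(incident: str) -> list[str]:
    return ["isolate_host", "collect_sample", "run_detonation"]

BRANCHES = {"phishing": handle_phishing, "malware": handle_malware}

def run_workflow(incident: str):
    label = classify(incident)               # model selects the pathway
    return label, BRANCHES[label](incident)  # fixed chain per branch

label, steps = run_workflow("suspicious email with attachment")
```

Keeping each branch a plain chain preserves the clarity and cost benefits the talk recommends; only the routing decision is delegated to the model.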

Business Logic and State Management 08:01

  • Instead of relying on LLMs for fixed business logic, keep these rules external, exposing only what’s needed through tools
  • State should be managed outside the model with proper validation, ensuring deterministic outcomes

Multi-Agent Systems 08:40

  • Multi-agent systems are useful when a single agent becomes overwhelmed by too many tools; tools are grouped and delegated to specialized agents with a coordinator
  • Agent-to-agent protocols aim for cooperation among agents built by different teams, but this is still an emerging area with technical and security challenges
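The delegation pattern can be sketched as a coordinator routing tasks to specialist agents that each own a small tool group; the specialist names and keyword routing are illustrative assumptions:

```python
# Sketch of coordinator/specialist delegation. Each specialist would own
# its own tool group; lambdas stand in for full agents, and a keyword
# check stands in for a model-driven routing decision.

SPECIALISTS = {
    "billing": lambda task: f"billing agent handled: {task}",
    "search":  lambda task: f"search agent handled: {task}",
}

def coordinator(task: str) -> str:
    agent = "billing" if "invoice" in task else "search"  # routing decision
    return SPECIALISTS[agent](task)

print(coordinator("find my latest invoice"))
```

The payoff is that each specialist sees only a handful of tools, avoiding the accuracy loss from exposing every tool to one agent.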

Evaluation and Testing 09:45

  • Strong emphasis on investing in rigorous evaluation; it is difficult to choose the right settings without high-quality evaluation sets
  • Albada advocates a test-driven development style for agents, focusing on input-output definitions
  • AI architects and engineers must take responsibility for labeling and defining desired agent behavior
  • A continuous evaluation loop involves user input, human review, adding new examples, and iterative improvement
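The test-driven style above reduces to pinning desired behavior down as labeled input-output pairs and scoring the agent against them; the tiny `agent` stub and examples here are hypothetical:

```python
# Sketch of test-driven evaluation: desired behavior is defined as
# input-output examples, and the system under test is scored against
# them. `agent` is a stand-in for the real agent.

EVAL_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def agent(query: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris"}.get(query, "unknown")

def evaluate(agent, examples) -> float:
    passed = sum(agent(ex["input"]) == ex["expected"] for ex in examples)
    return passed / len(examples)        # accuracy over the labeled set

print(evaluate(agent, EVAL_SET))
```

New failure cases found in human review get appended to `EVAL_SET`, which is the continuous loop the talk describes.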

Tools for Evaluation and Optimization 11:01

  • Synthetic data generation (e.g., IntellAgent) is useful where user data isn't available
  • Microsoft’s PyRIT is open-sourced for red-teaming agents (e.g., jailbreak testing)
  • Label Studio can help build evaluation sets; other tools like Trace, TextGrad, and DSPy support prompt optimization and automated analysis
  • Foundation models as evaluators can analyze failures and suggest improvements, reducing reliance on anecdotal manual debugging
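Using a model as an evaluator can be sketched as a judge that labels each transcript pass/fail with a suggested fix; `judge` is a hypothetical stand-in for a foundation-model call:

```python
# Sketch of model-as-evaluator: run a judge over transcripts, collect
# failures with suggested fixes instead of debugging anecdotes one by
# one. The keyword-based `judge` stands in for a real model call.

def judge(transcript: str) -> dict:
    ok = "error" not in transcript
    return {"pass": ok,
            "note": "" if ok else "tool returned an error; consider a retry"}

transcripts = [
    "user asked X; tool ok; answered",
    "user asked Y; tool error; gave up",
]

failures = [v for v in (judge(t) for t in transcripts) if not v["pass"]]
```

The collected `failures` can then be clustered and prioritized, replacing ad-hoc manual debugging with systematic analysis.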

Observability and Monitoring 12:53

  • Generative models make it hard to fully understand system behavior post-deployment
  • Use logging, tracing, and clustering/summarization to identify failure patterns and optimize
  • Tools like OpenTelemetry can support observability in agentic systems
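The logging/tracing point can be sketched with a per-step trace; a production system would emit these records as OpenTelemetry spans, but a plain in-memory trace (an assumption for this sketch) shows the shape of the data:

```python
# Sketch of per-step tracing for an agent run. Each decorated step
# records its name and latency; in production these would be
# OpenTelemetry spans exported to a backend.

import time

TRACE: list[dict] = []

def traced(step_name: str):
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            TRACE.append({"step": step_name,
                          "ms": (time.perf_counter() - start) * 1000})
            return result
        return inner
    return wrap

@traced("plan")
def plan(task):
    return ["search", "answer"]

@traced("search")
def search(query):
    return "docs"

plan("fix bug")
search("bug")
# TRACE now holds one record per step, ready for clustering/summarization.
```

Aggregating such traces across many runs is what makes the failure-pattern clustering mentioned above possible.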

Common Pitfalls and Lessons Learned 13:40

  • Insufficient evaluation is the most frequent and serious limitation
  • Tool design issues: too few/poor tools, unclear descriptions, or excessive overlap causing confusion
  • Excessive system complexity should be avoided; only add complexity with proven user value
  • Weak learning loops make it difficult to improve; focus on root causes and actionable improvements

Security and Safety Considerations 14:44

  • Agentic systems introduce new potential vulnerabilities; design for safety at every layer
  • Tools like PyRIT can help with security, but foundational engineering principles and human oversight are essential
  • Implement safeguards to enable fallbacks to human review in critical cases
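The fallback safeguard can be sketched as a gate that escalates low-confidence or destructive actions to a human review queue; the threshold and `confidence`/`destructive` fields are illustrative assumptions:

```python
# Sketch of a confidence-gated fallback: risky or uncertain actions are
# queued for human review instead of executing automatically.

REVIEW_QUEUE: list[dict] = []

def execute_or_escalate(action: dict, threshold: float = 0.8) -> str:
    if action["confidence"] < threshold or action.get("destructive"):
        REVIEW_QUEUE.append(action)      # a human approves before it runs
        return "escalated"
    return "executed"

print(execute_or_escalate({"name": "close_ticket", "confidence": 0.95}))
print(execute_or_escalate({"name": "delete_account", "confidence": 0.9,
                           "destructive": True}))
```

Marking whole categories of actions (here, anything destructive) as always-escalate is usually safer than relying on a confidence score alone.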

Closing Thoughts and Outlook 15:16

  • The talk concludes with Paul Krugman’s line that productivity isn’t everything, but in the long run it is almost everything
  • Albada expresses optimism that AI agents will drive a significant increase in individual productivity and capabilities