Events are the Wrong Abstraction for Your AI Agents - Mason Egger, Temporal.io
Rethinking Software Abstractions for AI Agents 00:14
The presentation begins by using the analogy of geocentric versus heliocentric models of the solar system, noting that while both are accurate, the heliocentric model greatly simplified understanding and enabled new discoveries by shifting the frame of reference.
This analogy is applied to software development, suggesting that the chosen frame of reference (or abstraction) significantly impacts how we build and scale systems.
Scaling AI agents is fundamentally a distributed systems problem, similar to scaling microservices, and existing patterns like event-driven architecture (EDA) are commonly used.
Event-driven architecture is defined as a software design pattern where components communicate via events, enabling loose coupling and asynchronous processing.
Challenges with Event-Driven Architecture (EDA) 03:26
An example of a typical event-driven AI agent architecture is presented, featuring cron jobs, a message bus, and a dead letter queue, highlighting that much of it focuses on preventing breakage rather than core business logic.
The speaker argues that modern applications have put events, rather than business logic, at the center, so that more code is dedicated to event handling than to the application's actual purpose.
Key issues with EDA include:
A sacrifice of clear, well-defined APIs, as events often lack the documentation and structure of traditional APIs, making it hard to understand what is produced and consumed.
Scattered business logic, fragmented across numerous services, which complicates debugging and requires extensive searching across codebases to trace event flows.
Services becoming ad-hoc state machines with their own local databases and caches. Message processing and state updates are often not wrapped in a single transaction, which produces race conditions and bugs that are hard to reproduce and fix.
A critical argument is that EDA is loosely coupled at runtime but tightly coupled at design time. The common misconception is to assume that runtime decoupling also means design-time decoupling, when producers and consumers still depend on a shared event format.
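The missing-transaction problem in the last item can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from the talk: `BalanceService`, the event shape, and the simulated at-least-once broker are invented for illustration. The handler updates local state and then "crashes" before acknowledging the message, so the broker redelivers it and the naive handler applies it twice.

```python
class CrashBeforeAck(Exception):
    """Simulates the process dying after the state write but before the ack."""


class BalanceService:
    """Hypothetical consumer that keeps its own local state."""

    def __init__(self):
        self.balance = 0        # stands in for a row in a local database
        self.seen_ids = set()   # only used by the idempotent handler

    def handle_naive(self, event, crash=False):
        self.balance += event["amount"]   # state update happens first...
        if crash:
            raise CrashBeforeAck()        # ...and the ack never reaches the broker

    def handle_idempotent(self, event, crash=False):
        if event["id"] in self.seen_ids:  # redelivery: already applied, skip
            return
        self.balance += event["amount"]
        self.seen_ids.add(event["id"])
        if crash:
            raise CrashBeforeAck()


def deliver_at_least_once(handler, event):
    """Redeliver the event until the handler returns without raising."""
    crash = True                          # the first delivery crashes
    while True:
        try:
            handler(event, crash=crash)
            return
        except CrashBeforeAck:
            crash = False                 # the redelivery succeeds


event = {"id": "evt-1", "amount": 100}

naive = BalanceService()
deliver_at_least_once(naive.handle_naive, event)

careful = BalanceService()
deliver_at_least_once(careful.handle_idempotent, event)

print(naive.balance, careful.balance)  # prints "200 100"
```

In a real service the deduplication record and the balance update would also have to be committed atomically; the in-memory version above sidesteps that, which is exactly the kind of extra machinery the speaker says ends up surrounding the business logic.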
This design-time coupling can lead to fear of iterating on architecture, as changing an event format can unknowingly break numerous downstream systems.
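A small sketch of this design-time coupling, using invented names (`producer_v1`, `producer_v2`, `billing_consumer`) and an invented event shape: the producer renames one field, nothing fails at the producer, and the consumer only discovers the change when it processes the new event.

```python
def producer_v1():
    """Original event format (hypothetical)."""
    return {"type": "order_placed", "order_id": "o-1", "total": 42}


def producer_v2():
    """Same event after the producer team renames "total" to "amount"."""
    return {"type": "order_placed", "order_id": "o-1", "amount": 42}


def billing_consumer(event):
    """Downstream service that silently depends on the "total" field."""
    return event["total"] * 100  # convert to cents


def try_consume(event):
    """Run the consumer and report a missing field instead of crashing."""
    try:
        return billing_consumer(event)
    except KeyError as exc:
        return f"broken: missing field {exc}"


print(try_consume(producer_v1()))  # prints 4200
print(try_consume(producer_v2()))  # prints broken: missing field 'total'
```

Nothing in the producer's code names its consumers, which is why the breakage is easy to cause without knowing it.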
The speaker proposes reorienting software design to put "durable execution" at the center.
Durable execution is presented as a new category of "crash-proof execution" that allows developers to focus on what the application should achieve, rather than anticipating and mitigating every possible failure.
This approach accelerates development by making failures, though inevitable, inconsequential.
Automatic State Preservation: Durable execution applications automatically save all application state, including local variables, function calls, inputs, and outputs, eliminating the need for developers to manually manage caches or local databases for state recovery.
Virtualized Execution: Execution can span multiple processes and machines; if a process crashes, durable execution automatically resumes from the last known saved point, often without the developer's awareness.
Time Unlimited: Because it can survive crashes, durable execution allows code to run for extended periods, enabling long-duration operations (e.g., a 30-day sleep in code) that would typically be problematic.
Hardware Agnostic: Reliability is built into the software layer, not dependent on expensive fault-tolerant hardware, allowing it to run natively anywhere, including on a Raspberry Pi in outer space.
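The "30-day sleep" idea can be sketched as a timer whose deadline is persisted, so a restarted process sleeps only for the time that remains. This is a toy illustration under stated assumptions, not how any particular durable execution platform implements timers: `remaining_sleep`, the JSON file used as the store, and the explicit `now` argument (a fake clock) are all invented for the example.

```python
import json
import os
import tempfile


def remaining_sleep(store_path, duration_s, now):
    """Return the seconds left to sleep, persisting the deadline on first call.

    `now` is passed in explicitly so the example can simulate time passing.
    """
    if os.path.exists(store_path):
        # A previous process already started this timer: reuse its deadline.
        with open(store_path) as f:
            deadline = json.load(f)["deadline"]
    else:
        deadline = now + duration_s
        with open(store_path, "w") as f:
            json.dump({"deadline": deadline}, f)
    return max(0, deadline - now)


DAY = 24 * 60 * 60
store = os.path.join(tempfile.mkdtemp(), "timer.json")

# First process starts a 30-day sleep at t = 0, then crashes.
first = remaining_sleep(store, 30 * DAY, now=0)

# A new process starts on day 10 and picks up the same timer.
second = remaining_sleep(store, 30 * DAY, now=10 * DAY)

print(first // DAY, second // DAY)  # prints "30 20"
```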
A durable execution agent architecture is shown to be significantly simpler, with automatic retries for failures (e.g., LLM downtime or rate limits) built into function calls.
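As a rough picture of what "retries built into function calls" saves the developer from writing, here is a hand-rolled retry loop with exponential backoff. The names `call_with_retries`, `RateLimited`, and `flaky_llm_call` are hypothetical, and the policy (five attempts, doubling delay) is an assumption for the sketch rather than any platform's default.

```python
import time


class RateLimited(Exception):
    """Stands in for an LLM provider returning a rate-limit error."""


def call_with_retries(fn, max_attempts=5, base_delay_s=0.01, sleep=time.sleep):
    """Call fn, retrying on RateLimited with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts:
                raise                                  # out of attempts: give up
            sleep(base_delay_s * 2 ** (attempt - 1))   # 1x, 2x, 4x, ...


calls = {"n": 0}


def flaky_llm_call():
    """Fails twice with a rate limit, then succeeds (simulated)."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited("429")
    return "summary text"


# Pass a no-op sleep so the example finishes instantly.
result = call_with_retries(flaky_llm_call, sleep=lambda s: None)
print(result, calls["n"])  # prints "summary text 3"
```

With durable execution, a policy like this is declared once for the call instead of being rewritten around every external request.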
State reconstruction happens automatically upon failure, allowing the workflow to continue as if no crash occurred.
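One common way to get this behavior is to record each step's result in a persisted history and, after a crash, re-run the workflow function while replaying recorded results instead of re-executing the steps. The sketch below is a toy model of that idea only; `DurableRun`, `agent_workflow`, and the in-memory list standing in for durable storage are invented, and a real platform handles far more (determinism checks, timers, signals).

```python
class Crash(Exception):
    """Simulates the worker process dying mid-workflow."""


class DurableRun:
    def __init__(self, history):
        self.history = history   # stands in for storage that survives crashes
        self.cursor = 0          # position in the history during replay
        self.executed = []       # steps actually executed by this process

    def step(self, name, fn):
        if self.cursor < len(self.history):
            result = self.history[self.cursor]   # replay the recorded result
        else:
            result = fn()                        # first time: really run it
            self.history.append(result)
            self.executed.append(name)
        self.cursor += 1
        return result


def agent_workflow(run, crash_after_plan=False):
    """Plain sequential business logic; no queues or events in sight."""
    plan = run.step("plan", lambda: "search docs")
    if crash_after_plan:
        raise Crash()
    return run.step("act", lambda: f"did: {plan}")


history = []

first = DurableRun(history)
try:
    agent_workflow(first, crash_after_plan=True)
except Crash:
    pass  # the process is gone, but the history survives

second = DurableRun(history)          # a fresh process picks up the same history
answer = agent_workflow(second)

print(first.executed, second.executed, answer)
# prints "['plan'] ['act'] did: search docs"
```

The second process never re-runs the "plan" step, yet the local variable `plan` has the same value it had before the crash, which is the sense in which the workflow continues as if nothing happened.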
Long-term state can be stored for audit purposes, and developers can focus solely on business logic without worrying about managing queues or events.
Temporal.io is introduced as an open-source, MIT-licensed platform that provides durable execution, with SDKs for seven programming languages (Go, Python, TypeScript, Ruby, .NET, Java, PHP).
Temporal is natively polyglot, allowing functions written in different languages to call each other seamlessly.
The underlying mechanism still uses events, but Temporal abstracts away the complexity of event handling, similar to how programming languages have abstracted away complexities like assembly code, memory management, and flow control over the history of software engineering.
The core message is that AI agents are fundamentally distributed systems, and durable execution provides the next level of abstraction to manage their inherent complexities.