Production software keeps breaking and it will only get worse — Anish Agarwal, Traversal.ai
The Challenge of Production Software Troubleshooting 00:01
Software engineering consists of system design, development (including DevOps), and production troubleshooting.
AI tools are streamlining development, but system design and troubleshooting remain key challenges.
The ideal future would have AI handle both coding and troubleshooting, freeing engineers to focus on creative system design, but this vision faces obstacles.
As AI generates more of the code, human engineers will have less context on and understanding of it, especially as systems grow more complex.
The difficulty of troubleshooting will increase, leading to more time spent on on-call and incident management.
Current Troubleshooting Workflow and Its Drawbacks 02:55
Engineers rely heavily on dashboards from tools like DataDog, Grafana, Splunk, and Sentry to monitor system health.
When issues arise, teams engage in "dashboard dumpster diving," searching through vast numbers of dashboards and logs for clues.
Root cause analysis involves tracing issues back to pull requests or configuration changes, often requiring large-scale collaboration in incident channels.
The process is time-consuming, inefficient, and likely to worsen as systems grow more complex and human context is reduced.
Why Traditional AI Approaches to Troubleshooting Fall Short 04:23
AIOps methods using traditional machine learning and statistical anomaly detection generate excessive false positives and do not scale with system complexity.
LLMs like ChatGPT can analyze individual logs but cannot process the immense volume (terabytes/trillions of logs) present in real-world systems.
LLMs also lack strong numerical data understanding and are limited by context and memory constraints.
Agent-based approaches relying on runbooks fail because runbooks quickly become outdated, and exhaustive agent searches take too long for urgent production issues.
No single existing method is adequate for efficient, autonomous troubleshooting.
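The false-positive problem above can be illustrated with a minimal simulation (my own sketch, not any tool's actual method): even the textbook 3-sigma rule, with a tiny per-sample false-positive rate, compounds across many samples and many metrics until most perfectly healthy time series raise alerts.

```python
import random

random.seed(0)

N_METRICS = 2_000   # number of healthy time series being monitored
N_POINTS = 500      # samples per metric
THRESHOLD = 3.0     # classic 3-sigma anomaly rule

false_alerts = 0
for _ in range(N_METRICS):
    # Pure Gaussian noise: there is no real incident anywhere in this data.
    samples = [random.gauss(0, 1) for _ in range(N_POINTS)]
    mean = sum(samples) / N_POINTS
    std = (sum((x - mean) ** 2 for x in samples) / N_POINTS) ** 0.5
    # One sample beyond 3 sigma is enough to page someone.
    if any(abs(x - mean) / std > THRESHOLD for x in samples):
        false_alerts += 1

print(f"{false_alerts} of {N_METRICS} healthy metrics raised at least one alert")
```

With a per-sample rate of roughly 0.27%, the chance that a 500-point series stays quiet is about 0.9973^500 ≈ 26%, so the majority of metrics alert on pure noise.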
Traversal.ai's Approach: Combining Causal ML, Semantics & Agentic Control 07:11
Traversal.ai aims to achieve autonomous troubleshooting for previously unseen issues through a hybrid approach.
The system integrates causal machine learning (to identify cause-effect rather than mere correlation), advanced reasoning models for semantic understanding, and a novel agentic control flow.
"Swarm of agents" architecture enables thousands of parallel, agent-driven tool calls for exhaustive and efficient telemetry searching.
This approach combines statistical (causal ML), semantic (reasoning models), and agent-based (swarms) techniques for comprehensive automated debugging.
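The "swarm of agents" idea — fanning many tool calls out over telemetry in parallel and aggregating the hits — can be sketched with asyncio. Everything here (the shard data, `search_shard`, `swarm_search`) is hypothetical illustration; in a real deployment each task would be an agent querying an observability backend.

```python
import asyncio

# Hypothetical telemetry shards: {service name: log lines}. In practice these
# would be live queries against tools like DataDog or Splunk.
SHARDS = {
    f"service-{i}": [f"ok request {j}" for j in range(100)]
    for i in range(50)
}
SHARDS["service-17"].append("ERROR: connection pool exhausted")

async def search_shard(name: str, lines: list[str], needle: str) -> list[str]:
    # One "agent tool call": scan a single shard for suspicious lines.
    await asyncio.sleep(0)  # yield control, as a real network call would
    return [f"{name}: {line}" for line in lines if needle in line]

async def swarm_search(needle: str) -> list[str]:
    # Fan out one task per shard and gather all findings concurrently.
    tasks = [search_shard(n, lines, needle) for n, lines in SHARDS.items()]
    results = await asyncio.gather(*tasks)
    return [hit for hits in results for hit in hits]

findings = asyncio.run(swarm_search("ERROR"))
print(findings)
```

The exhaustiveness comes from every shard being searched, and the speed from the searches running concurrently rather than one agent walking the stack sequentially.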
Real-World Impact and Case Study: Digital Ocean 10:00
Before Traversal, DigitalOcean engineers faced frequent, time-consuming incident investigations often involving dozens of participants and billions of logs.
With Traversal, their mean time to resolution (MTTR) has dropped by about 40%, saving significant time and money per incident.
Traversal's AI conducts parallelized, exhaustive analysis, surfacing likely root causes to engineers within about five minutes.
It provides actionable findings, confidence levels, relevant data, and visualizations, such as impact maps, directly to incident channels and UI.
Engineers can interact further, querying the system for tailored insights about specific parts of the stack.
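The kind of confidence-ranked summary described above can be sketched as follows (a hypothetical data shape of my own, not Traversal's actual output format):

```python
from dataclasses import dataclass

@dataclass
class Finding:
    description: str
    confidence: float  # 0..1 likelihood score (hypothetical)

# Hypothetical candidate root causes surfaced by an automated analysis.
candidates = [
    Finding("recent deploy changed connection-pool sizing", 0.87),
    Finding("config push raised an upstream timeout", 0.41),
    Finding("routine node-pool autoscaling event", 0.12),
]

# Rank by confidence and format a short incident-channel summary.
ranked = sorted(candidates, key=lambda f: f.confidence, reverse=True)
for rank, f in enumerate(ranked, start=1):
    print(f"{rank}. [{f.confidence:.0%}] {f.description}")
```

Presenting a short ranked list with confidence levels lets responders triage the most likely cause first instead of reading raw search results.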
Traversal operates at large scale, processing data from a wide array of observability tools across client enterprises—handling trillions of logs.
The methods used (exhaustive agent swarms, causal reasoning) have potential applicability beyond observability, including network monitoring and cybersecurity, where similar "needle in a haystack" problems exist.
The team comprises experts in AI research, dev tools, product engineering, and high-frequency trading, and values a collaborative, supportive culture.
Prospective collaborators and other interested parties are invited to get in touch.