Production software keeps breaking and it will only get worse — Anish Agarwal, Traversal.ai

The Challenge of Production Software Troubleshooting 00:01

  • Software engineering consists of three parts: system design, development (including DevOps), and production troubleshooting.
  • AI tools are streamlining development, but system design and troubleshooting remain key challenges.
  • The ideal future would have AI handle both coding and troubleshooting, freeing engineers to focus on creative system design, but this vision faces obstacles.
  • As AI generates more code, human engineers will have less context and understanding, especially when systems become more complex.
  • The difficulty of troubleshooting will increase, leading to more time spent on on-call and incident management.

Current Troubleshooting Workflow and Its Drawbacks 02:55

  • Engineers rely heavily on dashboards from tools like Datadog, Grafana, Splunk, and Sentry to monitor system health.
  • When issues arise, teams engage in "dashboard dumpster diving," searching through vast numbers of dashboards and logs for clues.
  • Root cause analysis involves tracing issues back to pull requests or configuration changes, often requiring large-scale collaboration in incident channels.
  • The process is time-consuming, inefficient, and likely to worsen as systems grow more complex and human context is reduced.
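
The manual step of tracing an incident back to a recent pull request or configuration change can be sketched as a time-window filter. This is only an illustration of the idea, not a tool named in the talk: the `Change` record, the `candidate_changes` helper, and the two-hour lookback are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class Change(NamedTuple):
    """Hypothetical record of a deploy, merged PR, or config change."""
    when: datetime
    kind: str   # e.g. "pull_request" or "config"
    ref: str    # illustrative identifier only


def candidate_changes(changes, incident_start, lookback=timedelta(hours=2)):
    """Return changes that landed within `lookback` before the incident,
    most recent first. The window length is an assumption."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.when <= incident_start]
    return sorted(recent, key=lambda c: c.when, reverse=True)


# Made-up example data.
incident = datetime(2024, 1, 1, 12, 0)
history = [
    Change(datetime(2024, 1, 1, 8, 0), "pull_request", "change-a"),
    Change(datetime(2024, 1, 1, 11, 0), "config", "change-b"),
    Change(datetime(2024, 1, 1, 11, 45), "pull_request", "change-c"),
    Change(datetime(2024, 1, 1, 12, 30), "config", "change-d"),
]
suspects = candidate_changes(history, incident)
```

A filter like this only narrows the list of suspects. Proximity in time is a correlation, so engineers still have to confirm which change, if any, actually caused the issue.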

Why Traditional AI Approaches to Troubleshooting Fall Short 04:23

  • AIOps methods using traditional machine learning and statistical anomaly detection generate excessive false positives and do not scale with system complexity.
  • LLMs like ChatGPT can analyze individual logs but cannot process the immense volume (terabytes/trillions of logs) present in real-world systems.
  • LLMs also lack strong numerical data understanding and are limited by context and memory constraints.
  • Agent-based approaches relying on runbooks fail because runbooks quickly become outdated, and exhaustive agent searches take too long for urgent production issues.
  • No single existing method is adequate for efficient, autonomous troubleshooting.
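
A back-of-the-envelope calculation shows why threshold-based anomaly detection produces so many false positives as systems grow. The sketch below assumes each metric is independent Gaussian noise checked against a fixed z-score threshold. That is a deliberate simplification, and the metric counts are illustrative numbers, not figures from the talk.

```python
import math


def tail_probability(z_threshold: float) -> float:
    """Two-sided probability that a standard normal sample exceeds
    the threshold in absolute value, purely by chance."""
    return math.erfc(z_threshold / math.sqrt(2))


def expected_false_alerts(num_metrics: int, checks_per_day: int,
                          z_threshold: float) -> float:
    """Expected number of chance-only alerts per day, assuming
    independent, normally distributed metrics (an assumption)."""
    return num_metrics * checks_per_day * tail_probability(z_threshold)


# Illustrative scale: 10,000 metrics, each checked once a minute,
# alerting at 3 standard deviations.
alerts = expected_false_alerts(10_000, 24 * 60, 3.0)
print(f"{alerts:,.0f} expected false alerts per day")
```

Under these assumptions a healthy system still produces roughly 39,000 alerts a day. The count grows linearly with the number of metrics, which is the scaling problem described above.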

Traversal.ai's Approach: Combining Causal ML, Semantics & Agentic Control 07:11

  • Traversal.ai aims to achieve autonomous troubleshooting for previously unseen issues through a hybrid approach.
  • The system integrates causal machine learning (to identify cause-effect rather than mere correlation), advanced reasoning models for semantic understanding, and a novel agentic control flow.
  • "Swarm of agents" architecture enables thousands of parallel, agent-driven tool calls for exhaustive and efficient telemetry searching.
  • This approach combines statistical (causal ML), semantic (reasoning models), and agent-based (swarms) techniques for comprehensive automated debugging.
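
The fan-out pattern behind many parallel tool calls can be sketched with `asyncio`. This is a minimal illustration of bounded concurrent fan-out, not Traversal's implementation: `check_source`, its placeholder scoring rule, and the concurrency limit are all hypothetical.

```python
import asyncio


async def check_source(source: str, semaphore: asyncio.Semaphore) -> tuple[str, float]:
    """Hypothetical tool call that scores how suspicious one telemetry
    source looks. A real agent would query an observability backend."""
    async with semaphore:          # cap the number of in-flight calls
        await asyncio.sleep(0)     # stand-in for network I/O
        # Placeholder rule so the example is deterministic.
        score = 1.0 if "payments" in source else 0.1
        return source, score


async def fan_out(sources: list[str], max_concurrency: int = 100) -> list[tuple[str, float]]:
    """Run one check per source concurrently and rank the results."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results = await asyncio.gather(*(check_source(s, semaphore) for s in sources))
    return sorted(results, key=lambda r: r[1], reverse=True)


def rank_sources(sources: list[str], max_concurrency: int = 100) -> list[tuple[str, float]]:
    """Synchronous entry point for the sketch."""
    return asyncio.run(fan_out(sources, max_concurrency))


# Made-up source names.
ranked = rank_sources([f"svc-{i}" for i in range(1000)] + ["payments-db"])
```

The semaphore keeps the search exhaustive without overwhelming the backends being queried. Ranking the scored results is a stand-in for the statistical and semantic filtering the talk describes.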

Real-World Impact and Case Study: DigitalOcean 10:00

  • Before Traversal, DigitalOcean engineers faced frequent, time-consuming incident investigations, often involving dozens of participants and billions of logs.
  • With Traversal, their mean time to resolution (MTTR) has dropped by about 40%, saving significant time and money per incident.
  • Traversal's AI conducts parallelized, exhaustive analysis, surfacing likely root causes to engineers within about five minutes.
  • It delivers actionable findings, confidence levels, relevant data, and visualizations such as impact maps directly to incident channels and the UI.
  • Engineers can interact further, querying the system for tailored insights about specific parts of the stack.

Scalability, Broader Applications, and Team 15:56

  • Traversal operates at large scale, handling trillions of logs from a wide array of observability tools across client enterprises.
  • The methods used (exhaustive agent swarms, causal reasoning) could apply beyond observability, including network monitoring and cybersecurity, where similar "needle in a haystack" problems exist.
  • The team comprises experts in AI research, dev tools, product engineering, and high-frequency trading, and values a collaborative, supportive culture.
  • Prospective collaborators or interested parties are invited to engage.