CI in the Era of AI: From Unit Tests to Stochastic Evals — Nathan Sobo, Zed

Zed Editor and Testing Philosophy 00:20

  • Zed is an AI-enabled code editor, implemented from scratch in Rust and engineered like a video game: it renders with GPU shaders to deliver 120 frames per second.
  • The team recently launched Agentic Editing in Zed and developed an empirical approach to test its reliability.
  • Rigorous testing is fundamental; without its tens of thousands of tests, Zed would crash every eight seconds.
  • An extreme example simulates the scheduler itself: tests run for many iterations (e.g., 50), each with a different random interleaving of concurrent events, and any failing interleaving can be replayed from its seed to isolate and control the failure (a minimal sketch follows).
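
A minimal sketch of the replay idea, assuming a generic seeded harness (the `run_randomized` function below is hypothetical, not Zed's API; the real framework simulates Zed's async executor rather than a toy event list):

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Hypothetical harness: run a test body many times, each with a
/// seeded RNG that decides how concurrent events interleave, and
/// report the failing seed so it can be replayed deterministically.
fn run_randomized<F>(iterations: u64, mut test_body: F)
where
    F: FnMut(&mut StdRng),
{
    for seed in 0..iterations {
        let mut rng = StdRng::seed_from_u64(seed);
        let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
            test_body(&mut rng)
        }));
        assert!(
            result.is_ok(),
            "failed with seed {seed}; re-run seed {seed} to replay deterministically"
        );
    }
}

fn main() {
    run_randomized(50, |rng| {
        // Stand-in for the simulated scheduler: pick the next pending
        // event at random; a real harness would drive async tasks.
        let mut pending = vec!["write", "save", "reload"];
        while !pending.is_empty() {
            let _event = pending.remove(rng.gen_range(0..pending.len()));
            // ... step the event, then assert invariants.
        }
    });
}
```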

Embracing Stochastic Behavior with LLMs 02:12

  • Prior to AI features, Zed's tests were fully deterministic, with no flaky tests on CI.
  • The introduction of Large Language Models (LLMs) makes testing inherently non-deterministic, as a single token change in the input can lead to a completely different output.
  • The team had to embrace stochastic behavior in their testing approach for LLM-powered features.

Evolving Evaluation Approaches 02:52

  • Initial evaluations were data-driven (input/output), similar to those seen in the machine learning world.
  • From a software engineering perspective, an eval looks more like a test that passes or fails, which led the team to build more programmatic evals.
  • Programmatic evals involve compiling a headless copy of Zed, checking out a repository, running the agent, and performing assertions about its actions.
  • Granular evals were necessary because large, high-level evals made it difficult to diagnose specific failure modes.
  • A stochastic test surfaced an algorithmic problem in the grep tool, which was then driven into a more deterministic test using Tree-sitter to expand matches to syntactic boundaries, substantially improving the agent (a minimal sketch of the expansion follows).
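
Zed's grep tool isn't shown in the talk, but the boundary-expansion idea can be sketched with the `tree-sitter` and `tree-sitter-rust` crates: grow a raw match range to the smallest named syntax node that encloses it. The function name and example are illustrative only:

```rust
use tree_sitter::{Node, Parser};

/// Expand a raw match range to the smallest named syntax node that
/// encloses it, so a grep hit covers a whole syntactic unit (e.g., a
/// function) instead of a fragment of a line.
fn expand_to_syntactic_boundary(root: Node, start: usize, end: usize) -> (usize, usize) {
    let node = root
        .named_descendant_for_byte_range(start, end)
        .unwrap_or(root);
    (node.start_byte(), node.end_byte())
}

fn main() {
    let source = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}\n";
    let mut parser = Parser::new();
    // Grammar loading varies by tree-sitter-rust version; this uses
    // the LANGUAGE constant from recent releases.
    parser
        .set_language(&tree_sitter_rust::LANGUAGE.into())
        .expect("load Rust grammar");
    let tree = parser.parse(source, None).expect("parse");

    // A hit that straddles the signature and body expands to the
    // whole function item.
    let needle = "i32 {";
    let hit = source.find(needle).unwrap();
    let (s, e) =
        expand_to_syntactic_boundary(tree.root_node(), hit, hit + needle.len());
    println!("{}", &source[s..e]);
}
```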

Addressing LLM Output Challenges 05:03

  • Editing via structured tool calls streamed poorly, so the solution was a small tool call that describes the edit, after which the model streams old_text/new_text blocks.
  • Stochastic tests were implemented on top, such as running an eval 200 times and failing the build unless all 200 iterations pass.
  • Common LLM-related issues and their solutions included:
    • Robust parsing of arbitrarily chunked streaming input, tested by feeding the same input in random chunkings across 100 iterations (see the parser test sketch after this list).
    • A dynamic-programming fuzzy-matching algorithm proved critical for locating slightly incorrect LLM output in the buffer, and was deterministically testable (sketched after this list).
    • A deterministic streaming diff compares the incoming new text against the old text to decide, on the fly, whether trailing old text was deleted or simply not yet streamed.
    • The model sometimes produced empty old_text tags when inserting at the very top or bottom of a document; a simple prompt addition helped, but the remaining 1-2% of cases still needed robust handling.
    • Mismatched XML tags (e.g., an old_text block closed with a new_text tag) fell from a 40% to a 5% failure rate after a prompt fix ("always close all tags properly"), with the remaining cases handled by robust, deterministically tested parsing.
    • Indentation problems, where the model would flatten leading whitespace, were addressed with an indent-delta strategy that renormalizes the indent (sketched after this list).
    • Odd escaping behavior (e.g., HTML escape codes, doubled backslashes, and Gemini double-escaping newlines) was fixed almost entirely with a prompt adjustment.
  • Many of the problems encountered were "stupid things" the model would do, rather than advanced machine learning problems.
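
The parser itself isn't shown in the talk. The sketch below illustrates the chunk-invariance test with a deliberately simplified incremental parser (the hypothetical `BlockParser` extracts only `<old_text>` blocks): the same input, split at random points across 100 iterations, must always parse to the same result.

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Minimal incremental parser: accumulates streamed chunks and
/// extracts each <old_text>...</old_text> block as soon as it closes.
#[derive(Default)]
struct BlockParser {
    buffer: String,
    blocks: Vec<String>,
}

impl BlockParser {
    fn push(&mut self, chunk: &str) {
        self.buffer.push_str(chunk);
        // Drain every fully closed block currently in the buffer.
        while let (Some(start), Some(end)) =
            (self.buffer.find("<old_text>"), self.buffer.find("</old_text>"))
        {
            let body = self.buffer[start + "<old_text>".len()..end].to_string();
            self.blocks.push(body);
            self.buffer.replace_range(..end + "</old_text>".len(), "");
        }
    }
}

fn main() {
    let input = "<old_text>foo()</old_text><old_text>bar()</old_text>";
    let expected = vec!["foo()".to_string(), "bar()".to_string()];
    for seed in 0..100 {
        let mut rng = StdRng::seed_from_u64(seed);
        let mut parser = BlockParser::default();
        let mut rest = input;
        while !rest.is_empty() {
            // Input is ASCII, so any split point is a char boundary.
            let cut = rng.gen_range(1..=rest.len());
            let (chunk, tail) = rest.split_at(cut);
            parser.push(chunk);
            rest = tail;
        }
        assert_eq!(parser.blocks, expected, "diverged with seed {seed}");
    }
    println!("all 100 chunkings parsed identically");
}
```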
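
The talk identifies the fuzzy matcher only as a dynamic program; a plausible shape is approximate substring search in the style of Sellers' algorithm, sketched below. This is an assumption about the approach, not Zed's actual code:

```rust
/// Find the window of `haystack` with the smallest edit distance to
/// `needle`. Returns (start, end, distance) in char indices.
fn fuzzy_find(haystack: &str, needle: &str) -> (usize, usize, usize) {
    let h: Vec<char> = haystack.chars().collect();
    let n: Vec<char> = needle.chars().collect();
    let m = n.len();
    // Column state: dist[i] = best edit distance of needle[..i] against
    // a window ending at the current haystack position; start[i] is
    // where that window begins.
    let mut dist: Vec<usize> = (0..=m).collect();
    let mut start: Vec<usize> = vec![0; m + 1];
    let (mut best_d, mut best_s, mut best_e) = (dist[m], 0, 0);
    for (j, &hc) in h.iter().enumerate() {
        let (mut diag_d, mut diag_s) = (dist[0], start[0]);
        start[0] = j + 1; // a match may begin at any position for free
        for i in 1..=m {
            let sub = usize::from(n[i - 1] != hc);
            let (d, s) = [
                (diag_d + sub, diag_s),          // match or substitute
                (dist[i - 1] + 1, start[i - 1]), // unmatched needle char
                (dist[i] + 1, start[i]),         // extra haystack char
            ]
            .into_iter()
            .min_by_key(|&(d, _)| d)
            .unwrap();
            diag_d = dist[i];
            diag_s = start[i];
            dist[i] = d;
            start[i] = s;
        }
        if dist[m] < best_d {
            (best_d, best_s, best_e) = (dist[m], start[m], j + 1);
        }
    }
    (best_s, best_e, best_d)
}

fn main() {
    let buffer = "fn add(a: i32, b: i32) -> i32 { a + b }";
    // The model emitted a slightly wrong old_text ("ab" instead of "b").
    let (s, e, d) = fuzzy_find(buffer, "fn add(a: i32, ab: i32)");
    let matched: String = buffer.chars().skip(s).take(e - s).collect();
    println!("matched {matched:?} with edit distance {d}");
}
```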
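
The indent-delta strategy is also easy to sketch under one assumption: measure the indentation difference between the first line of the text being replaced and the first line the model emitted, then shift every emitted line by that delta (space indentation only, for brevity):

```rust
/// Count leading spaces on a line.
fn indent_width(line: &str) -> usize {
    line.len() - line.trim_start_matches(' ').len()
}

/// Renormalize the model's (possibly flattened) replacement text so
/// its first line lines up with the text it replaces, shifting every
/// line by the same delta.
fn renormalize_indent(buffer_text: &str, model_text: &str) -> String {
    let buffer_indent = buffer_text.lines().next().map_or(0, indent_width);
    let model_indent = model_text.lines().next().map_or(0, indent_width);
    let delta = buffer_indent as isize - model_indent as isize;
    model_text
        .lines()
        .map(|line| {
            let shifted = (indent_width(line) as isize + delta).max(0) as usize;
            format!("{}{}", " ".repeat(shifted), line.trim_start_matches(' '))
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let original = "        if ready {\n            go();\n        }";
    let flattened = "if ready {\n    go();\n}";
    // Every line is shifted right by 8 spaces to match the original context.
    println!("{}", renormalize_indent(original, flattened));
}
```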

Lessons Learned and Future Direction 12:22

  • Rigorous testing remains fundamental to building reliable software with LLMs; it just becomes statistical, with tests run 100-200 times and a pass-rate threshold asserted (see the sketch after this list).
  • The process involved starting with zoomed-out evaluations, then zooming into stochastic unit tests focused on specific aspects, and finally driving these into traditional deterministic tests.
  • No special external tools or eval frameworks were needed; the team leveraged their existing test suite infrastructure and software engineering skills.
  • Zed is open source under the GPL license, and contributions for improvement are welcomed.
  • The speaker noted that Claude 4 models can now write Rust efficiently in agentic workflows.
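
As a closing illustration, the statistical assertion described above reduces to a few lines; `run_eval_once` is a placeholder for compiling headless Zed and driving the agent against a repository:

```rust
/// Placeholder for one eval run: in Zed's setup this would drive a
/// headless agent against a checked-out repository and assert on its
/// actions, returning pass/fail.
fn run_eval_once(iteration: u64) -> bool {
    iteration % 50 != 0 // stand-in: ~98% pass rate
}

fn main() {
    const RUNS: u64 = 200;
    const THRESHOLD: f64 = 0.95; // tune per eval; some require 1.0
    let passes = (0..RUNS).filter(|&i| run_eval_once(i)).count();
    let rate = passes as f64 / RUNS as f64;
    println!("pass rate: {:.1}% ({passes}/{RUNS})", rate * 100.0);
    assert!(
        rate >= THRESHOLD,
        "pass rate {rate:.3} fell below threshold {THRESHOLD}"
    );
}
```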