CI in the Era of AI: From Unit Tests to Stochastic Evals — Nathan Sobo, Zed

Zed Editor and Testing Philosophy 00:20

  • Zed is an AI-enabled code editor, implemented from scratch in Rust and engineered like a video game: it renders with GPU shaders to deliver 120 frames per second.
  • The team recently launched Agentic Editing in Zed and developed an empirical approach to test its reliability.
  • Rigorous testing is fundamental; without its tens of thousands of tests, Zed would crash every eight seconds.
  • An extreme example simulates the scheduler itself: tests run for many iterations (e.g., 50), each with a different random interleaving of concurrent events, and any failing interleaving can be replayed from its seed to isolate and control the failure (a minimal sketch follows).
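
A minimal sketch of the replay idea, assuming a generic seeded harness (the `run_randomized` function below is hypothetical, not Zed's API; the real framework simulates Zed's async executor rather than a toy event list):

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Hypothetical harness: run a test body many times, each with a
/// seeded RNG that decides how concurrent events interleave, and
/// report the failing seed so it can be replayed deterministically.
fn run_randomized<F>(iterations: u64, mut test_body: F)
where
    F: FnMut(&mut StdRng),
{
    for seed in 0..iterations {
        let mut rng = StdRng::seed_from_u64(seed);
        let result = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
            test_body(&mut rng)
        }));
        assert!(
            result.is_ok(),
            "failed with seed {seed}; re-run seed {seed} to replay deterministically"
        );
    }
}

fn main() {
    run_randomized(50, |rng| {
        // Stand-in for the simulated scheduler: pick the next pending
        // event at random; a real harness would drive async tasks.
        let mut pending = vec!["write", "save", "reload"];
        while !pending.is_empty() {
            let _event = pending.remove(rng.gen_range(0..pending.len()));
            // ... step the event, then assert invariants.
        }
    });
}
```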

Embracing Stochastic Behavior with LLMs 02:12

  • Prior to AI features, Zed's tests were fully deterministic, with no flaky tests on CI.
  • The introduction of Large Language Models (LLMs) makes testing inherently non-deterministic, as a single token change in the input can lead to a completely different output.
  • The team had to embrace stochastic behavior in their testing approach for LLM-powered features.

Evolving Evaluation Approaches 02:52

  • Initial evaluations were data-driven (input/output), similar to those seen in the machine learning world.
  • From a software engineering perspective, an eval looks more like a test that passes or fails, which led the team to build more programmatic evals.
  • Programmatic evals involve compiling a headless copy of Zed, checking out a repository, running the agent, and performing assertions about its actions.
  • Granular evals were necessary because large, high-level evals made it difficult to diagnose specific failure modes.
  • A stochastic test surfaced an algorithmic problem in the grep tool, which was then driven into a more deterministic test using Tree-sitter to expand matches to syntactic boundaries, substantially improving the agent (a minimal sketch of the expansion follows).
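
Zed's grep tool isn't shown in the talk, but the boundary-expansion idea can be sketched with the `tree-sitter` and `tree-sitter-rust` crates: grow a raw match range to the smallest named syntax node that encloses it. The function name and example are illustrative only:

```rust
use tree_sitter::{Node, Parser};

/// Expand a raw match range to the smallest named syntax node that
/// encloses it, so a grep hit covers a whole syntactic unit (e.g., a
/// function) instead of a fragment of a line.
fn expand_to_syntactic_boundary(root: Node, start: usize, end: usize) -> (usize, usize) {
    let node = root
        .named_descendant_for_byte_range(start, end)
        .unwrap_or(root);
    (node.start_byte(), node.end_byte())
}

fn main() {
    let source = "fn add(a: i32, b: i32) -> i32 {\n    a + b\n}\n";
    let mut parser = Parser::new();
    // Grammar loading varies by tree-sitter-rust version; this uses
    // the LANGUAGE constant from recent releases.
    parser
        .set_language(&tree_sitter_rust::LANGUAGE.into())
        .expect("load Rust grammar");
    let tree = parser.parse(source, None).expect("parse");

    // A hit that straddles the signature and body expands to the
    // whole function item.
    let needle = "i32 {";
    let hit = source.find(needle).unwrap();
    let (s, e) =
        expand_to_syntactic_boundary(tree.root_node(), hit, hit + needle.len());
    println!("{}", &source[s..e]);
}
```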

Addressing LLM Output Challenges 05:03

  • Editing via structured tool calls streamed poorly, so the solution was a small tool call that describes the edit, after which the model streams old_text/new_text blocks.
  • Stochastic tests were implemented on top, such as running an eval 200 times and failing the build unless all 200 iterations pass.
  • Common LLM-related issues and their solutions included:
    • Robust parsing of arbitrarily chunked streaming input, tested by feeding the same input in random chunkings across 100 iterations (see the parser test sketch after this list).
    • A dynamic-programming fuzzy-matching algorithm proved critical for locating slightly incorrect LLM output in the buffer, and was deterministically testable (sketched after this list).
    • A deterministic streaming diff compares the incoming new text against the old text to decide, on the fly, whether trailing old text was deleted or simply not yet streamed.
    • The model sometimes produced empty old_text tags when inserting at the very top or bottom of a document; a simple prompt addition helped, but the remaining 1-2% of cases still needed robust handling.
    • Mismatched XML tags (e.g., an old_text block closed with a new_text tag) fell from a 40% to a 5% failure rate after a prompt fix ("always close all tags properly"), with the remaining cases handled by robust, deterministically tested parsing.
    • Indentation problems, where the model would flatten leading whitespace, were addressed with an indent-delta strategy that renormalizes the indent (sketched after this list).
    • Odd escaping behavior (e.g., HTML escape codes, doubled backslashes, and Gemini double-escaping newlines) was fixed almost entirely with a prompt adjustment.
  • Many of the problems encountered were "stupid things" the model would do, rather than advanced machine learning problems.
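
The parser itself isn't shown in the talk. The sketch below illustrates the chunk-invariance test with a deliberately simplified incremental parser (the hypothetical `BlockParser` extracts only `<old_text>` blocks): the same input, split at random points across 100 iterations, must always parse to the same result.

```rust
use rand::{rngs::StdRng, Rng, SeedableRng};

/// Minimal incremental parser: accumulates streamed chunks and
/// extracts each <old_text>...</old_text> block as soon as it closes.
#[derive(Default)]
struct BlockParser {
    buffer: String,
    blocks: Vec<String>,
}

impl BlockParser {
    fn push(&mut self, chunk: &str) {
        self.buffer.push_str(chunk);
        // Drain every fully closed block currently in the buffer.
        while let (Some(start), Some(end)) =
            (self.buffer.find("<old_text>"), self.buffer.find("</old_text>"))
        {
            let body = self.buffer[start + "<old_text>".len()..end].to_string();
            self.blocks.push(body);
            self.buffer.replace_range(..end + "</old_text>".len(), "");
        }
    }
}

fn main() {
    let input = "<old_text>foo()</old_text><old_text>bar()</old_text>";
    let expected = vec!["foo()".to_string(), "bar()".to_string()];
    for seed in 0..100 {
        let mut rng = StdRng::seed_from_u64(seed);
        let mut parser = BlockParser::default();
        let mut rest = input;
        while !rest.is_empty() {
            // Input is ASCII, so any split point is a char boundary.
            let cut = rng.gen_range(1..=rest.len());
            let (chunk, tail) = rest.split_at(cut);
            parser.push(chunk);
            rest = tail;
        }
        assert_eq!(parser.blocks, expected, "diverged with seed {seed}");
    }
    println!("all 100 chunkings parsed identically");
}
```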
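
The talk identifies the fuzzy matcher only as a dynamic program; a plausible shape is approximate substring search in the style of Sellers' algorithm, sketched below. This is an assumption about the approach, not Zed's actual code:

```rust
/// Find the window of `haystack` with the smallest edit distance to
/// `needle`. Returns (start, end, distance) in char indices.
fn fuzzy_find(haystack: &str, needle: &str) -> (usize, usize, usize) {
    let h: Vec<char> = haystack.chars().collect();
    let n: Vec<char> = needle.chars().collect();
    let m = n.len();
    // Column state: dist[i] = best edit distance of needle[..i] against
    // a window ending at the current haystack position; start[i] is
    // where that window begins.
    let mut dist: Vec<usize> = (0..=m).collect();
    let mut start: Vec<usize> = vec![0; m + 1];
    let (mut best_d, mut best_s, mut best_e) = (dist[m], 0, 0);
    for (j, &hc) in h.iter().enumerate() {
        let (mut diag_d, mut diag_s) = (dist[0], start[0]);
        start[0] = j + 1; // a match may begin at any position for free
        for i in 1..=m {
            let sub = usize::from(n[i - 1] != hc);
            let (d, s) = [
                (diag_d + sub, diag_s),          // match or substitute
                (dist[i - 1] + 1, start[i - 1]), // unmatched needle char
                (dist[i] + 1, start[i]),         // extra haystack char
            ]
            .into_iter()
            .min_by_key(|&(d, _)| d)
            .unwrap();
            diag_d = dist[i];
            diag_s = start[i];
            dist[i] = d;
            start[i] = s;
        }
        if dist[m] < best_d {
            (best_d, best_s, best_e) = (dist[m], start[m], j + 1);
        }
    }
    (best_s, best_e, best_d)
}

fn main() {
    let buffer = "fn add(a: i32, b: i32) -> i32 { a + b }";
    // The model emitted a slightly wrong old_text ("ab" instead of "b").
    let (s, e, d) = fuzzy_find(buffer, "fn add(a: i32, ab: i32)");
    let matched: String = buffer.chars().skip(s).take(e - s).collect();
    println!("matched {matched:?} with edit distance {d}");
}
```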
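
The indent-delta strategy is also easy to sketch under one assumption: measure the indentation difference between the first line of the text being replaced and the first line the model emitted, then shift every emitted line by that delta (space indentation only, for brevity):

```rust
/// Count leading spaces on a line.
fn indent_width(line: &str) -> usize {
    line.len() - line.trim_start_matches(' ').len()
}

/// Renormalize the model's (possibly flattened) replacement text so
/// its first line lines up with the text it replaces, shifting every
/// line by the same delta.
fn renormalize_indent(buffer_text: &str, model_text: &str) -> String {
    let buffer_indent = buffer_text.lines().next().map_or(0, indent_width);
    let model_indent = model_text.lines().next().map_or(0, indent_width);
    let delta = buffer_indent as isize - model_indent as isize;
    model_text
        .lines()
        .map(|line| {
            let shifted = (indent_width(line) as isize + delta).max(0) as usize;
            format!("{}{}", " ".repeat(shifted), line.trim_start_matches(' '))
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let original = "        if ready {\n            go();\n        }";
    let flattened = "if ready {\n    go();\n}";
    // Every line is shifted right by 8 spaces to match the original context.
    println!("{}", renormalize_indent(original, flattened));
}
```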

Lessons Learned and Future Direction 12:22

  • Rigorous testing remains fundamental to building reliable software with LLMs; it just becomes statistical, with tests run 100-200 times and a pass-rate threshold asserted (see the sketch after this list).
  • The process involved starting with zoomed-out evaluations, then zooming into stochastic unit tests focused on specific aspects, and finally driving these into traditional deterministic tests.
  • No special external tools or eval frameworks were needed; the team leveraged their existing test suite infrastructure and software engineering skills.
  • Zed is open source under the GPL license, and contributions for improvement are welcomed.
  • The speaker noted that Claude 4 models can now write Rust efficiently in agentic workflows.
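
As a closing illustration, the statistical assertion described above reduces to a few lines; `run_eval_once` is a placeholder for compiling headless Zed and driving the agent against a repository:

```rust
/// Placeholder for one eval run: in Zed's setup this would drive a
/// headless agent against a checked-out repository and assert on its
/// actions, returning pass/fail.
fn run_eval_once(iteration: u64) -> bool {
    iteration % 50 != 0 // stand-in: ~98% pass rate
}

fn main() {
    const RUNS: u64 = 200;
    const THRESHOLD: f64 = 0.95; // tune per eval; some require 1.0
    let passes = (0..RUNS).filter(|&i| run_eval_once(i)).count();
    let rate = passes as f64 / RUNS as f64;
    println!("pass rate: {:.1}% ({passes}/{RUNS})", rate * 100.0);
    assert!(
        rate >= THRESHOLD,
        "pass rate {rate:.3} fell below threshold {THRESHOLD}"
    );
}
```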