How to Improve Your Vibe Coding — Ian Butler

Agentic Coding Solutions & Benchmarking 00:00

  • Ian Butler introduces himself as CEO of Bismuth, an end-to-end agentic coding solution.
  • Bismuth has spent several months running evaluations (evals) of agent performance at finding and fixing bugs.
  • Bismuth released a benchmark covering these assessments.
  • Agents currently have a low success rate in detecting real bugs and generate a significant number of false positives.
  • Popular agents like Devin and Cursor exhibit less than a 10% true positive rate for bug finding.
  • Agents often overwhelm developers with false bug reports, leading to inefficiency and "bad vibes" in coding workflows.

False Positives and Real-World Impact 01:14

  • Three of the six agents tested had a true positive rate of 10% or lower across more than 900 reports.
  • One agent produced 70 issues for a single task, every one of them a false positive.
  • Most developers will not sift through a large volume of inaccurate reports.
  • Cursor, for example, showed a 97% false positive rate across 100+ repositories and 1,200+ issues (see the quick arithmetic after this list).
  • High false positive rates cause alert fatigue and erode developer trust, letting some real bugs reach production.
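
To make those rates concrete, here is a quick back-of-the-envelope check on the Cursor figures above (the counts come from the talk; the arithmetic is just an illustration):

```python
# A 97% false positive rate over ~1,200 issues implies only ~36 real bugs,
# i.e., roughly 3 genuine findings per 100 reports a developer must read.
issues = 1200
false_positive_rate = 0.97
print(round(issues * (1 - false_positive_rate)))  # 36
```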

Practical Tips for Cleaner Vibe Coding 02:04

  • Use bug-focused rules: customize agent rules files with scoped, detailed instructions, especially about security and logic bugs.
  • Context management is critical: agents lose track of logic over time and struggle with large or complex codebases.
  • Most significant bugs are complex and embedded in multi-step processes that are hard for agents to follow without strong context.

Improving Agent Performance with Rules 03:03

  • Incorporate specific security guidance (such as the OWASP Top 10) into agent instructions to bias models toward relevant bug detection.
  • Clearly name and specify explicit classes of bugs in rules files for better detection (e.g., “look for auth bypasses” rather than “look for bugs”).
  • Require fix validation: instruct agents to write and run tests to confirm bugs have been resolved before merging code.
  • Structured rules reduce vague requests and alert fatigue, leading to higher-quality agent output; a sample rules snippet follows this list.
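
As a concrete illustration, here is a minimal, hypothetical rules snippet along these lines. The file name and format vary by tool (Cursor reads project rules from `.cursor/rules/`; many agents read an `AGENTS.md`), and this sketch is illustrative rather than a template from the talk:

```markdown
## Bug-finding rules (illustrative)

- Prioritize security bugs: check the OWASP Top 10, especially injection,
  broken access control, and auth bypasses.
- Name the bug class in every report (e.g., "IDOR in /orders/:id",
  not "possible bug").
- Only report an issue if you can point to the exact lines and describe
  a concrete trigger; otherwise stay silent.
- Before declaring a bug fixed, write a regression test that fails on the
  old code, run the suite, and confirm the new test passes.
```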

Managing Context and Agent Limitations 04:21

  • Agents struggle with cross-repository navigation and often summarize or drop context as context-window limits are reached, reducing bug-detection capability.
  • Users should proactively manage agent context by supplying code diffs and ensuring key files remain within the context window (see the diff sketch after this list).
  • Requesting a component inventory (e.g., an index of classes, functions, and key variables) helps the agent build a map of the code and find bugs; a local sketch of the same idea follows below.
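
As a sketch of the diff-first tip above: assuming a git repository, you can pull a focused diff and paste it (plus the touched files) into the agent's context so the relevant logic stays inside the window. The helper name, default branch, and example path here are hypothetical:

```python
# Sketch: collect a focused diff to hand the agent instead of the whole
# repo, keeping the relevant change inside the context window.
import subprocess

def focused_diff(base: str = "main", paths: list[str] | None = None) -> str:
    """Return the diff against `base`, optionally limited to `paths`."""
    cmd = ["git", "diff", base, "--", *(paths or [])]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    print(focused_diff(paths=["src/auth/"]))  # hypothetical path
```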
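And for the component-inventory tip: you can ask the agent itself to produce the index, or build a rough one locally and paste it in. A minimal sketch using Python's standard `ast` module, assuming a Python codebase:

```python
import ast
import pathlib

def inventory(root: str) -> dict[str, list[str]]:
    """Index classes and functions per file as a rough 'component inventory'.

    Real projects may need smarter filtering (tests, vendored code,
    generated files); this is a starting point, not a complete tool.
    """
    index: dict[str, list[str]] = {}
    for path in pathlib.Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files the parser can't handle
        names = [
            f"{type(node).__name__}: {node.name}"
            for node in ast.walk(tree)
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
        ]
        if names:
            index[str(path)] = names
    return index

if __name__ == "__main__":
    for file, names in inventory(".").items():
        print(file)
        for name in names:
            print("  ", name)
```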

The Superiority and Limits of Thinking Models 05:30

  • “Thinking models” show greater success in identifying and following complex bug patterns within codebases.
  • These models produce explicit reasoning traces and can follow deeper chains of logic through the code, uncovering more complex issues.
  • There is considerable run-to-run variability: even thinking models do not take a fully holistic view, so results differ across runs.
  • Users often need to run an agent several times to assemble a comprehensive bug report, a known limitation of current technology; a sketch of aggregating runs follows this list.
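
A pragmatic workaround for that variability is to run the agent several times and triage the findings that recur. The sketch below assumes a `run_agent(repo)` callable returning finding strings; it is a hypothetical stand-in for whatever CLI or API your agent exposes:

```python
from collections import Counter
from typing import Callable, Iterable

def aggregate_findings(
    run_agent: Callable[[str], Iterable[str]],  # hypothetical agent invoker
    repo: str,
    runs: int = 5,
) -> list[tuple[str, int]]:
    """Union findings across several runs; recurring ones rank first."""
    counts: Counter[str] = Counter()
    for _ in range(runs):
        counts.update(run_agent(repo))  # e.g., ["auth bypass in login()", ...]
    return counts.most_common()
```

Findings that recur across most runs deserve attention first; one-off reports are more likely false positives, given the rates quoted earlier.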

Bismuth Capabilities and Benchmark Access 06:39

  • Bismuth offers automated PR creation, vulnerability scanning, and code review, and integrates with GitHub, GitLab, Jira, and Linear.
  • On-premises deployments are available for organizations with specific requirements.
  • A QR code links to the full benchmark, methodology, results, and data sets for public review.
  • Ian encourages viewers to access the benchmark for detailed insights on agent performance in finding and fixing bugs.