How to Improve Your Vibe Coding — Ian Butler

Agentic Coding Solutions & Benchmarking 00:00

  • Ian Butler introduces himself as CEO of Bismuth, an end-to-end agentic coding solution.
  • Bismuth has spent several months running evaluations (evals) of agent performance at finding and fixing bugs.
  • Bismuth released a benchmark covering these assessments.
  • Agents currently have a low success rate in detecting real bugs and generate a significant number of false positives.
  • Popular agents like Devin and Cursor exhibit less than a 10% true positive rate for bug finding.
  • Agents often overwhelm developers with false bug reports, leading to inefficiency and "bad vibes" in coding workflows.

False Positives and Real-World Impact 01:14

  • Three of the six agents tested had a true positive rate of 10% or lower across more than 900 reports.
  • One agent produced 70 issues for a single task, every one of them a false positive.
  • Most developers will not sift through a large volume of inaccurate reports.
  • Cursor, for example, showed a 97% false positive rate across 100+ repositories and 1,200+ issues (see the quick arithmetic after this list).
  • High false positive rates cause alert fatigue and erode developer trust, letting some real bugs reach production.
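
To make those rates concrete, here is a quick back-of-the-envelope check on the Cursor figures above (the counts come from the talk; the arithmetic is just an illustration):

```python
# A 97% false positive rate over ~1,200 issues implies only ~36 real bugs,
# i.e., roughly 3 genuine findings per 100 reports a developer must read.
issues = 1200
false_positive_rate = 0.97
print(round(issues * (1 - false_positive_rate)))  # 36
```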

Practical Tips for Cleaner Vibe Coding 02:04

  • Use bug-focused rules: customize agent rules files with scoped, detailed instructions, especially about security and logic bugs.
  • Context management is critical: agents lose track of logic over time and struggle with large or complex codebases.
  • Most significant bugs are complex and embedded in multi-step processes that are hard for agents to follow without strong context.

Improving Agent Performance with Rules 03:03

  • Incorporate specific security guidance (such as the OWASP Top 10) into agent instructions to bias models toward relevant bug detection.
  • Clearly name and specify explicit classes of bugs in rules files for better detection (e.g., “look for auth bypasses” rather than “look for bugs”).
  • Require fix validation: instruct agents to write and run tests to confirm bugs have been resolved before merging code.
  • Structured rules reduce vague requests and alert fatigue, leading to higher-quality agent output; a sample rules snippet follows this list.
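
As a concrete illustration, here is a minimal, hypothetical rules snippet along these lines. The file name and format vary by tool (Cursor reads project rules from `.cursor/rules/`; many agents read an `AGENTS.md`), and this sketch is illustrative rather than a template from the talk:

```markdown
## Bug-finding rules (illustrative)

- Prioritize security bugs: check the OWASP Top 10, especially injection,
  broken access control, and auth bypasses.
- Name the bug class in every report (e.g., "IDOR in /orders/:id",
  not "possible bug").
- Only report an issue if you can point to the exact lines and describe
  a concrete trigger; otherwise stay silent.
- Before declaring a bug fixed, write a regression test that fails on the
  old code, run the suite, and confirm the new test passes.
```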

Managing Context and Agent Limitations 04:21

  • Agents struggle with cross-repository navigation and often summarize or drop context as context-window limits are reached, reducing bug-detection capability.
  • Users should proactively manage agent context by supplying code diffs and ensuring key files remain within the context window (see the diff sketch after this list).
  • Requesting a component inventory (e.g., an index of classes, functions, and key variables) helps the agent build a map of the code and find bugs; a local sketch of the same idea follows below.
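
As a sketch of the diff-first tip above: assuming a git repository, you can pull a focused diff and paste it (plus the touched files) into the agent's context so the relevant logic stays inside the window. The helper name, default branch, and example path here are hypothetical:

```python
# Sketch: collect a focused diff to hand the agent instead of the whole
# repo, keeping the relevant change inside the context window.
import subprocess

def focused_diff(base: str = "main", paths: list[str] | None = None) -> str:
    """Return the diff against `base`, optionally limited to `paths`."""
    cmd = ["git", "diff", base, "--", *(paths or [])]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    print(focused_diff(paths=["src/auth/"]))  # hypothetical path
```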
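And for the component-inventory tip: you can ask the agent itself to produce the index, or build a rough one locally and paste it in. A minimal sketch using Python's standard `ast` module, assuming a Python codebase:

```python
import ast
import pathlib

def inventory(root: str) -> dict[str, list[str]]:
    """Index classes and functions per file as a rough 'component inventory'.

    Real projects may need smarter filtering (tests, vendored code,
    generated files); this is a starting point, not a complete tool.
    """
    index: dict[str, list[str]] = {}
    for path in pathlib.Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files the parser can't handle
        names = [
            f"{type(node).__name__}: {node.name}"
            for node in ast.walk(tree)
            if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef))
        ]
        if names:
            index[str(path)] = names
    return index

if __name__ == "__main__":
    for file, names in inventory(".").items():
        print(file)
        for name in names:
            print("  ", name)
```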

The Superiority and Limits of Thinking Models 05:30

  • “Thinking models” show greater success in identifying and following complex bug patterns within codebases.
  • These models produce explicit reasoning traces and can follow deeper chains of logic through the code, uncovering more complex issues.
  • There is considerable run-to-run variability: even thinking models do not take a fully holistic view, so results differ across runs.
  • Users often need to run an agent several times to assemble a comprehensive bug report, a known limitation of current technology; a sketch of aggregating runs follows this list.
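
A pragmatic workaround for that variability is to run the agent several times and triage the findings that recur. The sketch below assumes a `run_agent(repo)` callable returning finding strings; it is a hypothetical stand-in for whatever CLI or API your agent exposes:

```python
from collections import Counter
from typing import Callable, Iterable

def aggregate_findings(
    run_agent: Callable[[str], Iterable[str]],  # hypothetical agent invoker
    repo: str,
    runs: int = 5,
) -> list[tuple[str, int]]:
    """Union findings across several runs; recurring ones rank first."""
    counts: Counter[str] = Counter()
    for _ in range(runs):
        counts.update(run_agent(repo))  # e.g., ["auth bypass in login()", ...]
    return counts.most_common()
```

Findings that recur across most runs deserve attention first; one-off reports are more likely false positives, given the rates quoted earlier.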

Bismuth Capabilities and Benchmark Access 06:39

  • Bismuth offers automated PR creation, vulnerability scanning, and code review, and integrates with GitHub, GitLab, Jira, and Linear.
  • On-premises deployments are available for organizations with specific requirements.
  • A QR code links to the full benchmark, methodology, results, and data sets for public review.
  • Ian encourages viewers to access the benchmark for detailed insights on agent performance in finding and fixing bugs.