AI-powered entomology: Lessons from millions of AI code reviews — Tomas Reimers, Graphite

Introduction & Background 00:00

  • Tomas Reimers introduces himself as co-founder of Graphite and presents "AI-powered entomology," a talk about using AI to find bugs in code
  • Graphite’s main product is Diamond, an AI-powered code reviewer compatible with GitHub
  • The volume of AI-generated code is increasing, and so is the corresponding number of bugs

AI for Finding Bugs 00:44

  • The team initially asked whether AI, which introduces bugs, could also be used to detect them
  • Early experiments using an LLM (Claude) to review code and spot bugs were promising, surfacing real bugs in their own codebase that they highlighted on Twitter (see the sketch after this list)
  • However, relying on AI review proved frustrating when it produced surface-level or incorrect recommendations
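
As a rough illustration of that early experiment, the sketch below asks an LLM to review a diff and flag likely bugs. It assumes the Anthropic Python SDK; the prompt wording, model alias, and the review_diff helper are illustrative, not Graphite's actual implementation.

```python
# Minimal sketch of the early experiment described above: ask an LLM to
# review a diff and point out likely bugs. The prompt wording and model
# choice are assumptions, not Graphite's implementation.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def review_diff(diff: str) -> str:
    """Ask the model to flag likely bugs in a unified diff."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "You are a code reviewer. Identify likely bugs in this diff. "
                "Only report issues you are confident about.\n\n" + diff
            ),
        }],
    )
    return response.content[0].text
```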

Limitations & Categorization of AI Feedback 01:44

  • AI-generated feedback can be technically accurate yet unhelpful or even annoying; the same comment often lands better coming from a human
  • The team observed that developers find certain comments less acceptable from an AI than from a human reviewer
  • They identified two axes: whether the AI can catch an issue, and whether humans are willing to accept that comment from an AI (illustrated in the sketch after this list)
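
One way to picture the two axes is as a simple 2x2 filter: only comments the AI can reliably catch and that developers accept from a bot are worth surfacing. The category names and flags below are invented for illustration, mapping the examples from the talk onto the two axes.

```python
# Hypothetical 2x2 view of the two axes described above. Only the
# intersection (catchable by AI AND accepted from AI) is worth surfacing.
from dataclasses import dataclass

@dataclass
class CommentCategory:
    name: str
    ai_can_catch: bool       # axis 1: within the model's capability
    accepted_from_ai: bool   # axis 2: humans tolerate it coming from a bot

CATEGORIES = [
    CommentCategory("logical bug", ai_can_catch=True, accepted_from_ai=True),
    CommentCategory("tribal knowledge", ai_can_catch=False, accepted_from_ai=True),
    CommentCategory("generic style nit", ai_can_catch=True, accepted_from_ai=False),
]

worth_surfacing = [c.name for c in CATEGORIES if c.ai_can_catch and c.accepted_from_ai]
print(worth_surfacing)  # ['logical bug']
```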

Analyzing Code Review Comments 03:55

  • Analyzed 10,000 comments from their own and open-source codebases, feeding them to various LLMs for classification and summarization (a classification sketch follows this list)
  • Identified categories of code review feedback: logical bugs, accidentally committed code, performance and security concerns, documentation mismatches, stylistic issues
  • Noted that some issues, such as "tribal knowledge," are beyond an LLM’s capabilities because they rely on undocumented, institutional knowledge typically held by senior developers
  • Also observed that LLMs tend to leave generic code cleanliness or best-practice comments that are not always wanted
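
The classification step might look like the sketch below: each human review comment is sent to an LLM along with the category list, and the model returns a single label. The taxonomy mirrors the categories named above; the prompt wording and the classify_comment helper are assumptions.

```python
# Sketch of the classification step: ask an LLM to assign one label to
# each human code review comment. The label set comes from the talk; the
# prompt and helper are illustrative assumptions.
import anthropic

LABELS = [
    "logical bug",
    "accidentally committed code",
    "performance concern",
    "security concern",
    "documentation mismatch",
    "stylistic issue",
    "tribal knowledge",
    "other",
]

client = anthropic.Anthropic()

def classify_comment(comment: str) -> str:
    """Return the single best-fitting label for one review comment."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Classify this code review comment into exactly one of: "
                + ", ".join(LABELS)
                + ". Reply with the label only.\n\nComment: " + comment
            ),
        }],
    )
    return response.content[0].text.strip().lower()
```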

What Makes “Good” AI Code Review? 06:33

  • Focused on AI comments that are both within the AI’s capability and valued by human reviewers
  • Improved prompts to encourage this targeted feedback, which increased user satisfaction (see the prompt sketch after this list)
  • Tracked the types of comments the AI made and continually reviewed them to ensure they remained relevant and helpful
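
The prompt targeting could be as simple as a system prompt that whitelists the valued comment types and forbids the rest. The wording below is an illustrative guess, not Graphite's production prompt.

```python
# Illustrative system prompt encoding the targeting described above:
# allow only comment types that are both catchable by the model and
# welcomed by developers when they come from an AI.
SYSTEM_PROMPT = """\
You are an AI code reviewer. Comment ONLY on:
- logical bugs
- accidentally committed code (debug statements, secrets, dead code)
- performance and security problems
- mismatches between code and its documentation

Do NOT comment on:
- generic style or "best practice" suggestions
- anything requiring team-specific knowledge you cannot verify from the diff

If nothing qualifies, leave no comments.
"""
# With the Anthropic SDK, this string would be passed as the `system`
# parameter to client.messages.create(...).
```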

Measurement & Success Metrics 07:39

  • Introduced upvote/downvote reactions for users to provide direct feedback on AI-generated comments; a spike in downvotes indicated overreach in AI recommendations
  • The downvote rate dropped to below 4%, suggesting increased alignment with user expectations
  • Assessed success by measuring what percentage of code review comments actually led to code changes (see the metrics sketch after this list)
  • Found that only about 50% of human code review comments result in changes to that pull request; reasons include deferrals to follow-up work, purely informational comments, and differences of opinion
  • Achieved a similar rate (~52% in March) for AI-generated comments prompting real changes
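
Both metrics are straightforward to compute from a log of comments. The record fields below are hypothetical, and the target values in the comments mirror the figures from the talk.

```python
# Sketch of the two success metrics mentioned above, computed over a log
# of AI comments. CommentRecord and its fields are hypothetical.
from typing import TypedDict

class CommentRecord(TypedDict):
    downvoted: bool
    upvoted: bool
    led_to_change: bool  # did the PR change in response to this comment?

def downvote_rate(records: list[CommentRecord]) -> float:
    """Fraction of AI comments that users downvoted."""
    return sum(r["downvoted"] for r in records) / len(records)

def addressed_rate(records: list[CommentRecord]) -> float:
    """Fraction of AI comments that led to a code change in that PR."""
    return sum(r["led_to_change"] for r in records) / len(records)

# Targets from the talk: downvote_rate below ~4%, addressed_rate around
# the ~50% baseline observed for human reviewers (~52% in March).
```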

Conclusion & Product Invitation 10:01

  • Concludes that using LLMs to detect bugs and give code review feedback is viable when properly targeted
  • Invites attendees to try the Diamond product and visit their booth for more information