Initially, the team questioned whether AI, which often introduces bugs, could also be used to detect them
Early results from using AI (such as Claude) to review code and spot bugs were promising, including specific bugs caught in their own codebase that were highlighted on Twitter
However, relying on AI review can be frustrating when it produces surface-level or outright incorrect recommendations
Analyzed 10,000 code review comments from their own and open-source codebases, feeding them to various LLMs for classification and summarization
Identified recurring categories of code review feedback: logical bugs, accidentally committed code, performance and security concerns, documentation mismatches, and stylistic issues (a rough classification sketch follows)
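A minimal sketch of what that classification pass could look like, assuming the Anthropic Python SDK and a simple single-label prompt; the model alias, prompt wording, and helper names are illustrative assumptions, not the team's actual pipeline:

```python
# Sketch: bucket historical review comments into the categories listed above.
# The model alias and prompt are assumptions; any capable LLM would do.
from collections import Counter
from anthropic import Anthropic

CATEGORIES = [
    "logical bug",
    "accidentally committed code",
    "performance concern",
    "security concern",
    "documentation mismatch",
    "stylistic issue",
    "other",
]

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def classify_comment(comment: str) -> str:
    """Ask the LLM to label one review comment with exactly one category."""
    prompt = (
        "Classify the following code review comment into exactly one of these "
        f"categories: {', '.join(CATEGORIES)}.\n"
        "Reply with the category name only.\n\n"
        f"Comment: {comment}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=20,
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.content[0].text.strip().lower()
    return label if label in CATEGORIES else "other"


def summarize(comments: list[str]) -> Counter:
    """Tally category counts across a corpus of review comments."""
    return Counter(classify_comment(c) for c in comments)
```

Running summarize over the ~10,000-comment corpus would yield the per-category breakdown behind the list above.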
Noted that some issues, like "tribal knowledge," are beyond an LLM's capabilities because they rely on undocumented institutional knowledge typically held by senior developers
Also observed that LLMs tend to leave generic code-cleanliness or best-practice comments that teams may not actually want
Introduced upvote/downvote reactions so users could give direct feedback on AI-generated comments; an early spike in downvotes signaled that the AI's recommendations were overreaching
The downvote rate subsequently dropped below 4%, suggesting closer alignment with user expectations (a sketch of that metric follows)
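One way to track that signal over time, sketched under assumed data shapes; the Reaction record and its field names are hypothetical, not the product's actual schema:

```python
# Sketch: track upvote/downvote reactions on AI comments and watch the weekly downvote rate.
from dataclasses import dataclass
from datetime import date
from collections import defaultdict


@dataclass
class Reaction:
    comment_id: str
    vote: str          # "up" or "down"
    day: date


def weekly_downvote_rate(reactions: list[Reaction]) -> dict[tuple[int, int], float]:
    """Return {(iso_year, iso_week): downvotes / total reactions}."""
    totals: dict[tuple[int, int], list[int]] = defaultdict(lambda: [0, 0])  # [down, all]
    for r in reactions:
        key = tuple(r.day.isocalendar()[:2])
        totals[key][1] += 1
        if r.vote == "down":
            totals[key][0] += 1
    return {week: down / total for week, (down, total) in totals.items()}


# A sustained rate above ~4% would flag the kind of overreach the earlier spike revealed.
```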
Assessed success by measuring what percentage of code review comments actually led to code changes
Found that only about 50% of human code review comments result in changes within that pull request, with reasons including deferrals, purely informational comments, or differing opinions
Achieved a similar rate for AI-generated comments, with roughly 52% prompting real code changes as of March
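A hedged sketch of that "address rate" metric, assuming comments and diff hunks have already been pulled from the Git host's API; the dataclasses and field names here are assumptions for illustration:

```python
# Sketch: share of review comments whose anchored lines were later modified in the same PR.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ReviewComment:
    file: str
    line: int
    posted_at: datetime


@dataclass
class DiffHunk:
    file: str
    changed_lines: set[int]
    committed_at: datetime


def address_rate(comments: list[ReviewComment], hunks: list[DiffHunk]) -> float:
    """Fraction of comments followed by a change touching the commented line."""
    if not comments:
        return 0.0
    addressed = sum(
        1
        for c in comments
        if any(
            h.file == c.file
            and c.line in h.changed_lines
            and h.committed_at > c.posted_at
            for h in hunks
        )
    )
    return addressed / len(comments)
```

Keying on file and line position is a simplification; real comment anchors drift as the diff changes, so a production version would need to track line mappings across commits.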