Initially, the team questioned whether AI, which often introduces bugs, could also be used to detect them
Early results from using AI (such as Claude) to review code and spot bugs were promising, including specific bugs caught in their own codebase that were highlighted on Twitter
However, relying on AI review can be frustrating when it produces surface-level or outright incorrect recommendations
Analyzed 10,000 code review comments from their own and open-source codebases, feeding them to various LLMs for classification and summarization
Identified recurring categories of code review feedback: logical bugs, accidentally committed code, performance and security concerns, documentation mismatches, and stylistic issues (a rough classification sketch follows)
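A minimal sketch of what that classification pass could look like, assuming the Anthropic Python SDK and a simple single-label prompt; the model alias, prompt wording, and helper names are illustrative assumptions, not the team's actual pipeline:

```python
# Sketch: bucket historical review comments into the categories listed above.
# The model alias and prompt are assumptions; any capable LLM would do.
from collections import Counter
from anthropic import Anthropic

CATEGORIES = [
    "logical bug",
    "accidentally committed code",
    "performance concern",
    "security concern",
    "documentation mismatch",
    "stylistic issue",
    "other",
]

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def classify_comment(comment: str) -> str:
    """Ask the LLM to label one review comment with exactly one category."""
    prompt = (
        "Classify the following code review comment into exactly one of these "
        f"categories: {', '.join(CATEGORIES)}.\n"
        "Reply with the category name only.\n\n"
        f"Comment: {comment}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model name
        max_tokens=20,
        messages=[{"role": "user", "content": prompt}],
    )
    label = response.content[0].text.strip().lower()
    return label if label in CATEGORIES else "other"


def summarize(comments: list[str]) -> Counter:
    """Tally category counts across a corpus of review comments."""
    return Counter(classify_comment(c) for c in comments)
```

Running summarize over the ~10,000-comment corpus would yield the per-category breakdown behind the list above.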
Noted that some issues, like "tribal knowledge," are beyond an LLM's capabilities because they rely on undocumented institutional knowledge typically held by senior developers
Also observed that LLMs tend to leave generic code-cleanliness or best-practice comments that teams may not actually want
Introduced upvote/downvote reactions so users could give direct feedback on AI-generated comments; an early spike in downvotes signaled that the AI's recommendations were overreaching
The downvote rate subsequently dropped below 4%, suggesting closer alignment with user expectations (a sketch of that metric follows)
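One way to track that signal over time, sketched under assumed data shapes; the Reaction record and its field names are hypothetical, not the product's actual schema:

```python
# Sketch: track upvote/downvote reactions on AI comments and watch the weekly downvote rate.
from dataclasses import dataclass
from datetime import date
from collections import defaultdict


@dataclass
class Reaction:
    comment_id: str
    vote: str          # "up" or "down"
    day: date


def weekly_downvote_rate(reactions: list[Reaction]) -> dict[tuple[int, int], float]:
    """Return {(iso_year, iso_week): downvotes / total reactions}."""
    totals: dict[tuple[int, int], list[int]] = defaultdict(lambda: [0, 0])  # [down, all]
    for r in reactions:
        key = tuple(r.day.isocalendar()[:2])
        totals[key][1] += 1
        if r.vote == "down":
            totals[key][0] += 1
    return {week: down / total for week, (down, total) in totals.items()}


# A sustained rate above ~4% would flag the kind of overreach the earlier spike revealed.
```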
Assessed success by measuring what percentage of code review comments actually led to code changes
Found that only about 50% of human code review comments result in changes within that pull request, with reasons including deferrals, purely informational comments, or differing opinions
Achieved a similar rate for AI-generated comments, with roughly 52% prompting real code changes as of March
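A hedged sketch of that "address rate" metric, assuming comments and diff hunks have already been pulled from the Git host's API; the dataclasses and field names here are assumptions for illustration:

```python
# Sketch: share of review comments whose anchored lines were later modified in the same PR.
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ReviewComment:
    file: str
    line: int
    posted_at: datetime


@dataclass
class DiffHunk:
    file: str
    changed_lines: set[int]
    committed_at: datetime


def address_rate(comments: list[ReviewComment], hunks: list[DiffHunk]) -> float:
    """Fraction of comments followed by a change touching the commented line."""
    if not comments:
        return 0.0
    addressed = sum(
        1
        for c in comments
        if any(
            h.file == c.file
            and c.line in h.changed_lines
            and h.committed_at > c.posted_at
            for h in hunks
        )
    )
    return addressed / len(comments)
```

Keying on file and line position is a simplification; real comment anchors drift as the diff changes, so a production version would need to track line mappings across commits.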