Few things give me more dread than reviewing the mediocre code written by an overconfident LLM, but arguing in a PR with an overconfident LLM that its review comments are wrong is up there.
I can’t agree more. I’m torn on LLM code reviews. On the one hand, I think it’s a use case that makes a lot of sense: they can quickly catch silly human errors like misspelled variables and whatnot.
On the other hand, the amount of flip-flopping they go through is unreal. I’ve witnessed numerous instances where either Cursor’s Bugbot or Claude has found a bug and recommended a reasonable fix. The fix gets implemented, and then the LLM argues the case against the fix and requests that the code be reverted. Out of curiosity to see what happens, I’ve reverted the code, only to be given the exact same recommendation as in the first pass.
I can foresee this becoming a circus for less experienced devs, so I turned off the auto code reviews and stuck them in request-only mode with a GitHub Action. That way I can retain some semblance of sanity and keep the PR comment history from becoming cluttered with overly verbose comments from an agent.
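For reference, the request-only setup looks roughly like this (a minimal sketch, not our exact config; the "/llm-review" trigger phrase and the run-review.sh script are placeholders for whichever reviewer you wire in):

    # .github/workflows/on-demand-llm-review.yml
    # Run an LLM review only when someone asks for it in a PR comment,
    # instead of on every push.
    name: on-demand-llm-review
    on:
      issue_comment:
        types: [created]
    jobs:
      review:
        # Fire only for comments on PRs that contain the trigger phrase.
        if: ${{ github.event.issue.pull_request && contains(github.event.comment.body, '/llm-review') }}
        runs-on: ubuntu-latest
        steps:
          # Note: this checks out the default branch; a real setup would
          # fetch the PR head before handing it to the reviewer.
          - uses: actions/checkout@v4
          - name: Run the review agent
            run: ./scripts/run-review.sh "${{ github.event.issue.number }}"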
The purpose of these reviewers is to flag the bug to you. You still need to read the surrounding code and see whether it’s valid, serious, and worth a fix. Why does it matter if it then says the opposite after the fix? Did it even happen often, or is this just an anecdote about a one-time thing?
It’s like a linter with conflicting rules (can’t use tabs, rewrite to spaces; can’t use spaces, rewrite to tabs). Something that runs itself in circles, and can also block a change unless the comment is resolved, simply adds noise, and a bot that contradicts itself does not add confidence to a change.
I have no problem accepting the odd comment that actually highlights a flaw and dismissing the rest, because I can use discretion: I understand what it has pointed out and whether it’s legit.
The dread is explaining this to someone less experienced, because it’s not helpful to just tell them to use their gut. So I end up highlighting the comments that are legit and pointing out the ones that aren’t, to show how I’m approaching them.
It turns out that this is a waste of time, nobody learns anything from it (because they’re using an LLM to write the code anyway), and it’s better to just disable the integration and maybe run a review tool locally if you care. I would say that all of this has made my responsibility as a mentor much more difficult.
The battle I am fighting at the moment is that our glorious engineering team, who are the lowest-bidding external outsourcer, make the LLM spew look pretty good. The reality, of course, is that both are terrible, but no one wants to hear that; they only want to hear that the LLM is better than the humans. And that's only because it's the narrative they need to maintain.
Relative quality is better but the absolute quality is not. I only care about absolute quality.
Do you have actual experience with Bugbot? It's live in our org and is actually pretty good: almost none of its comments are frivolous or wrong, and it finds genuine bugs most reviewers miss. That's unlike Graphite and Copilot, so no one's glazing AI for AI's sake.
Bugbot is now a valuable part of our software development process. If you have genuine examples showing that we are just being delusional or simply haven't hit a roadblock yet, I would love to see them.
I assume that this is the same as when Cursor spontaneously decides to show code review comments in the IDE as part of some upsell? In that case yes I’m familiar and they were all subtly wrong.
[flagged]
You can't ask people about their personal experience and then deny them the right to answer.
Wait, so Cursor has multiple code review products? I dunno man, if they market the bad one at me and don’t tell me about the good one then I don’t think that’s my fault.
The biggest problem with LLM reviews, for me, is not false positives but authority. Younger devs are used to accepting bot comments as the ultimate truth, even when they are clearly questionable.
I alluded to it in a separate comment but the problem I have here is that it is really hard to get through to them on this too.
Upskilling a junior dev used to require you to spend time in the code with them, sharing knowledge, pairing, and the like. LLMs have abstracted a good part of that away and in doing so broken a line of communication. While there are still many other topics a mentor can tackle, the one most relevant to a junior just starting out is effective programming, and now they are more likely to disappear into Claude Code for extended stretches than to reach out for help.
This is difficult to work with, because you'll need to do more frequent check-ins, akin to managing. And coaching someone through a prompt and a fancy MCP setup isn't the same as walking through a codebase, giving context, advising on idiomatic language use, and the like.
Yes, I've found some really interesting bugs using LLM feedback, but it's about a 40% accuracy rate, and mostly it highlights things that are noncritical (for example, we don't need to worry about portability in a single-architecture app that runs on a specific OS).
I've found Bugbot to be shockingly effective at finding bugs in my PRs. Even when it's wrong, it's usually worth adding a comment, since it's the kind of mistake a human reviewer would make.