Every pull request that sits in a queue for four hours is a small tax on a team's momentum. A code-review agent pays that tax for you: the moment a PR opens, it gets a careful first read — the tests run, the risky lines get flagged, and the human reviewer walks in already pointed at the part that matters. That payoff is why this is one of the most satisfying agents to build. It's also one of the easiest to build badly, because the obvious version produces a wall of noise nobody reads. Here's how I'd structure one that people actually keep on.
First, the distinction that shapes everything else. A skill does one bounded thing — "summarize this diff," "check for missing tests." An agent is handed a goal and the autonomy to chase it: read files, run tools, check its own work, loop until it's satisfied. Reviewing a PR is squarely agent territory, because you don't know in advance how many files it needs to open or whether the tests will even pass. You're delegating judgment, not a single function call.
Start with the job, not the tools
The goal isn't "comment on the code." It's "catch the thing a tired reviewer would miss at 6pm on a Friday." Write that sentence down somewhere the agent can see it, because it quietly reorders every priority. Correctness and footguns rise to the top. Naming opinions, import order, and whether you prefer map over a for loop sink to the bottom where they belong.
Concretely, here's the split I want the agent to internalize.
Worth a comment:
- A null or empty case that isn't handled and will be hit in normal use.
- A database query inside a loop, or a query with no index behind it.
- An error that's caught and silently swallowed.
- A change to auth, permissions, or money math.
- A test that was deleted instead of fixed.
Not worth a comment:
- Variable names you'd have chosen differently.
- A function that's eight lines instead of your preferred five.
- Formatting a linter already owns.
- Personal style preferences dressed up as standards.
If an agent can't tell these two columns apart, it isn't a reviewer. It's a very fast way to generate noise.
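One way to make the two columns concrete is a small lookup that tags each finding by kind before anything is posted. This is only a sketch: the kind names and the triage helper are hypothetical labels for the bullets above, not part of any real tool, and anything the lookup doesn't recognize is handed back to the agent's judgment rather than decided by the table.

```javascript
// Hypothetical finding kinds, one per bullet in the two lists above.
const WORTH_A_COMMENT = new Set([
  "unhandled-null-or-empty",
  "query-in-loop",
  "unindexed-query",
  "swallowed-error",
  "auth-permissions-or-money",
  "deleted-test",
]);

const NOT_WORTH_A_COMMENT = new Set([
  "naming-preference",
  "function-length-preference",
  "linter-owned-formatting",
  "personal-style",
]);

// Returns "comment", "skip", or "needs-judgment" for kinds in neither column.
function triage(kind) {
  if (WORTH_A_COMMENT.has(kind)) return "comment";
  if (NOT_WORTH_A_COMMENT.has(kind)) return "skip";
  return "needs-judgment";
}

const sampleFindings = [
  { kind: "swallowed-error", line: 42 },
  { kind: "naming-preference", line: 7 },
  { kind: "deleted-test", line: 118 },
];

const kept = sampleFindings.filter((f) => triage(f.kind) === "comment");
console.log(kept.map((f) => f.kind)); // [ 'swallowed-error', 'deleted-test' ]
```

A table like this is a floor, not the judgment itself. It stops the obvious style noise from ever being posted and leaves the interesting cases to the agent.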
The loop
A review agent is a small, well-behaved loop. Nothing exotic — the value is in what it does at each step, not the control flow.
- Read the diff, then read the surrounding files it touches. A diff out of context lies constantly.
- Run the test suite and actually read the output, including failures and skipped tests.
- Form an opinion: what's broken, what's risky, what's merely style.
- Write comments that point at specific lines with specific reasons.
- Summarize the overall risk in two sentences a busy human can read in five seconds.
    goal: surface real risk in this PR
    tools: read_file, run_tests, post_comment

    for file in changed_files:
        diff = read_file(file)
        context = read_file(neighbors_of(file))
        risks += assess(diff, context)

    test_output = run_tests()
    risks += assess_failures(test_output)

    for r in top_n(risks, 5):
        post_comment(r.line, r.reason, r.suggested_fix)

    post_summary(two_sentence_risk_assessment(risks))

    stop when: every changed file has been considered
    # note: never call approve() or merge(); those tools don't exist here

That last comment is not decoration. The fastest way to keep an agent on the right side of the line is to never hand it the tool that crosses it.
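The idea of never handing the agent the wrong tool can be sketched as a registry that simply has no approve or merge entry. Everything here is an assumption for illustration: the stub tool bodies, the makeTools and callTool helpers, and the canned test counts stand in for whatever your agent host actually provides.

```javascript
// Stub tools for illustration; a real agent would wire these to its host.
function makeTools(posted) {
  return {
    read_file: (path) => `contents of ${path}`,
    run_tests: () => ({ passed: 12, failed: 0, skipped: 1 }),
    post_comment: (line, reason, fix) => {
      posted.push({ line, reason, fix });
    },
  };
}

// The agent reaches tools only through this function, so a tool that
// is not in the registry cannot be called, whatever the model asks for.
function callTool(tools, name, ...args) {
  if (!Object.prototype.hasOwnProperty.call(tools, name)) {
    throw new Error(`tool not available: ${name}`);
  }
  return tools[name](...args);
}

const posted = [];
const tools = makeTools(posted);

callTool(tools, "post_comment", 44, "total can become NaN", "guard the return");
console.log(posted.length); // 1

try {
  callTool(tools, "merge");
} catch (err) {
  console.log(err.message); // tool not available: merge
}
```

The design choice is that the restriction lives in the registry rather than in the prompt. An instruction can be ignored; a missing function cannot be called.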
Give it taste, not a rulebook
This is where most review agents live or die. A linter already catches formatting; if your agent's whole personality is a list of style rules, you've built a slower linter. The thing buyers can't get from a free template is judgment — the missing null check, the N+1 query, the race condition that only shows up under load.
You build that judgment by feeding the agent the bugs that have actually hurt you, not an abstract checklist. "Here are six incidents from our last year and the one-line change that would have prevented each." That's worth more than two hundred rules, because it teaches the agent what risk smells like instead of asking it to pattern-match strings.
A real bug, caught
Make this concrete. A PR refactors a checkout function. The diff looks clean — a few lines moved, a helper extracted, tests still green. Here's the change:
    - const total = subtotal - discount
    + const total = subtotal - applyDiscount(discount)

The linter is happy. The tests pass, because every test fixture happens to use a percentage discount. But the agent opened applyDiscount and read it, and noticed the function returns undefined when the discount type is fixed rather than percent. Subtract undefined from a number in JavaScript and you get NaN. So the moment a real customer used a fixed-amount coupon, their order total became NaN, the charge silently failed, and the order still got created.
No test caught it because no fixture exercised the fixed-discount path. A human skimming a green PR at 6pm wouldn't catch it either. The agent did — not because it had a rule about coupons, but because it read the function the diff called and traced what came back. The comment it left:
applyDiscount returns undefined for type: 'fixed', so total becomes NaN here and the charge will fail silently. There's no test covering fixed discounts. Add one, and guard the return.
That is the entire job in one comment: specific line, specific failure mode, specific fix, and a note about the missing test.
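To see the failure in isolation, here is a minimal reproduction. The body of applyDiscount is never shown in the PR excerpt, so this version is a guess at its shape: the discount object fields (type, percent, base, amount) and the guarded variant are assumptions made up for the sketch.

```javascript
// Hypothetical helper body. The "percent" branch returns a number; there is
// no "fixed" branch, so the function falls through and returns undefined.
function applyDiscount(discount) {
  if (discount.type === "percent") {
    return (discount.base * discount.percent) / 100;
  }
}

// A guarded variant: handle "fixed" and refuse unknown types loudly
// instead of letting undefined leak into the arithmetic.
function applyDiscountGuarded(discount) {
  if (discount.type === "percent") {
    return (discount.base * discount.percent) / 100;
  }
  if (discount.type === "fixed") {
    return discount.amount;
  }
  throw new Error(`unknown discount type: ${discount.type}`);
}

const subtotal = 200;
const percentOff = { type: "percent", percent: 10, base: subtotal };
const fixedOff = { type: "fixed", amount: 15 };

console.log(subtotal - applyDiscount(percentOff)); // 180
console.log(subtotal - applyDiscount(fixedOff)); // NaN
console.log(subtotal - applyDiscountGuarded(fixedOff)); // 185
```

The second line is the bug from the PR. The third is what the missing fixed-discount test should have asserted.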
Make it explain itself
A comment that says "this is wrong" is useless and slightly insulting. A comment that says "this throws when items is empty, which happens on first render — guard it" is a gift. Require the agent to always pair a flag with a reason and a suggested fix. Three parts, every time: what, why, and what to do about it. If it can't articulate the why, it probably doesn't have a real finding — and that's a useful filter on its own.
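The three-part rule is easy to enforce mechanically. The sketch below assumes a hypothetical finding object with what, why, and fix fields, and a formatFinding helper that refuses to produce a comment when any part is missing.

```javascript
// Hypothetical finding shape: what, why, and fix must all be non-empty strings.
function formatFinding(finding) {
  const missing = ["what", "why", "fix"].filter(
    (key) => typeof finding[key] !== "string" || finding[key].trim() === ""
  );
  if (missing.length > 0) {
    // No reason or no fix: treat it as not a real finding and post nothing.
    return null;
  }
  return `${finding.what} ${finding.why} Suggested fix: ${finding.fix}`;
}

const good = formatFinding({
  what: "This throws when items is empty.",
  why: "That happens on first render.",
  fix: "guard it.",
});
console.log(good);
// This throws when items is empty. That happens on first render. Suggested fix: guard it.

console.log(formatFinding({ what: "This is wrong." })); // null
```

Returning null rather than a partial comment is the filter the paragraph describes: a flag with no why never reaches the reviewer.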
Where the human stays
An agent can review code. It should not be the one that approves and merges it.
Keep the human on the merge button. Not as a courtesy — as architecture. The agent's job is to make the human's review faster and sharper, not to replace the human's name in the merge log. Remember the model: a one-time-purchase agent you download and run locally is powerful precisely because it acts on its own. That same autonomy is exactly why you don't wire it to approve and merge. The first time an agent confidently green-lights a change that takes down production at 2am, you will be deeply grateful you drew this line in ink.
Review is advisory. Merging is a human decision with a human accountable for it. Build it that way and never apologize for it.
Keeping it from being annoying
The failure mode of a review agent isn't being wrong. It's being exhausting. A few rules keep it welcome:
- Cap the nitpicks. Five sharp comments beat fifty noisy ones, and the team will read all five.
- Let it say "this looks fine." Silence is a valid review, and an agent that approves of nothing teaches people to ignore it.
- Lead with the one thing that matters most. If the summary buries the NaN bug under three style notes, you've failed.
- Tune the volume down over time, not up. The instinct is always to add more checks. Resist it.
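The cap and the lead-with-the-worst rule can be sketched as a sort and a slice. The numeric severity field and both helper names are assumptions for illustration; how an agent actually scores risk is up to you.

```javascript
// Hypothetical severity scale: higher means more likely to hurt in production.
const MAX_COMMENTS = 5;

function selectComments(findings, limit = MAX_COMMENTS) {
  // Copy before sorting so the caller's array is left untouched.
  return [...findings].sort((a, b) => b.severity - a.severity).slice(0, limit);
}

function summarize(findings) {
  const top = selectComments(findings, 1)[0];
  return top ? `Biggest risk: ${top.what}` : "This looks fine.";
}

const reviewFindings = [
  { what: "import order differs from the rest of the file", severity: 1 },
  { what: "total becomes NaN for fixed discounts", severity: 9 },
  { what: "helper name is vague", severity: 1 },
];

console.log(summarize(reviewFindings)); // Biggest risk: total becomes NaN for fixed discounts
console.log(summarize([])); // This looks fine.
```

Tuning the volume down then means lowering MAX_COMMENTS or raising the bar for what counts as a finding, not adding more checks.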
Why this sells
Teams will happily pay — once — for an agent that makes every review better and frees their senior people for the genuinely hard calls. The judgment is the product. Anyone can ship a wrapper that posts lint warnings. What people download and keep is the agent that reads the function the diff actually calls, catches the NaN before a customer does, explains itself in one clear comment, and then hands the merge decision back to a person. Build that, keep the human on the button, and you've made something a team trusts a little more with every PR it gets right.
Devon Park
Developer Advocate
Writing for the Skillmint blog on how people build, price, and put Claude Skills & Agents to work.