Multi-Agent Architecture for Code Review: Why One LLM Call Isn't Enough

The simplest way to build an AI code reviewer is to send the diff to an LLM and ask it to find bugs. Most tools do exactly this. And it works — to a point.

The problem is that a single LLM call is trying to do too many things at once: understand the changes, identify what's risky, find bugs, avoid false positives, and write clear comments. Each of these tasks has different requirements, and optimizing for one often hurts another.

Reviewate uses a multi-agent pipeline powered by the Claude Agent SDK where each stage has a single job. Here's why this architecture produces better results.

The Single-Pass Ceiling

In our testing, a single LLM call for code review typically achieved:

~40% recall — catches about 4 out of 10 real bugs
~25% precision — only 1 in 4 findings is actionable
~15 findings per PR — most of which are noise

These numbers were consistent across the models and prompts we tested. Better prompting improves them incrementally, but the fundamental limitation remains: a single pass over the diff does not give the model enough context to distinguish real issues from hallucinations.
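For readers who want the metrics pinned down, here is a minimal sketch of how precision and recall are computed from labeled findings. The function name and the counts are illustrative assumptions chosen to land on the rough figures above; they are not the benchmark data.

```python
def precision_recall(
    true_positives: int, false_positives: int, false_negatives: int
) -> tuple[float, float]:
    """Return (precision, recall) for a set of review findings."""
    reported = true_positives + false_positives   # everything the reviewer flagged
    real_bugs = true_positives + false_negatives  # everything it should have flagged
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / real_bugs if real_bugs else 0.0
    return precision, recall


# Illustrative counts only: 16 findings reported, 4 of them match real bugs,
# and 6 real bugs were missed.
precision, recall = precision_recall(
    true_positives=4, false_positives=12, false_negatives=6
)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=25% recall=40%
```

Precision falls as noise is added to the report, while recall falls as real bugs go unreported, which is why the two can be tuned in separate stages.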

The Pipeline

Reviewate's pipeline has two core stages, surrounded by supporting agents:

1. Review

Job: Find candidate issues.

Two analyzer agents explore the codebase in parallel — each with access to code search tools (Read, Grep, Glob, Bash). They clone the repository, read the diff, search related code, and generate candidate findings: potential bugs, security issues, logic errors, and edge cases.

At this stage, we optimize for recall — catch as many real issues as possible, even at the cost of some false positives. It's easier to filter out false positives later than to recover missed bugs.

A synthesizer agent then merges findings from both analyzers, removing duplicates and resolving contradictions.
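The shape of this stage can be sketched roughly as follows. This is not Reviewate's code: `analyzer_a`, `analyzer_b`, `synthesize`, and the `Finding` type are hypothetical stand-ins, the analyzers return canned results instead of calling a model, and the merge step only removes near-identical duplicates (a real synthesizer agent would also resolve contradictions).

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    message: str


async def analyzer_a(diff: str) -> list[Finding]:
    # Hypothetical stand-in for an LLM-backed analyzer agent.
    await asyncio.sleep(0)
    return [
        Finding("api.py", 42, "Possible None dereference"),
        Finding("db.py", 7, "Query built with string concatenation"),
    ]


async def analyzer_b(diff: str) -> list[Finding]:
    # Second stand-in analyzer; it overlaps with the first on api.py:42.
    await asyncio.sleep(0)
    return [
        Finding("api.py", 42, "possible none dereference "),
        Finding("cache.py", 13, "Off-by-one in eviction loop"),
    ]


def synthesize(*batches: list[Finding]) -> list[Finding]:
    """Merge batches, keeping the first finding per (file, line, normalized message)."""
    seen: set[tuple[str, int, str]] = set()
    merged: list[Finding] = []
    for batch in batches:
        for finding in batch:
            key = (finding.file, finding.line, finding.message.strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append(finding)
    return merged


async def review(diff: str) -> list[Finding]:
    # Run both analyzers concurrently, then merge their candidates.
    batch_a, batch_b = await asyncio.gather(analyzer_a(diff), analyzer_b(diff))
    return synthesize(batch_a, batch_b)


candidates = asyncio.run(review("<diff text>"))
print(len(candidates))  # 3: the api.py finding is reported once
```

Running the analyzers with `asyncio.gather` keeps the stage's latency close to that of the slower analyzer rather than the sum of both.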

2. Fact-Check

Job: Verify each finding against the actual codebase.

This is the critical stage. The fact-checker receives each finding and has access to code search tools — it can grep the repository and read related code.

For each finding, it asks: "Can I find evidence in the actual code that this issue is real?" If it can't, the finding is discarded.

This is where precision jumps from ~30% to ~57%. The fact-checker eliminates hallucinations, misunderstandings about the codebase, and findings about code that's already handled correctly.
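The "no evidence, no finding" rule can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `Finding` fields, the `Verifier` type, and the toy snippet-matching verifier, which stands in for an LLM agent that greps and reads the real repository.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Finding:
    file: str
    message: str
    cited_code: str  # snippet the reviewer claims exists in `file`


# A verifier returns supporting evidence, or None when it finds none.
Verifier = Callable[[Finding], Optional[str]]


def fact_check(
    findings: list[Finding], verify: Verifier
) -> list[tuple[Finding, str]]:
    """Keep only findings for which the verifier found evidence."""
    verified = []
    for finding in findings:
        evidence = verify(finding)
        if evidence is not None:
            verified.append((finding, evidence))
    return verified


def make_snippet_verifier(repo: dict[str, str]) -> Verifier:
    """Toy verifier: the cited snippet must appear in the cited file."""

    def verify(finding: Finding) -> Optional[str]:
        source = repo.get(finding.file)
        if source is None:
            return None  # the finding refers to a file that does not exist
        for number, line in enumerate(source.splitlines(), start=1):
            if finding.cited_code in line:
                return f"{finding.file}:{number}: {line.strip()}"
        return None

    return verify


# In-memory stand-in for a cloned repository.
repo = {"api.py": "def get_user(uid):\n    user = db.find(uid)\n    return user.name\n"}
findings = [
    Finding("api.py", "user may be None", "return user.name"),
    Finding("api.py", "SQL injection in raw query", "cursor.execute("),  # not in the code
    Finding("auth.py", "token never expires", "expires=None"),  # no such file
]
kept = fact_check(findings, make_snippet_verifier(repo))
print(kept[0][1])  # api.py:3: return user.name
```

Returning the evidence alongside each surviving finding makes the verifier's decision auditable, which is useful when tuning the stage.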

Supporting Stages

Around these two core stages, additional agents handle context and polish:

Issue Explorer — Fetches linked issues from the PR description for context
Deduplication — Filters findings that duplicate existing human comments on the PR
Style — Rewrites surviving findings into concise, scannable markdown
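As an illustration of the deduplication idea, the sketch below drops findings that closely match a comment a human has already left. The text-similarity heuristic, the function names, and the 0.8 threshold are all assumptions for this example; an agent-based stage would compare meaning rather than characters.

```python
from difflib import SequenceMatcher


def is_duplicate(finding: str, comment: str, threshold: float = 0.8) -> bool:
    """Treat a finding as a duplicate when it is textually close to a comment."""
    ratio = SequenceMatcher(None, finding.lower(), comment.lower()).ratio()
    return ratio >= threshold


def drop_already_reported(
    findings: list[str], human_comments: list[str], threshold: float = 0.8
) -> list[str]:
    """Return the findings that no existing human comment already covers."""
    return [
        finding
        for finding in findings
        if not any(is_duplicate(finding, c, threshold) for c in human_comments)
    ]


human_comments = ["missing null check on user before accessing name"]
findings = [
    "Missing null check on user before accessing name",
    "Off-by-one in eviction loop",
]
print(drop_already_reported(findings, human_comments))
# ['Off-by-one in eviction loop']
```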

Why Separate Agents?

The key architectural decision is using separate agents instead of a single agent with multiple passes. There are three reasons:

1. Different Optimization Targets

The review stage optimizes for recall (catch everything). The fact-check stage optimizes for precision (eliminate false positives). These are opposing objectives — trying to do both in one pass forces a compromise that produces mediocre results on both.

2. Different Tool Usage

Both the review agents and the fact-checker have access to the same code search tools (grep, file reads); what differs is how they use them. The reviewers explore broadly to find candidate issues. The fact-checker focuses narrowly on verifying specific claims. Because it runs as a separate agent, the fact-checker starts with fresh context and isn't biased by the reviewers' reasoning.

3. Different Model Requirements

Not every stage needs the same model. Issue exploration, synthesis, and styling can use smaller, faster models. The analyzers and fact-checker benefit from stronger reasoning models. This two-tier approach lets you optimize cost vs. quality per stage.
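One way to express this per-stage choice is a small mapping from stage to model tier. The sketch below is an assumption about how such configuration could look, not Reviewate's actual config format; the stage keys and model names are placeholders.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    FAST = "fast"      # smaller, cheaper models
    STRONG = "strong"  # stronger reasoning models


# Placeholder model names (assumptions): substitute real identifiers.
TIER_MODELS = {
    Tier.FAST: "small-fast-model",
    Tier.STRONG: "strong-reasoning-model",
}

# Tier assignment following the split described above.
STAGE_TIERS = {
    "issue_explorer": Tier.FAST,
    "analyzer": Tier.STRONG,
    "synthesizer": Tier.FAST,
    "fact_checker": Tier.STRONG,
    "style": Tier.FAST,
}


def model_for(stage: str, overrides: Optional[dict[str, str]] = None) -> str:
    """Resolve the model for a stage, letting explicit overrides win."""
    if overrides and stage in overrides:
        return overrides[stage]
    return TIER_MODELS[STAGE_TIERS[stage]]


print(model_for("fact_checker"))  # strong-reasoning-model
print(model_for("style"))         # small-fast-model
```

Keeping the tier assignment separate from the model names means a model upgrade is a one-line change, while an override lets you experiment on a single stage.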

The Cost of Complexity

A multi-agent pipeline is more complex than a single LLM call. There are more moving parts, more configuration options, and more things that can go wrong.

The trade-off is worth it when:

Precision matters — if your team will ignore noisy findings, a single-pass tool is effectively useless
You review many PRs — the added complexity is a one-time engineering cost, while per-PR costs are comparable to single-pass tools
You need verification — for security-critical code, hallucinated findings can be worse than no findings

Practical Numbers

The results below come from the Augment Code benchmark: 50 PRs with confirmed bugs across Sentry, Grafana, Greptile, Cal.com, and Discourse. All results use Gemini 3 Flash. The single-pass baseline numbers reflect our own testing:

Architecture	Recall	Precision	Findings/PR	Time
Single-pass (our testing)	~40%	~25%	~15	~1 min
Multi-agent (Reviewate)	65.7%	57.3%	~5	< 3 min

The multi-agent pipeline catches more bugs with fewer false positives. The cost at review time is up to two extra minutes of latency, which rarely matters for asynchronous PR reviews.

Building Your Own

If you want to experiment with multi-agent code review, Reviewate is fully open source. The pipeline is configurable:

Swap models per stage
Adjust review prompts for your conventions

The architecture is intentionally modular — each stage is an independent agent that can be tested, tuned, and replaced independently.
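To make the modularity concrete, here is a minimal sketch of a pipeline whose stages share one signature, so any stage can be tested alone or swapped out. The `Stage` type, the stage functions, and the string-based findings are hypothetical simplifications, not Reviewate's interfaces.

```python
from typing import Callable

Findings = list[str]
Stage = Callable[[Findings], Findings]


def run_pipeline(stages: dict[str, Stage], findings: Findings) -> Findings:
    """Pass findings through each stage in insertion order."""
    for stage in stages.values():
        findings = stage(findings)
    return findings


def fact_check(findings: Findings) -> Findings:
    # Stand-in filter: drop anything marked as unverified.
    return [f for f in findings if not f.startswith("unverified:")]


def style(findings: Findings) -> Findings:
    # Stand-in formatter: render each finding as a list item.
    return [f"- {f}" for f in findings]


def terse_style(findings: Findings) -> Findings:
    # Alternative formatter used to show a stage being replaced.
    return [f.upper() for f in findings]


default_stages = {"fact_check": fact_check, "style": style}
# Replace one stage without touching the others; key order is preserved.
custom_stages = {**default_stages, "style": terse_style}

raw = ["null deref in api.py", "unverified: race in cache.py"]
print(run_pipeline(default_stages, raw))  # ['- null deref in api.py']
print(run_pipeline(custom_stages, raw))   # ['NULL DEREF IN API.PY']
```

Because each stage is a plain callable, it can be unit-tested with hand-written inputs before it is wired into the full pipeline.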

Explore the architecture yourself. View the source on GitHub or read the quickstart guide.