Six AI agents debate your code - and produce reviews no single model could.
I call it a "Council." The Architect reviews structure. The Sentinel hunts security bugs. The Optimizer finds performance issues. The Maintainer checks test coverage. The Verifier challenges everything. And the Moderator synthesizes the verdict.
The Problem
Single-model code reviews always have blind spots. Ask GPT-4 to review code and it might catch architecture issues but miss a SQL injection. Ask it to focus on security and it might miss performance bottlenecks.
One perspective, no matter how powerful, isn't comprehensive.
I wanted reviews that covered all perspectives - without asking users to run five different prompts and synthesize the results themselves.
The First Approach: Local Models
I started with Ollama running locally: Llama 3.1, Mistral 7B, DeepSeek Coder.
The output quality was... inconsistent. Agents would miss obvious bugs. Findings were vague - "consider improving error handling" without specifying where. JSON parsing failed constantly because models would add commentary around the JSON.
The fundamental problem: these models are good at conversation but not reliable at structured, high-stakes analysis.
The Migration to Cloud
I switched to Ollama Cloud, gaining access to 120B-671B parameter models:
| Agent | Model | Parameters | Focus |
|---|---|---|---|
| Moderator | GPT-OSS | 120B | Orchestration, final verdict |
| Architect | GPT-OSS | 120B | Structure, patterns, readability |
| Sentinel | DeepSeek v3.1 | 671B | Security, bugs, edge cases |
| Optimizer | Qwen3-Coder | 480B | Performance, complexity |
| Maintainer | Devstral-2 | 123B | Tests, DX, error handling |
| Verifier | GPT-OSS | 120B | Cross-validation |
The Sentinel uses the largest model (671B) because security mistakes are the most costly. I'd rather over-engineer accuracy for that agent than risk missing a vulnerability.
The Phased Review Pipeline
Reviews proceed through four phases:
**Phase 0: Intake** The Moderator analyzes the code and builds a "Code Map" - what modules exist, how data flows, what state is managed, what I/O happens. This gives subsequent agents shared context.
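The Code Map's exact schema isn't important; as a rough sketch, it might be a structure like this (the interface and field names are illustrative, not the project's actual types):

```typescript
// Hypothetical shape of the Code Map the Moderator produces in Phase 0.
// Field names are illustrative assumptions, not the real schema.
interface CodeMap {
  modules: string[];                         // files/modules in the submission
  dataFlows: { from: string; to: string }[]; // how data moves between modules
  state: string[];                           // what mutable state is managed, and where
  io: string[];                              // network, disk, and database touchpoints
}

const exampleMap: CodeMap = {
  modules: ["api.js", "db.js"],
  dataFlows: [{ from: "api.js", to: "db.js" }],
  state: ["session cache in api.js"],
  io: ["SQL queries in db.js"],
};
```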
**Phase 1: Review** Each specialist agent analyzes independently, producing structured findings in JSON format:
```json
{
  "id": "SENT-001",
  "category": "security",
  "severity": "P0",
  "confidence": 0.95,
  "where": { "file": "api.js", "lines": "42-48" },
  "claim": "SQL injection vulnerability via unsanitized user input",
  "evidence": "User input is concatenated directly into SQL query",
  "impact": "Attackers can read/modify database",
  "fix": "Use parameterized queries",
  "patch_snippet": "db.query('SELECT * FROM users WHERE id = ?', [userId])"
}
```
Every finding must include evidence pointing to specific code.
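That constraint can be enforced mechanically before a finding is accepted. A minimal check might look like this (the `Finding` type mirrors part of the JSON above; the validator itself is a sketch, not the project's code):

```typescript
// Subset of the finding schema shown above.
interface Finding {
  id: string;
  severity: "P0" | "P1" | "P2";
  where: { file: string; lines: string };
  claim: string;
  evidence: string;
}

// Reject findings that don't point at specific code:
// evidence text, a file, and at least one cited line number.
function hasConcreteEvidence(f: Finding): boolean {
  return (
    f.evidence.trim().length > 0 &&
    f.where.file.length > 0 &&
    /\d/.test(f.where.lines)
  );
}
```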
**Phase 2: Debate** The Verifier receives all findings from the other agents and cross-checks each one against the actual code. Each finding gets a verdict:
- VERIFIED: Evidence is solid, claim is accurate
- WEAK: Evidence is partial or claim is overstated
- SPECULATION: Not supported by actual code
This prevents hallucinations from making it into the final report.
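In code, that filtering step is small; a sketch (the type names are my own, assuming each finding carries the Verifier's verdict):

```typescript
type Verdict = "VERIFIED" | "WEAK" | "SPECULATION";

interface JudgedFinding {
  id: string;
  verdict: Verdict;
}

// SPECULATION is dropped outright; VERIFIED and WEAK survive into
// the final report (WEAK is kept but flagged for the reader).
function filterForReport(findings: JudgedFinding[]): JudgedFinding[] {
  return findings.filter((f) => f.verdict !== "SPECULATION");
}
```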
**Phase 3: Verdict** The Moderator synthesizes everything: ranked actions (P0/P1/P2), patch suggestions, a risk table, and a "what I would review next" list.
The Hallucination Problem
The biggest problem with LLM code review: agents claim bugs that don't exist.
"This function has a SQL injection vulnerability on line 47." Line 47 is a comment.
I tried prompt engineering: "only report issues you can prove exist." Didn't help.
The Verifier agent was the solution. It acts as a skeptic, forcing every finding to justify itself against the actual code. Weak evidence gets flagged. Pure speculation gets filtered out.
This adds latency (one more agent call) but dramatically improves accuracy.
Robust Error Recovery
Cloud APIs fail. Rate limits hit. Models time out. The system needed to survive partial failures.
I created a CloudError class with typed error codes:
```typescript
export class CloudError extends Error {
  constructor(
    message: string,
    public readonly code:
      | "RATE_LIMIT"
      | "AUTH_ERROR"
      | "MODEL_ERROR"
      | "NETWORK_ERROR"
      | "UNKNOWN",
    public readonly retryable: boolean = false
  ) {
    super(message);
    this.name = "CloudError";
  }
}
```
Different errors get different handling:
- Rate limits: Linear backoff, waiting (retries + 1) * 3000ms before each retry
- Auth errors: Fail immediately, don't retry
- Network errors: Retry twice with 2-second delays
- Model errors: Fail with helpful message
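Wired together, the retry loop around each agent call can follow that table; a sketch (the `callAgent` stand-in and the delay parameters are assumptions based on the list above, and `CloudError` is repeated in simplified form so the snippet runs standalone):

```typescript
// Simplified version of the CloudError class above, repeated here
// so this sketch is self-contained.
class CloudError extends Error {
  constructor(
    message: string,
    public readonly code: string,
    public readonly retryable = false
  ) {
    super(message);
    this.name = "CloudError";
  }
}

// Retry policy following the list above: rate limits back off by
// (retries + 1) * 3000 ms, other retryable errors wait a flat 2 s,
// and non-retryable errors (auth, model) fail immediately.
async function callWithRecovery(
  callAgent: () => Promise<string>,
  maxRetries = 2,
  rateLimitStepMs = 3000,
  retryDelayMs = 2000
): Promise<string> {
  for (let retries = 0; ; retries++) {
    try {
      return await callAgent();
    } catch (err) {
      if (!(err instanceof CloudError) || !err.retryable || retries >= maxRetries) {
        throw err; // auth/model errors, non-CloudErrors, or retries exhausted
      }
      const delay =
        err.code === "RATE_LIMIT" ? (retries + 1) * rateLimitStepMs : retryDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```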
If one agent fails, the others continue. You get partial results rather than nothing. The UI shows which agents succeeded and which had issues.
Access Code Protection
These cloud models aren't cheap. I needed to prevent unauthorized usage.
Users enter an access code that's verified server-side. Valid codes unlock the review. Invalid codes show an error. The actual API key never leaves the server.
The code is stored in sessionStorage, so users don't have to re-enter it during a session, but it clears when they close the browser.
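On the client, that flow boils down to a few lines; a sketch (the `/api/verify-code` endpoint, its response shape, and the `accessCode` storage key are assumptions, not the project's actual API):

```typescript
// Sketch of client-side access-code handling. The endpoint path and
// storage key are hypothetical; the API key itself stays on the server.
async function unlockReview(code: string): Promise<boolean> {
  const res = await fetch("/api/verify-code", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ code }),
  });
  if (!res.ok) return false;
  // Cache for this tab's session only; cleared when the browser closes.
  sessionStorage.setItem("accessCode", code);
  return true;
}
```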
What I Learned
Agent diversity beats a single stronger model. Five 120B-class models looking at code from different angles find more issues than one 671B model looking at everything. Specialization works.
Structured output schemas are essential. Without JSON schemas in the prompt, agents produce unparseable prose. With strict schemas, I get clean arrays of findings every time.
"Evidence must point to specific code" is magic. Adding this constraint to every agent prompt eliminated 80% of hallucinations. If the model can't cite a line number, it can't make the claim.
The Verifier is worth the latency. An extra 5-10 seconds of review time is worth it for accurate findings. Users would rather wait for correct results than get fast garbage.
What I'd Do Differently
Currently, agents run sequentially. I could parallelize Architect, Sentinel, Optimizer, and Maintainer - they don't depend on each other. That would cut review time by 60%.
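The parallel version is a small change if each agent is already an async call; a sketch using `Promise.allSettled` (the function name and result shape are illustrative), which also preserves the partial-results behavior because one rejected promise never sinks the others:

```typescript
// Run independent specialist agents concurrently, collecting successes
// and failures separately so a failing agent yields partial results.
async function runAgentsInParallel(
  agents: Record<string, () => Promise<string>>
): Promise<{ results: Record<string, string>; failed: string[] }> {
  const names = Object.keys(agents);
  const settled = await Promise.allSettled(names.map((name) => agents[name]()));
  const results: Record<string, string> = {};
  const failed: string[] = [];
  settled.forEach((outcome, i) => {
    if (outcome.status === "fulfilled") results[names[i]] = outcome.value;
    else failed.push(names[i]); // surfaced in the UI as "agent had issues"
  });
  return { results, failed };
}
```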
I'd also add conflict resolution. When two agents make contradictory claims about the same code, the Moderator should explicitly address the disagreement rather than including both.
And I'd implement feedback loops. Let users mark findings as "helpful" or "wrong" and use that to improve prompts over time.
Try It
- Live Demo: cortex.sreekarreddy.com
- Project Details: /portfolio/projects/cortex
- GitHub: View Source