Coding agents are getting fast enough that the hard part is no longer asking them to write code.
The hard part is knowing when to trust the result.
Scout came from that problem. I wanted a layer that runs after Codex, Claude Code, Cursor, or any other coding agent has changed a repo, but before that change gets merged or handed to a human reviewer.
Not another chat window. Not another code review summary.
A verification loop.
The Hackathon Constraint
I built Scout for the OpenAI Codex Hackathon in Sydney at UTS Startups.
The build window was six hours.
That shaped the product. I did not want to build a pretty wrapper around a prompt. I wanted something a judge could run, inspect, and challenge.
Most of us already vibe-code now. That is exactly the point: the interesting question is no longer whether an agent can generate code. It is whether another system can catch the mistakes that appear when an agent moves quickly.
So Scout became a post-agent verifier for AI-written code.
The Problem I Wanted To Catch
AI-written code fails in a particular way. It often looks complete before it is complete.
The model may invent a package, reference a helper that does not exist, drift away from the README, log private data while claiming to redact it, or write tests that pass without proving the real contract.
The patch can also look polished but fail when applied.
That last part matters. A repair is not useful because it sounds reasonable. It is useful when it targets the real issue, removes the risk, stays within scope, and can survive a basic execution gate.
The Shape Of Scout
The core loop is:
repo or seeded demo -> specialist scouts -> judge -> patch tournament -> execution gate -> agent handoff
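Sketched as TypeScript types, the data flow looks roughly like this. All names here are illustrative, not Scout's actual internals:

```ts
// Illustrative stage types for the loop above; my names, not Scout's.
type RepoInput =
  | { kind: "github"; url: string }
  | { kind: "seed"; id: string }; // e.g. demo://ai-written-code-seed

interface Finding {
  scout: "hallucination" | "spec-drift" | "test-theater";
  claim: string;
}

interface Verdict extends Finding {
  label: "confirmed" | "likely" | "speculative";
}

interface PatchCandidate {
  strategy: "conservative" | "idiomatic" | "robust";
  diff: string; // expected to be a plain unified diff
}

interface GateResult {
  status: "applied" | "failed" | "ineligible" | "unavailable";
}

interface HandoffReceipt {
  input: RepoInput;
  verdicts: Verdict[];
  winner?: PatchCandidate;
  gate: GateResult;
}
```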
Scout has three surfaces today.
The web app is the visual workflow. It shows the repo input, model profile, specialist scout cards, judge verdicts, benchmark scorecard, evidence graph, patch tournament, execution gate, and handoff receipt.
The API routes handle live review, fix generation, and patch scoring.
The local MCP server lets MCP-capable clients call Scout as a tool surface instead of only using the browser.
What Is Real
Scout is not only static demo data.
There is a deterministic seeded benchmark for reliable judging. There is also a live public GitHub path. When the OpenAI key is configured, Scout can fetch a bounded context from a public repository, run review agents through the OpenAI model path, generate repairs, and score patches through the same server-side flow used by the UI.
The public target repo is small and deliberately flawed. A hackathon demo should not depend on indexing a giant repository or hoping GitHub rate limits behave. The target repo gives Scout a real repo URL, real files, real GitHub fetching, and real model calls while keeping the proof path explainable.
The seeded path exists for repeatability. The live path exists to show this is not just a hardcoded storyboard.
Both matter.
Why I Used A Seeded Benchmark
For the demo, I did not want the whole product to depend on a random live model run.
So Scout includes a deterministic seed repo: demo://ai-written-code-seed.
It has seven planted AI-code mistakes:
- A fake auth package import.
- A nonexistent token verifier.
- README claims about rate limiting that the route does not satisfy.
- Raw email logging despite a redaction comment.
- Bearer token parsing that accepts malformed headers.
- A weak truthy auth test.
- A telemetry test that misses the privacy contract.
That gives the app a known answer key. Scout can say what it caught, what it missed, and whether the gate passed.
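A minimal sketch of how an answer key like that can be scored, assuming each planted mistake carries a stable id (the shape is mine, not the real manifest):

```ts
// Hypothetical manifest shape for the seeded benchmark. A fixed answer
// key is what makes recall measurable across runs.
interface PlantedMistake {
  id: string; // e.g. "fake-auth-package" (illustrative id)
  file: string;
  description: string;
}

function scoreRecall(planted: PlantedMistake[], foundIds: Set<string>) {
  const caught = planted.filter((m) => foundIds.has(m.id));
  const missed = planted.filter((m) => !foundIds.has(m.id));
  return { caught, missed, recall: caught.length / planted.length };
}
```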
This is the difference between a demo that feels good and a demo that can be measured.
Specialist Scouts
I split the review into specialist lanes:
- Hallucination Scout looks for invented imports, impossible APIs, and nonexistent helpers.
- Spec Drift Scout looks for gaps between README claims, comments, names, and implementation.
- Test Theater Scout looks for tests that pass without proving the behavior they claim to protect.
A generic reviewer tends to blend these together. Scout keeps them separate first, then lets the judge layer dedupe repeated findings and label each result as confirmed, likely, or speculative.
That separation is important because I do not want the UI to pretend every model claim is proven. Speculative findings stay visible, but they are not treated as the same thing as confirmed evidence.
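The dedupe step can be sketched simply, assuming each finding carries a file and a claim; the real judge layer is model-assisted, so this only shows the deterministic half:

```ts
// Collapse findings that point at the same file and make the same claim,
// keeping the first occurrence. Normalization here is deliberately crude.
function dedupe<T extends { file: string; claim: string }>(findings: T[]): T[] {
  const seen = new Map<string, T>();
  for (const f of findings) {
    const key = `${f.file}:${f.claim.toLowerCase().replace(/\s+/g, " ").trim()}`;
    if (!seen.has(key)) seen.set(key, f);
  }
  return [...seen.values()];
}
```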
The Patch Tournament
The strongest part of Scout is the patch tournament.
For a selected finding, Scout generates three repair strategies:
- conservative
- idiomatic
- robust
Each patch is scored for target fit, risk removal, proof, scope control, and regression risk.
But the score is not only model preference. Scout also validates patch shape and applies candidates in a temporary workspace when repo files are available.
If a patch is malformed or cannot apply, it is marked ineligible.
That is the trust boundary I wanted. A patch can look convincing and still lose.
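As a sketch, the ranking might look like this. The five axes come from the description above; the equal weighting and the eligibility filter are my assumptions:

```ts
// Illustrative scoring rubric; Scout's actual weights are its own.
interface PatchScore {
  targetFit: number;      // does it address the selected finding?
  riskRemoval: number;    // does it remove the flagged risk?
  proof: number;          // does it add or strengthen a real test?
  scopeControl: number;   // does it stay inside the intended blast radius?
  regressionRisk: number; // penalty: likelihood of breaking something else
}

const total = (s: PatchScore) =>
  s.targetFit + s.riskRemoval + s.proof + s.scopeControl - s.regressionRisk;

function rankPatches(
  candidates: { strategy: string; eligible: boolean; score: PatchScore }[]
) {
  // Malformed or unapplicable patches never win, no matter how they read.
  return candidates
    .filter((c) => c.eligible)
    .sort((a, b) => total(b.score) - total(a.score));
}
```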
The Execution Gate
The execution gate is the part that makes Scout feel different from a normal AI review.
A model can write a beautiful explanation around a broken diff. Scout treats that as a failed repair, not as a slightly worse answer.
The scoring path expects a plain unified diff. Markdown fences, JSON wrappers, tool syntax, and malformed file headers are rejected before they can be treated as valid patches.
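A rough shape check in that spirit might look like this; the exact rejection rules are Scout's, these heuristics are mine:

```ts
// Reject anything that is not a bare unified diff before scoring it.
function looksLikeUnifiedDiff(text: string): boolean {
  const t = text.trim();
  if (t.startsWith("`") || t.startsWith("{")) return false; // fences, JSON wrappers
  const lines = t.split("\n");
  const hasFileHeaders =
    lines.some((l) => l.startsWith("--- ")) &&
    lines.some((l) => l.startsWith("+++ "));
  const hasHunkHeader = lines.some((l) => /^@@ -\d+(,\d+)? \+\d+(,\d+)? @@/.test(l));
  return hasFileHeaders && hasHunkHeader;
}
```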
When repo files are available, Scout applies the patch in a temporary workspace. The check environment is stripped so API keys and local credentials are not inherited by candidate execution.
That gives the UI a clear gate status: applied, failed, ineligible, or unavailable.
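A minimal sketch of that apply step, using a temp directory and `git apply --check` with a stripped environment. Error handling is simplified, and `ineligible` is decided earlier by the shape check:

```ts
import { execFileSync } from "node:child_process";
import { mkdtempSync, mkdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

type GateStatus = "applied" | "failed" | "ineligible" | "unavailable";

function tryApply(diff: string, files: Record<string, string> | null): GateStatus {
  if (!files) return "unavailable"; // no repo files fetched for this run
  const dir = mkdtempSync(join(tmpdir(), "scout-"));
  for (const [name, content] of Object.entries(files)) {
    const path = join(dir, name);
    mkdirSync(dirname(path), { recursive: true });
    writeFileSync(path, content);
  }
  writeFileSync(join(dir, "candidate.patch"), diff);
  try {
    // PATH only: candidate execution must not inherit API keys or credentials.
    execFileSync("git", ["apply", "--check", "candidate.patch"], {
      cwd: dir,
      env: { PATH: process.env.PATH ?? "" },
    });
    return "applied";
  } catch {
    return "failed";
  }
}
```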
This is the practical claim: Scout does not just rank patches. It asks whether the patch can survive the minimum conditions required to trust it.
Token Budget And Cache Awareness
Live code review can become expensive quickly if an agent rereads the whole repo every time.
Scout makes that cost visible.
The live GitHub path uses bounded file selection. It prioritizes README and agent instruction files; package and config files; source files; test files; and auth, audit, security, rate-limit, and privacy-related code.
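A bounded selector in that spirit, with the patterns and their ordering as my guess at the priorities listed above:

```ts
// Rank files by category, then keep only the top slice.
const PRIORITY: RegExp[] = [
  /(^|\/)readme/i,                             // README files
  /(^|\/)agents?\.md$/i,                       // agent instruction files
  /package\.json$|\.config\.|\.ya?ml$/,        // package and config files
  /auth|audit|security|rate.?limit|privacy/i,  // risk-bearing code
  /test|spec/i,                                // tests
  /\.(ts|js|py|go|rs)$/,                       // remaining source files
];

const rank = (path: string): number => {
  const i = PRIORITY.findIndex((re) => re.test(path));
  return i === -1 ? PRIORITY.length : i;
};

function selectBounded(paths: string[], limit: number): string[] {
  return [...paths].sort((a, b) => rank(a) - rank(b)).slice(0, limit);
}
```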
The UI exposes inspected-file count, context characters, estimated input tokens, model profile, prompt cache key, and OpenAI cached-token telemetry when the API returns it.
The prompt layout is also intentional. Static Scout rules stay stable. Dynamic repo context comes later. That gives the model a better chance to reuse cacheable context across repeated runs.
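The layout idea itself is simple enough to show. `SCOUT_RULES` below stands in for the fixed reviewer instructions; what matters is that the prefix stays byte-stable across runs:

```ts
// Static, cacheable prefix: never varies per run.
const SCOUT_RULES = [
  "You are a verification scout for AI-written code.",
  "Report findings as confirmed, likely, or speculative.",
].join("\n");

// Volatile repo context goes last, so provider-side prompt caching can
// reuse the stable prefix on repeated runs against the same rules.
function buildPrompt(repoContext: string): string {
  return `${SCOUT_RULES}\n\n--- REPO CONTEXT ---\n${repoContext}`;
}
```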
This is not only a cost detail. It is a product detail. A verification system that hides token pressure will be hard to use in real agent workflows.
MCP As The Real Use Case
The web app is the best demo surface, but the long-term product is a tool that an agent can call.
Scout includes a local stdio MCP server built with the official TypeScript SDK. It exposes:
- scout_review
- scout_fix
- scout_score_patch
- scout_handoff
- scout_eval
It also exposes resources for the seeded manifest, seeded eval report, and demo handoff, plus prompts for review and tournament workflows.
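With the official TypeScript SDK, registering one of those tools looks roughly like this (`runReview` is a stand-in for Scout's internal review flow):

```ts
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

declare function runReview(target: string): Promise<unknown>; // hypothetical internal call

const server = new McpServer({ name: "scout", version: "0.1.0" });

server.tool(
  "scout_review",
  "Run specialist scouts over a repo URL or seeded demo id",
  { target: z.string().describe("GitHub URL or demo:// seed id") },
  async ({ target }) => ({
    content: [{ type: "text", text: JSON.stringify(await runReview(target)) }],
  })
);

await server.connect(new StdioServerTransport());
```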
The MCP server is tested through an official SDK client smoke test. The seeded MCP path runs offline. The live MCP path can call the same bounded GitHub and OpenAI review/fix flow when credentials are configured.
The current MCP path is local stdio. Hosted remote MCP and packaged plugin distribution are future work. I kept that boundary explicit because it is better to be precise about what works now than to oversell the demo.
What The Trace IDs Mean
Scout emits trace and receipt IDs for review, judge, fix, score, and handoff stages.
They are not Git commits.
They are checksummed receipts for a Scout run. They help answer what input was reviewed, which findings were produced, which patch won, and which steps were deterministic versus model-generated.
That makes the workflow easier to trust because the output is tied to a proof trail.
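The idea can be sketched in a few lines, assuming each stage's payload is serialized deterministically before hashing (the real ID format is Scout's own):

```ts
import { createHash } from "node:crypto";

// A stage receipt: hash the stage name plus a canonical serialization of
// its payload, so the same input always yields the same ID.
function receiptId(stage: string, canonicalPayload: string): string {
  const digest = createHash("sha256")
    .update(`${stage}:${canonicalPayload}`)
    .digest("hex");
  return `${stage}-${digest.slice(0, 12)}`;
}

// e.g. receiptId("review", serializedFindings) -> "review-3fa8c1d02b7e"
```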
How Someone Would Use It
The intended workflow is simple.
A developer asks a coding agent to change a repo.
Before trusting the change, they run Scout on the repo or selected target.
Scout reports the highest-risk AI-code failure modes, separates confirmed findings from weaker claims, generates competing repairs, disqualifies bad patches, and produces a handoff that can be copied back into Codex or Claude Code.
In a stronger integration, this could sit behind a CLI command, local MCP tool, pre-PR check, GitHub Action, or hosted review service.
For the hackathon, the web app proves the workflow and the MCP server proves the tool boundary.
What I Learned
The biggest lesson was that AI code review needs two kinds of intelligence.
It needs model judgment to notice subtle contradictions and propose repairs.
It also needs deterministic verification to check schemas, patch formats, seeded recall, execution eligibility, checksums, and release gates.
One without the other is not enough.
Scout is my attempt to make that boundary visible.
The question is no longer just "can an agent write the code?"
The better question is "can another system prove enough of the work before I trust it?"