The Test Harness That Makes AI Agents Verify Themselves

Code generation stopped being the hard part months ago. My Claude-driven agent can scaffold a Next.js route, wire up a Supabase query, and write the component faster than I can read the diff — and that is exactly the problem. The bottleneck in 2026 is not writing code. It is trusting it. On this site I run a long-running agent that ships features end to end, and the only reason I let it run semi-autonomously is that it answers to a test harness it cannot talk its way past: a feature_list.json that tracks 57 features with hard pass/fail status, an init.sh that boots a clean dev server, tiered Playwright suites that run smoke before critical before high, and console-error capture that maps every failure back to a file and a line. The agent does not get to mark a feature "done." The harness does. Here is the loop that turns an eager code generator into something I can actually walk away from.

The real 2026 bottleneck is verification, not generation

If you spend any time reading Hacker News right now, you have seen the same thread resurface in five different costumes: when AI writes the software, who verifies it? The consensus that keeps forming is blunt. Generation speed is a solved problem. Verification capacity is not. An agent can produce a plausible-looking pull request in ninety seconds, and then a human has to spend forty minutes deciding whether it is real.

This got more urgent the moment coding agents stopped being copilots and became background workers. Cursor will build in parallel. Antigravity schedules background tasks. Claude Code runs from the web, the desktop, and Slack. The work increasingly happens while you are not watching, which means "looks done" and "is done" have quietly become two completely different claims — and only one of them ships safely.

My answer is not a smarter model or a better prompt. It is a dumb, deterministic harness that the smart, non-deterministic agent has to satisfy. The intelligence proposes. The harness disposes.

The harness, end to end

The whole system is four artifacts and one rule. The rule: the agent never reports its own success. Everything else exists to enforce that.

feature_list.json is the single source of truth

Every feature the site is supposed to have lives in one JSON file — 57 of them at last count — each with an id, a priority, a test spec reference, and a status that is one of pending, passing, or failing. The agent reads this file at the start of every session to decide what to work on, and it is forbidden from hand-editing the status field. Status is an output of the test run, never an input the agent gets to type.
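As a rough illustration of the shape described above, here is a minimal TypeScript sketch. The key names, the `medium` and `low` priority values, and the `selectNextFeature` helper are all assumptions for the example; the article only says each entry has an id, a priority, a test spec reference, and a status.

```typescript
// Hypothetical schema: field names and the "medium"/"low" priorities are
// assumptions; only the three status values come from the article.
type Status = "pending" | "passing" | "failing";
type Priority = "critical" | "high" | "medium" | "low";

interface Feature {
  id: string;
  priority: Priority;
  spec: string; // path to the test spec that decides the status
  status: Status;
}

const PRIORITY_ORDER: Priority[] = ["critical", "high", "medium", "low"];

// Pick the highest-priority feature that is not yet passing.
// The agent only reads the list here; it never writes status.
function selectNextFeature(features: readonly Feature[]): Feature | undefined {
  return features
    .filter((f) => f.status !== "passing")
    .sort(
      (a, b) =>
        PRIORITY_ORDER.indexOf(a.priority) - PRIORITY_ORDER.indexOf(b.priority),
    )[0];
}

// Made-up entries, for illustration only.
const exampleFeatures: Feature[] = [
  { id: "blog-index", priority: "high", spec: "e2e/blog.spec.ts", status: "pending" },
  { id: "login", priority: "critical", spec: "e2e/login.spec.ts", status: "failing" },
  { id: "home", priority: "critical", spec: "e2e/home.spec.ts", status: "passing" },
];

const next = selectNextFeature(exampleFeatures);
```

With these made-up entries, `next` is the `login` feature: it is the only critical item that is not already passing.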

That single constraint kills the most common failure mode of autonomous agents: declaring victory. An agent that can write its own grade will always give itself an A. Because only the test runner can write the status field, the file becomes a contract instead of a diary.

init.sh makes every run start from the same place

Flaky environments produce flaky verdicts, and a flaky verdict is worse than no verdict because it teaches the agent the wrong lesson. init.sh boots the dev server, waits for it to be reachable, seeds whatever the tests need, and fails loudly if anything is off. Determinism in the environment is what makes the test results mean something.
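The "wait until reachable, otherwise fail loudly" step is the part most worth getting right. init.sh itself is a shell script and its contents are not shown here, so the following is only a TypeScript sketch of the same idea. The `waitUntilReachable` name, the attempt count, and the delay are assumptions.

```typescript
// Poll a probe until it reports the server is up, or give up with an error.
// The probe is injected so the retry logic does not depend on any HTTP client.
async function waitUntilReachable(
  probe: () => Promise<boolean>,
  attempts = 30,
  delayMs = 1000,
): Promise<void> {
  for (let i = 0; i < attempts; i++) {
    // A probe that throws (e.g. connection refused) counts as "not up yet".
    const up = await probe().catch(() => false);
    if (up) return;
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  // Fail loudly: a half-booted server must never reach the test run.
  throw new Error(`server not reachable after ${attempts} attempts`);
}
```

A probe could be as small as `async () => (await fetch("http://localhost:3000")).ok`, assuming the dev server listens on port 3000. The point of the hard failure is that the tests never run against an environment that did not come up cleanly.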

claude-progress.txt is memory between runs

A long-running agent that forgets everything each session repeats its own mistakes forever. claude-progress.txt is an append-only log of what happened last time — which feature was touched, what passed, what is still red, what was learned. It is the difference between an agent that makes progress and an agent that walks in circles with great enthusiasm.
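The format of the log is not specified here, so the following is just one plausible shape for an entry, sketched in TypeScript. The `ProgressEntry` fields and the `formatEntry` helper are assumptions that mirror the four things the log is said to record.

```typescript
// Hypothetical entry shape: what was touched, what passed,
// what is still red, and what was learned.
interface ProgressEntry {
  timestamp: string; // ISO 8601, e.g. new Date().toISOString()
  feature: string;
  passed: string[];
  stillFailing: string[];
  learned: string;
}

// Render one entry as plain text. Entries are only ever appended to the
// end of the log, never rewritten, so the history stays intact.
function formatEntry(e: ProgressEntry): string {
  return [
    `## ${e.timestamp} feature=${e.feature}`,
    `passed: ${e.passed.join(", ") || "none"}`,
    `still failing: ${e.stillFailing.join(", ") || "none"}`,
    `learned: ${e.learned}`,
    "",
  ].join("\n");
}
```

In Node, the result could be written with `fs.appendFileSync("claude-progress.txt", formatEntry(entry))`, which adds to the end of the file instead of overwriting it.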

The verification loop the agent cannot fake

The actual loop is boring on purpose: orient, select one feature, implement, verify, and only then commit. The verification step is where the design earns its keep.

Tiered tests: smoke, then critical, then high

Tests run in tiers. Smoke tests run first because they are cheap and they catch the catastrophic stuff — did the app even boot, do the core routes return 200. There is no point running a forty-case suite against a server that 500s on the homepage. Only after smoke passes does the harness run critical-priority specs, then high-priority ones. Cheap signal first, expensive signal last, fail fast at every gate.
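The gating logic can be sketched in a few lines of TypeScript. This is an illustration of the fail-fast idea, not the actual harness code. `runTiers` and the injected `runTier` callback are hypothetical names.

```typescript
type Tier = "smoke" | "critical" | "high";

// Cheapest tier first; order matters.
const TIERS: Tier[] = ["smoke", "critical", "high"];

interface TierResult {
  tier: Tier;
  passed: boolean;
}

// runTier is injected so the gate logic stays independent of the test runner.
function runTiers(runTier: (tier: Tier) => boolean): TierResult[] {
  const results: TierResult[] = [];
  for (const tier of TIERS) {
    const passed = runTier(tier);
    results.push({ tier, passed });
    if (!passed) break; // fail fast: later, more expensive tiers never run
  }
  return results;
}
```

One way to implement `runTier` is to shell out to something like `npx playwright test --grep @smoke` and check the exit code, assuming specs are tagged by tier. How the tiers are actually selected is not stated here.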

Console errors map back to a file and a line

Every Playwright run captures browser console output. A test can pass its assertions while the console is screaming about a hydration mismatch or an unhandled promise rejection — so a clean console is part of the definition of done, not an afterthought. When something does break, the harness maps the error back to a source file and line and attaches a suggestion for the common patterns. The agent does not get a vague "something failed." It gets "this file, this line, probably this fix," which is the difference between a useful loop and a guessing game.
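To make the mapping concrete, here is a small TypeScript sketch that pulls the first application frame out of an error stack and attaches a hint. The regex, the `HINTS` table, and the function names are assumptions for illustration; the real harness may resolve locations differently (for example through source maps). In Playwright, the input text would typically come from `page.on("pageerror", ...)` or from console messages of type `error`.

```typescript
interface SourceLocation {
  file: string;
  line: number;
  column: number;
}

// Matches "path/to/file.tsx:12:5" style locations inside a stack frame.
const LOCATION = /([^\s()]+\.(?:tsx?|jsx?)):(\d+):(\d+)/;

// Return the first frame that points at application code.
function locateError(stack: string): SourceLocation | undefined {
  for (const frame of stack.split("\n")) {
    if (frame.includes("node_modules")) continue; // skip library frames
    const m = LOCATION.exec(frame);
    if (m) return { file: m[1], line: Number(m[2]), column: Number(m[3]) };
  }
  return undefined;
}

// Hypothetical hint table: message pattern -> suggestion for the agent.
const HINTS: Array<[RegExp, string]> = [
  [/hydrat/i, "check for markup that differs between server and client render"],
  [/unhandled.*rejection/i, "await the promise or attach a catch handler"],
];

function suggestFix(message: string): string | undefined {
  const hit = HINTS.find(([pattern]) => pattern.test(message));
  return hit ? hit[1] : undefined;
}
```

Skipping `node_modules` frames is a deliberate choice in this sketch: the agent can only act on a location inside its own code, so the first application frame is more useful than the frame where the error was thrown.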

Status auto-sync closes the gap

After the suite runs, npm run test:e2e:update-features writes the real results back into feature_list.json. The agent never touches that field by hand. If the tests are green, the feature flips to passing. If not, it is marked failing and the agent stays on it. The agent cannot certify its own work; the test runner does it, every time, mechanically.
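The internals of that script are not shown, but the core of a status sync can be sketched as a pure function in TypeScript. The `SpecResult` shape and the `syncStatuses` name are assumptions; the sketch also assumes each feature points at exactly one spec file.

```typescript
type Status = "pending" | "passing" | "failing";

interface Feature {
  id: string;
  spec: string;
  status: Status;
}

// Hypothetical per-spec summary extracted from a test report.
interface SpecResult {
  spec: string;
  ok: boolean;
}

// Derive statuses from test results only. Features whose spec did not run
// keep their previous status instead of being guessed at.
function syncStatuses(
  features: readonly Feature[],
  results: readonly SpecResult[],
): Feature[] {
  const bySpec = new Map(results.map((r) => [r.spec, r.ok] as const));
  return features.map((f): Feature => {
    const ok = bySpec.get(f.spec);
    if (ok === undefined) return f;
    return { ...f, status: ok ? "passing" : "failing" };
  });
}
```

Note that nothing in this function takes input from the agent: the only way to change a status is to change what the tests report.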

The session workflow

Each session is scoped to exactly one feature. The agent orients by reading the progress log and the feature list, selects the single highest-priority incomplete item, bootstraps the environment, runs smoke tests to confirm it did not inherit a broken tree, implements the feature alongside its test, and then runs the suite until it is green with a clean console. Only then does it write a descriptive commit and update the progress log.

One feature per session is a deliberate constraint. It keeps the diff reviewable, keeps the blast radius small, and means that when something does go wrong, there is exactly one suspect. Agents are far more reliable doing one bounded thing well than ten ambitious things halfway.

Guardrails that stop an autonomous agent from making things worse

An agent with test access and no guardrails will eventually "fix" a failing test by deleting the assertion. The harness assumes this and fences it in. There is an iteration cap, so the agent cannot grind forever burning tokens on the same red test. There is a file-scope constraint, so a feature task cannot quietly rewrite unrelated parts of the codebase. After every fix, the full relevant suite reruns, so a patch that fixes one test and breaks two is caught immediately rather than committed. And the agent is explicitly taught to distinguish a test bug from an application bug before it edits anything — because the lazy fix is almost always to blame the test, and the lazy fix is almost always wrong.
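Two of these guardrails, the file-scope constraint and the iteration cap, are simple enough to sketch in TypeScript. The function names, the prefix-based scope rule, and the callback signatures are assumptions made for the example, not the harness's actual interface.

```typescript
// File-scope guard: return every changed file that falls outside the
// directories this feature task is allowed to touch.
function outOfScope(
  changedFiles: readonly string[],
  allowedPrefixes: readonly string[],
): string[] {
  return changedFiles.filter(
    (file) => !allowedPrefixes.some((prefix) => file.startsWith(prefix)),
  );
}

interface FixOutcome {
  passed: boolean;
  iterations: number;
}

// Iteration cap: attempt a fix, rerun the full relevant suite, and stop
// either on green or when the budget is spent.
function fixLoop(
  attemptFix: () => void,
  suitePasses: () => boolean,
  maxIterations: number,
): FixOutcome {
  for (let i = 1; i <= maxIterations; i++) {
    attemptFix();
    if (suitePasses()) return { passed: true, iterations: i };
  }
  return { passed: false, iterations: maxIterations };
}
```

A non-empty result from `outOfScope` would be grounds to reject the change before any test runs, and a `passed: false` outcome from `fixLoop` is the signal to stop and hand the feature to a human instead of spending more tokens.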

None of this is exotic. It is the same discipline a good senior engineer applies to a junior's pull request, encoded so a machine applies it consistently at 3am.

What broke, and what I would do differently

The harness did not arrive fully formed. The first real lesson was that flaky tests poison everything: a test that fails one run in five teaches the agent that red is sometimes fine, which is the exact opposite of what you want. Quarantining flaky specs aggressively mattered more than adding new coverage.

The second lesson was that I over-trusted smoke tests early on. A green smoke suite means the building is standing, not that the room you just renovated is wired correctly. Smoke is necessary, never sufficient.

And the honest one: there is still a class of changes — anything touching auth, payments, or data migrations — where I do not let the loop close on its own. The harness gets the agent to a defensible, tested starting point, and then a human makes the call. Verification infrastructure raises the floor dramatically. It does not yet remove the ceiling, and pretending otherwise is how you ship a confident, fully-green disaster.

Steal this: a minimal harness for your own repo

You do not need my exact stack to get most of the value. The portable core is three pieces. First, a single machine-readable source of truth for what "done" means — a feature_list.json, a checklist, a spec folder, whatever — where status is written by tests and never by the agent. Second, a one-command bootstrap (init.sh) that produces an identical environment every time so results are trustworthy. Third, a tiered test command that fails fast and feeds errors back in a form the agent can act on, mapped to files and lines.

Wire those together with one rule (the agent proposes, the harness disposes) and you can adapt it to anything. Swap Playwright for your framework, swap the npm scripts for your task runner. The architecture is stack-agnostic because the principle does not depend on the stack: never let the thing being graded hold the pen.

Where this goes next

The interesting frontier is that the platforms are now building this layer too. Next.js shipped experimental Agent DevTools and browser-log forwarding so an agent can see what it actually rendered. There is growing talk of formal verification gates for coding loops. Supabase added an AI-driven security advisor that flags exposed tables before they become incidents. The industry is independently converging on the same conclusion I reached by hand: in a world where the code writes itself, the durable engineering skill is designing the system that decides whether to trust it.

For now, my agent ships features while I sleep, and I wake up to a green board or a precise red one. That is not because the model got smart enough to trust. It is because I stopped asking it to grade its own homework.

See it: the verification loop

I sketched the whole loop as an interactive Excalidraw whiteboard — feature list, to the agent implementing one feature, to a clean init.sh boot, to the tiered test run, to console-error mapping, to the status auto-sync that writes the verdict, and back to the top. Explore it here: https://etomco.com/whiteboards/verification-loop-ai-agent-test-harness