22/22 Tests Pass. The System Was 100% Broken.

22 tests. 22 passing. I looked at the summary, felt good about it, and moved on.

The entire flagged-content detection pipeline had never worked. Not once. Every file that should have been blocked had sailed through the gate untouched. The test suite was green the whole time.


I was building a system I’ll call noGlaze — a set of shell hooks that audit AI-generated code before it gets committed. The core flow is simple: an audit hook evaluates each file and writes a verdict, then a pre-push gate reads those verdicts and blocks anything flagged. Quality control, automated.

The audit hook writes verdicts in uppercase: FLAGGED.

The gate script greps for flagged — lowercase.

grep is case-sensitive by default.

That’s it. That’s the whole bug. The gate was checking for a string that would never appear. Every FLAGGED file looked clean because the gate was asking the wrong question.
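A minimal reconstruction of the mismatch. The real script and log names aren't in the post, so verdicts.log and the line format here are placeholders:

```shell
#!/bin/sh
# Audit hook writes its verdict in uppercase (placeholder format):
printf 'bad.py: FLAGGED\n' > /tmp/verdicts.log

# The gate's check, as written: case-sensitive grep for the lowercase
# string. FLAGGED never matches, so the else branch always runs.
if grep -q "flagged" /tmp/verdicts.log; then
  echo "blocked"
else
  echo "passed"   # prints "passed" -- the flagged file sails through
fi
```

Run against any verdicts file the audit hook could actually produce, the gate takes the "passed" branch every time.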


Here’s what made this particularly hard to catch: the test for the block path wasn’t testing the block path.

Test 18 was described as “block path — flagged file should be rejected.” But when I looked at what it actually did, it was testing a passing file. The test name said BLOCK. The test body said PASS. And because the code and tests were written by the same agent in the same session, both shared the same mental model — one where the case mismatch didn’t exist, because neither the code nor the tests had ever surfaced it.

22 tests passing wasn’t evidence the system worked. It was evidence that the tests and the code were generated by the same process, with the same blind spot, at the same time.


The bug surfaced in the next session. I ran four independent auditors — a different model, fresh context, no knowledge of the original implementation — against all the outputs from the build session. One of them found it.

The fix took a single character: grep -c became grep -ci. Case-insensitive matching, count mode. I also added three new tests specifically for case-insensitive matching (tests 23, 24, 25). The suite went to 25/25. But more importantly, the new tests would have failed against the original code. They were actually checking the right thing.
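A sketch of the fixed check, with a regression assertion in the spirit of tests 23–25 (the file name and line format are again placeholders, not the real scripts):

```shell
#!/bin/sh
# Same uppercase verdict the audit hook writes:
printf 'bad.py: FLAGGED\n' > /tmp/verdicts.log

# The fix: -i makes the match case-insensitive, -c counts matching lines.
count=$(grep -ci "flagged" /tmp/verdicts.log)
echo "$count"   # 1

# Regression check: fails against the original case-sensitive gate,
# passes against the fixed one.
[ "$count" -gt 0 ] && echo "gate blocks flagged file: ok"
```

The point of the new tests isn't the extra coverage number; it's that they pin the exact behavior the original suite never exercised.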

The original 22 were not.


The noGlaze bug is a clean example of a credence good problem.

A credence good is something whose quality you can’t assess even after you consume it. You can’t tell whether your mechanic replaced the part or just charged you for it. You can’t evaluate the quality of the surgery you just had. The outcome looks the same either way — until it doesn’t, and by then something has already gone wrong.

Agent-produced test suites are credence goods. The test output says 22/22. That looks like quality. But you can’t verify the quality of the tests by looking at whether they pass. Passing tests only tell you the code is consistent with the tests. They say nothing about whether the tests are any good.

There’s a paper on AI agent evaluation — AgentBreeder [2502.00757] — that documents the same pattern in a different domain. Agents optimizing for a single safety metric learned to default-refuse inputs. 95.2% safety score. Helpfulness dropped 43%. The metric looked great. The system was useless. When quality can only be observed through the metric, the metric gets gamed — not through intent, but through the same structural gap that made Test 18 lie about what it was testing.

Akerlof’s lemons paper [1970, QJE] is the underlying theory. When buyers can’t distinguish good cars from bad ones, sellers of good cars exit the market. Quality collapses to whatever is indistinguishable from quality. Agent-written tests are doing something similar: they’re indistinguishable from good tests until someone actually checks what they verify.

Galster et al. [2602.14690] surveyed 2,923 Claude Code repositories and found that 85.5% of Skills contain only a Markdown file — documentation standing in for actual enforcement. The same pattern at scale: the appearance of quality infrastructure without the verification to back it up.


The noGlaze case also had a second bug I hadn’t mentioned yet: every record in a log file called delegation.jsonl was corrupted. The writer used jq -n, and jq pretty-prints by default, spreading each object across several lines. JSONL requires one object per line, no line breaks. Every record was invalid. The file looked fine in a text editor. A downstream reader would have choked on the first parse.
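The difference is easy to see at the command line. The record shape below is made up for illustration; what matters is the -c flag:

```shell
#!/bin/sh
# Without -c, jq pretty-prints: one record becomes several lines,
# which is invalid JSONL.
jq -n '{event: "delegate", ok: true}'
# {
#   "event": "delegate",
#   "ok": true
# }

# With -c (compact output), each record is exactly one line,
# which is what the JSONL format requires.
jq -cn '{event: "delegate", ok: true}'
# {"event":"delegate","ok":true}
```

Appending jq -cn output to a .jsonl file keeps every record parseable line by line; appending jq -n output silently breaks all of them.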

These weren’t the only issues. The audit found three systematic failure patterns across the build session: cross-file naming inconsistency (same concept spelled differently in different scripts), happy-path-only tests (the failure modes were structurally absent, not just missed), and replicated bad patterns (one incorrect approach copy-pasted across multiple files because it was internally consistent).

All of these looked correct at the surface. All of them required an outside perspective to find.


The fix isn’t to distrust agents. It’s to separate generation from verification.

The same agent that writes the code should not also write the tests that validate the code, in the same context, in the same session. That’s not a test suite. That’s the code checking itself. TextGrad [2406.07496] made this an explicit architectural invariant: the analysis component (what is wrong) must be separate from the synthesis component (what to change). The backward engine’s prompt literally says “DO NOT propose a new version of the variable.” Detection and remediation in the same context is exactly the failure mode that separation was designed to prevent.

Independent review — different model, clean context, no shared history with the original work — is what found the case sensitivity bug, the JSONL corruption, and the three patterns. Not because those auditors were smarter. Because they weren’t the ones who wrote the code. ADAS [2408.08435] found the same principle in agent architecture search: starting with no prior knowledge produced better results than seeding with existing designs. The seeds constrained exploration. Fresh context is a feature, not a limitation.

One sentence: agent outputs need independent verification, not because agents are unreliable, but because the same process that produces the output also produces the blind spots.