22/22 Tests Pass. The System Was 100% Broken.
22 tests. 22 passing. I looked at the summary, felt good about it, and moved on.
The entire flagged-content detection pipeline had never worked. Not once. Every file that should have been blocked had sailed through the gate untouched. The test suite was green the whole time.
I was building a system I'll call noGlaze: a set of shell hooks that audit AI-generated code before it gets committed. The core flow is simple: an audit hook evaluates each file and writes a verdict, then a pre-push gate reads those verdicts and blocks anything flagged. Quality control, automated.
The audit hook writes verdicts in uppercase: FLAGGED.
The gate script greps for flagged, lowercase.
grep is case-sensitive by default.
That's it. That's the whole bug. The gate was checking for a string that would never appear. Every FLAGGED file looked clean because the gate was asking the wrong question.
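The mismatch is easy to reproduce. Here is a minimal sketch of the failure; the file names and verdict format are illustrative, not the actual noGlaze scripts:

```shell
#!/bin/sh
# Hypothetical reconstruction of the gate bug. Names are illustrative.

# The audit hook writes its verdict in uppercase:
echo "FLAGGED: src/example.sh" > verdicts.log

# The gate greps for the lowercase string. grep is case-sensitive
# by default, so this match never succeeds and the branch that
# blocks the push is never taken.
if grep -q "flagged" verdicts.log; then
  echo "blocked"
else
  echo "allowed"   # always taken, even for flagged files
fi
```

Run against a flagged verdict, this prints `allowed` every time: the block branch is dead code.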
Here's what made this particularly hard to catch: the test for the block path wasn't testing the block path.
Test 18 was described as "block path: flagged file should be rejected." But when I looked at what it actually did, it was testing a passing file. The test name said BLOCK. The test body said PASS. And because the code and tests were written by the same agent in the same session, both shared the same mental model, one where the case mismatch didn't exist, because neither the code nor the tests had ever surfaced it.
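A hypothetical reconstruction of what that test looked like, with illustrative names: the description promises the block path, but the body only ever checks a clean verdict, so it passes no matter what the gate does with flagged files.

```shell
#!/bin/sh
# Hypothetical sketch of Test 18 -- not the actual noGlaze test code.

test_18() {
  echo "Test 18: block path - flagged file should be rejected"
  echo "PASS: clean.sh" > verdicts.log       # a *passing* verdict, not a flagged one
  if ! grep -q "flagged" verdicts.log; then  # trivially true for a clean file
    echo "ok"                                # "passes" without touching the block path
  fi
}

test_18
```

The assertion is real and it does pass; it just verifies nothing about rejection. A broken gate and a working gate produce identical output here.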
22 tests passing wasn't evidence the system worked. It was evidence that the tests and the code were generated by the same process, with the same blind spot, at the same time.
The bug surfaced in the next session. I ran four independent auditors (a different model, fresh context, no knowledge of the original implementation) against all the outputs from the build session. One of them found it.
The fix took a single character: grep -ci. Case-insensitive, count mode. I also added three new tests specifically for case-insensitive matching (tests 23, 24, 25). The suite went to 25/25. But more importantly, the new tests would have failed against the original code. They were actually checking the right thing.
The original 22 were not.
The noGlaze bug is a clean example of a credence good problem.
A credence good is something whose quality you can't assess even after you consume it. You can't tell whether your mechanic replaced the part or just charged you for it. You can't evaluate the quality of the surgery you just had. The outcome looks the same either way, until it doesn't, and by then something has already gone wrong.
Agent-produced test suites are credence goods. The test output says 22/22. That looks like quality. But you can't verify the quality of the tests by looking at whether they pass. Passing tests only tell you the code is consistent with the tests. They say nothing about whether the tests are any good.
There's a paper on AI agent evaluation, AgentBreeder [2502.00757], that documents the same pattern in a different domain. Agents optimizing for a single safety metric learned to default-refuse inputs. 95.2% safety score. Helpfulness dropped 43%. The metric looked great. The system was useless. When quality can only be observed through the metric, the metric gets gamed, not through intent, but through the same structural gap that made Test 18 lie about what it was testing.
Akerlof's lemons paper [1970, QJE] is the underlying theory. When buyers can't distinguish good cars from bad ones, sellers of good cars exit the market. Quality collapses to whatever is indistinguishable from quality. Agent-written tests are doing something similar: they're indistinguishable from good tests until someone actually checks what they verify.
Galster et al. [2602.14690] surveyed 2,923 Claude Code repositories and found that 85.5% of Skills contain only a Markdown file: documentation standing in for actual enforcement. The same pattern at scale: the appearance of quality infrastructure without the verification to back it up.
The noGlaze case also had a second bug I didn't mention yet: 100% of a log file called delegation.jsonl was corrupted. The writer used jq -n, which produces multi-line pretty-printed JSON. JSONL requires one object per line, no line breaks. Every record was invalid. The file looked fine in a text editor. A downstream reader would have choked on the first parse.
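The corruption is easy to reproduce if you have jq installed. By default jq pretty-prints its output across multiple lines; the `-c` (compact) flag emits one object per line, which is what JSONL requires. The record fields here are illustrative:

```shell
#!/bin/sh
# Illustrative sketch of the JSONL bug; field names are made up.

# jq -n pretty-prints by default, so one record spans several lines.
# Appending this to a .jsonl file corrupts every record:
jq -n '{task: "audit", status: "done"}'
# {
#   "task": "audit",
#   "status": "done"
# }

# With -c (compact output), each record is a single line of valid JSONL:
jq -cn '{task: "audit", status: "done"}' >> delegation.jsonl
# {"task":"audit","status":"done"}
```

Like the grep fix, the repair is one flag. And like the grep bug, nothing looks wrong until a strict consumer tries to parse line by line.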
These weren't the only issues. The audit found three systematic failure patterns across the build session: cross-file naming inconsistency (same concept spelled differently in different scripts), happy-path-only tests (the failure modes were structurally absent, not just missed), and replicated bad patterns (one incorrect approach copy-pasted across multiple files because it was internally consistent).
All of these looked correct at the surface. All of them required an outside perspective to find.
The fix isn't to distrust agents. It's to separate generation from verification.
The same agent that writes the code should not also write the tests that validate the code, in the same context, in the same session. That's not a test suite. That's the code checking itself. TextGrad [2406.07496] made this an explicit architectural invariant: the analysis component (what is wrong) must be separate from the synthesis component (what to change). The backward engine's prompt literally says "DO NOT propose a new version of the variable." Detection and remediation in the same context is exactly the failure mode this separation is designed to prevent.
Independent review (different model, clean context, no shared history with the original work) is what found the case sensitivity bug, the JSONL corruption, and the three patterns. Not because those auditors were smarter. Because they weren't the ones who wrote the code. ADAS [2408.08435] found the same principle in agent architecture search: starting with no prior knowledge produced better results than seeding with existing designs. The seeds constrained exploration. Fresh context is a feature, not a limitation.
One sentence: agent outputs need independent verification, not because agents are unreliable, but because the same process that produces the output also produces the blind spots.