Open any team’s AGENTS.md and the first rule is almost always “do not hallucinate APIs” or “do not invent functions that don’t exist.” It’s a reasonable fear. It’s also probably not your top failure. You wrote it because the internet told you to, not because you counted.
That’s the bug. Your rules file is a defense against the failures you imagined, not the ones you committed. And imagined failures and real failures are different distributions — sometimes wildly so. The agent that you’re sure hallucinates APIs might actually be fine on APIs and instead reach for npm install on a pnpm repo, or write tests in the wrong framework, or quietly delete your error handling to make a transcript look clean. You’ll never know which, because intuition doesn’t count. It anchors.
Here’s the question worth holding onto: if you ranked every correction you made this week by frequency, what would actually sit at the top — and would it be anywhere on your rules file?
Your transcripts already contain the answer
The normal state is writing rules from vibes. You hit an annoying mistake, you add a line. Three weeks later the file is forty lines of accumulated grievances with no sense of which line earns its place. Nobody re-reads it. The agent re-reads it on every single turn, and you’ve handed it a priority order that reflects your recency bias, not its behavior.
The fix is to treat your own agent sessions as a dataset. Every transcript is a log of decisions; every correction you typed is a labeled failure. You already paid for this data. You just never analyzed it.
Pull the raw material first. Most agents persist sessions as JSON or JSONL on disk:
```sh
# Claude Code keeps sessions per-project under ~/.claude/projects/
find ~/.claude/projects -name '*.jsonl' -mtime -14 > /tmp/recent-sessions.txt
wc -l /tmp/recent-sessions.txt
```

Two weeks of work is usually enough to see structure. You’re not chasing statistical significance. You’re chasing the difference between “I think it hallucinates” and “nineteen of my forty-one corrections were the same import-style mistake.”
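Before any coding pass, it helps to shrink each transcript down to the turns where you might have corrected the agent. The sketch below is one way to do that, not a definitive parser. It assumes each JSONL line is an object with `type` equal to `user` and a `message.content` that is either a string or a list of text blocks; verify that against your own session files. The cue phrases in `CORRECTION_CUES` are hypothetical and deliberately crude: they surface candidates for you to read, they do not classify anything.

```python
import json
from typing import Iterable, Iterator

# Assumed cue phrases that tend to open a correction; tune them to your own habits.
CORRECTION_CUES = ("no,", "don't", "do not", "instead", "wrong", "revert", "undo")


def user_texts(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text of each user turn from JSONL lines (schema is an assumption)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or corrupt lines rather than abort
        if not isinstance(record, dict) or record.get("type") != "user":
            continue
        message = record.get("message")
        content = message.get("content") if isinstance(message, dict) else None
        if isinstance(content, str):
            yield content
        elif isinstance(content, list):
            # Content may be a list of blocks; keep only the text blocks.
            for block in content:
                if isinstance(block, dict) and block.get("type") == "text":
                    yield block.get("text", "")


def candidate_corrections(lines: Iterable[str]) -> list[str]:
    """Return user turns that contain a correction cue, for manual review."""
    return [
        text
        for text in user_texts(lines)
        if any(cue in text.lower() for cue in CORRECTION_CUES)
    ]
```

To use it, open each path listed in `/tmp/recent-sessions.txt` and pass the file object to `candidate_corrections`. Expect false positives; the point is to cut the reading load before you start labeling.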
Code the failures the way a researcher codes interviews
Borrow a method from qualitative research: open coding, then axial coding. It sounds heavy. It’s an hour.
Open coding is the first pass: read each point where you corrected the agent, and tag it with a short label in your own words. Don’t pre-decide the categories. Just name what you see.
```text
session 03 → ran `npm` on a pnpm repo
session 03 → imported lodash when repo bans it
session 07 → swallowed the exception, returned null
session 07 → ran `npm` on a pnpm repo
session 11 → wrote vitest test in a jest project
session 11 → ran `npm` on a pnpm repo
```

Axial coding is the second pass: collapse the raw labels into categories and count. The six labels above aren’t six problems. They’re four, and one of them is already winning. Tallied across every session, not just this excerpt:
```text
wrong-package-manager  ███████ 7
wrong-test-framework   ███     3
silent-error-swallow   ██      2
banned-dependency      ██      2
api-hallucination      ▏       0
```

There it is. The failure the internet warned you about scored zero. The failure eating your afternoons, the agent reaching for npm because that’s the statistical default of its training data, never made it onto a single rules file in your org, because it never felt dramatic enough to write down. The count promotes it. Your intuition would have buried it under “don’t hallucinate.”
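The axial pass itself is a small mapping plus a counter. A minimal sketch, using only the six open-coding labels from the session excerpt earlier (so the counts here are smaller than in the full tally); the `CATEGORY_OF` map is hand-written and hypothetical, which is the point: deciding which labels belong together is your judgment, not the script’s.

```python
from collections import Counter

# Open-coding labels from the six-line excerpt, one per correction.
open_codes = [
    "ran npm on a pnpm repo",
    "imported lodash when repo bans it",
    "swallowed the exception, returned null",
    "ran npm on a pnpm repo",
    "wrote vitest test in a jest project",
    "ran npm on a pnpm repo",
]

# Axial step: a hand-written map from raw label to category (illustrative names).
CATEGORY_OF = {
    "ran npm on a pnpm repo": "wrong-package-manager",
    "imported lodash when repo bans it": "banned-dependency",
    "swallowed the exception, returned null": "silent-error-swallow",
    "wrote vitest test in a jest project": "wrong-test-framework",
}


def tally(labels: list[str]) -> list[tuple[str, int]]:
    """Count corrections per category, most frequent first."""
    counts = Counter(CATEGORY_OF.get(label, "uncategorized") for label in labels)
    return counts.most_common()


def render(rows: list[tuple[str, int]]) -> str:
    """Draw one text bar per category."""
    if not rows:
        return ""
    width = max(len(name) for name, _ in rows)
    return "\n".join(f"{name:<{width}}  {'█' * n} {n}" for name, n in rows)


print(render(tally(open_codes)))
```

Labels missing from the map land in `uncategorized` rather than raising, so a growing `uncategorized` bar is a signal that the map needs another pass.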
Promote by frequency, and bring the receipts
Now the rules file writes itself, in priority order. The top category becomes the top rule, stated as a hard constraint:
```md
## Package management
- This repo uses pnpm. NEVER run npm or yarn. Install with `pnpm add`.
- The CI lockfile is pnpm-lock.yaml. A package-lock.json in a diff is a bug.
```

But a rule is just an assertion, and assertions are the weakest form of context. The stronger move is to attach the actual failing example as a few-shot: the agent learns more from one concrete wrong→right pair than from three sentences of prohibition:
```md
## Error handling
- Do not swallow exceptions to produce clean output.
  WRONG:
    try:
        do_thing()
    except:
        return None
  RIGHT: let it raise, or raise a typed error with the original cause
```

Frequency decides what to write. The transcript supplies how to write it. You’re no longer guessing at the failure or the phrasing; both come straight off the evidence.
And ordering is not cosmetic. The agent re-reads this file on every turn, and the top of the file gets the most attention before the context window fills with code, diffs, and tool output. A rule buried at line thirty competes with everything you’ve loaded since. Putting the highest-frequency failure first isn’t tidiness — it’s spending your scarcest resource, the agent’s attention near the top of the prompt, on the mistake it’s most likely to make next.
Frequency is the default sort, not the only one
Here’s the obvious objection: frequency under-weights the rare catastrophe. In the table above, silent-error-swallow scored 2 and wrong-package-manager scored 7, but an npm slip gets caught by a lockfile check in CI, while a swallowed exception ships to production and surfaces three weeks later as a null where a value should be. Counting alone would rank the cheap, loud failure above the expensive, quiet one.
So don’t sort by raw count. Sort by count × blast radius. Add a second column — how bad is one instance of this — and let the product decide the order:
```text
                       count  cost  rank
silent-error-swallow     2    ▲▲▲   high
wrong-package-manager    7    ▲     med
wrong-test-framework     3    ▲     low
banned-dependency        2    ▲▲    med
```

Frequency is still the default sort, for two reasons: it’s the cheapest signal you have, and it corrects the worst bias, which is writing elaborate rules for the zero-frequency fear while the real failures go unwritten. But once the count has pulled the actual failures out of the noise, severity decides which of them goes first. A failure that happens twice and corrupts data outranks one that happens seven times and gets linted away. The count is your prior, not your verdict.
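The count × blast radius sort can be sketched as a weighted score. The counts below come from the table; the numeric weights are an assumption, not something the table specifies. A linear scale (1, 2, 3) would still put wrong-package-manager first (7 × 1 beats 2 × 3), so this sketch uses a steeper scale (1, 3, 10) to reflect the idea that one data-corrupting failure costs far more than one linted-away slip. Pick weights that match what an incident actually costs your team.

```python
# Severity weights are illustrative assumptions: ▲ = low, ▲▲ = medium, ▲▲▲ = high.
# The steep scale lets a rare, costly failure outrank a common, cheap one.
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 10}

failures = [
    # (category, count over two weeks, cost of a single instance)
    ("wrong-package-manager", 7, "low"),
    ("wrong-test-framework", 3, "low"),
    ("silent-error-swallow", 2, "high"),
    ("banned-dependency", 2, "medium"),
]


def rank(rows: list[tuple[str, int, str]]) -> list[tuple[str, int]]:
    """Score each category as count × severity weight, highest score first."""
    scored = [(name, count * SEVERITY_WEIGHT[cost]) for name, count, cost in rows]
    return sorted(scored, key=lambda row: row[1], reverse=True)


for name, score in rank(failures):
    print(f"{score:>3}  {name}")
```

With these weights the swallowed exception scores 20 and moves to the top, ahead of the package-manager slip at 7, which matches the ordering in the table.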
When the correction isn’t the agent’s fault
Not every correction is a labeled agent failure, and treating them all as one will pollute your rules file with phantom defects. Two cases to filter out before you start counting.
First, the ambiguity case. You told the agent “add a test,” it picked vitest in a jest repo, and you corrected it. That’s not misbehavior — it’s a missing piece of context the agent had no way to know. The rule you write is the same (this repo uses jest), but the framing matters: you’re filling a gap, not policing a defect, and that reframe stops you from larding the file with adversarial “NEVER do X” lines for things the agent would have gotten right with one more sentence of context.
Second, the sample-size case. A label that appears once isn’t a pattern; it’s noise. The method only earns its keep above a threshold of repetition — you’re looking for the failure that recurs, not the one-off you happened to notice. If two weeks of sessions produce fifteen corrections and all fifteen are different, you don’t have a rules problem. You have a sample-size problem, or your own prompting is the variable that’s changing run to run — and no AGENTS.md rule fixes an inconsistent operator. Wait for the distribution to show structure before you write to it.
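The sample-size check is easy to make mechanical. A minimal sketch, assuming a repetition threshold of two; that number is an illustrative default, not a rule from the method, and `recurring` and `has_structure` are hypothetical helper names.

```python
from collections import Counter


def recurring(counts: Counter, min_repeats: int = 2) -> dict[str, int]:
    """Keep only categories seen at least `min_repeats` times, most frequent first."""
    return {name: n for name, n in counts.most_common() if n >= min_repeats}


def has_structure(counts: Counter, min_repeats: int = 2) -> bool:
    """True when at least one category recurs; False means wait for more data."""
    return bool(recurring(counts, min_repeats))


# Fifteen corrections, all different: nothing to promote yet.
all_different = Counter(f"one-off-{i}" for i in range(15))

# One recurring category plus a one-off: only the recurring one survives.
mixed = Counter({"wrong-package-manager": 7, "typo-in-readme": 1})
```

Here `has_structure(all_different)` is false and `recurring(mixed)` keeps only `wrong-package-manager`. A false result is the cue to collect more sessions or look at your own prompting, not to write a rule.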
Make the analysis a loop, not a one-time ritual
The reason most people never do this is that reading transcripts by hand is grim. So don’t do it by hand. The coding pass is itself an agent task, and a perfect one for a subagent, because failure analysis is exactly the kind of work you want isolated in its own context window, not polluting the session that’s trying to ship code.
Run it headless on a schedule. Point an agent at the last two weeks of sessions, hand it the open-coding/axial-coding method as its instructions, and have it emit a ranked failure table plus proposed rule diffs:
```sh
# /tmp/recent-sessions.txt holds one transcript path per line (from the find command earlier)
claude -p "Read the session transcripts listed in /tmp/recent-sessions.txt. \
For every point where the user corrected the agent, open-code it with a short label. \
Collapse labels into categories, count each, and output a frequency-ranked table. \
For the top 3, draft AGENTS.md rules with a real wrong→right example pulled from the transcript." \
  --output-format text > failure-report.md
```

Your rules file stops being a shrine to old fears and becomes a living readout of where this specific agent, on this specific codebase, keeps failing this specific team. When the distribution shifts (you fix the pnpm problem and test-framework drift moves to the top), the ranking shifts with it, and so does what you write next.
The gap between a capable agent and a useful one was never about the model. It’s that the agent doesn’t know your context, and the cheapest place to learn it is the record of every time it already got your context wrong. Count the corrections. The top of the list is your next rule.
Stop defending against the failure the world fears. Defend against the one your own logs keep printing.
For the mechanics: see Rules for how the file is loaded each turn, Subagents for isolating the analysis pass, and Headless for running it on a schedule.