The Relay, Not the Window

A million-token window won't fix a long build. Externalize state to a plan file and run a relay across fresh sessions, because quality dies long before the window fills.

The Relay, Not the Window

An hour into a big feature, your agent goes senile.

It re-adds a helper it deleted twenty minutes ago. It annotates a function with any after spending the morning being strict about types. You ask it to wire up the new endpoint and it asks, politely, where the router lives — the router it edited three times already. The window isn’t full. The work isn’t done. And the model you trusted at 9am is now guessing.

The instinct is to blame the model, or to reach for a bigger context window. Both are wrong. The thing that broke isn’t intelligence. It’s position.

Capability is highest early, and you’re spending it on nothing

Section titled “Capability is highest early, and you’re spending it on nothing”

Here’s the part the spec sheets don’t advertise. A model’s ability to reason over its context isn’t flat across the window. It’s sharpest in the first stretch — when the context is short, clean, and every token is signal. As the window fills, attention spreads thinner, earlier facts get crowded by later noise, and the model starts pattern-matching instead of reasoning. By the time you’re deep into a long session, you’re running the same model at a fraction of the quality it had when you started.

The exact shape of that curve is model-specific and not something to quote as a benchmark. But the direction is not in dispute, and you can feel it: the back half of a long session is where the dumb mistakes cluster. The any types. The forgotten file. The confident reintroduction of dead code.

This isn’t folklore. When researchers stress-tested eighteen frontier models — the current Claude, GPT, and Gemini families among them — on tasks deliberately kept trivial, every one got measurably worse as the input grew, often well before the advertised window was anywhere near full. A model sold with a 200k-token window can start slipping at 50k. And the decline is continuous, not a cliff you can see coming — there’s no alarm at the moment the answers start getting worse.

The mechanism is in the architecture. Attention isn’t uniform across a long context; it’s roughly U-shaped, biased toward the very start and the very end, with the middle systematically underweighted — the “lost in the middle” effect. The position embeddings most models use decay the link between distant tokens, so a fact you established forty minutes and thirty thousand tokens ago is exactly the kind of thing that fades into the underweighted middle. The model didn’t forget your decision. It just can’t attend to it as strongly as the noise that arrived after it.

A bigger window does not fix this. A million-token window just means you can degrade for longer before you hit the wall. You’ll fill it with two hours of half-relevant tool output and stale reasoning, and the model will be just as lost at token 600k as it was at 150k — only now it’s lost expensively.

This is a context-engineering problem in its purest form. The agent is broad and capable but has no durable memory of what this build has done so far. You, the engineer, are the one holding the thread — what’s done, what’s next, why the last attempt failed. The gap between the agent’s general competence and your specific, accumulated state is exactly the gap this site is about closing. The usual answer is to write context down. For a long build, you write down the plan.

Before the recipe, kill the thing most people reach for first.

When the window gets tight, most agents offer to compact: summarize the conversation so far and continue with the summary in place of the raw history. Auto-compact does this for you, silently, mid-task. It sounds like memory. It isn’t.

Each compaction pass is lossy. The agent decides what mattered and throws the rest away — and it does not have your judgment about what mattered. The decision you made forty minutes ago about why you’re not using the cache layer gets summarized to nothing, and ten minutes later the agent helpfully adds the cache layer. Worse, compaction compounds. Summarize a summary a few times and you get fuzz: a vague, confidently-worded gist of your build that drifts further from reality each pass. Repeated auto-compaction is how a strong model goes haywire on a task it could have nailed.

Durable memory does not belong in a summary the agent regenerates lossily on its own schedule. It belongs in a file you control. So take the agent out of the compaction business — see configuration for where this toggle lives in each tool. In Claude Code, the reliable switch is in-session: run /config and flip Auto-compact off. That hands you the full window with nothing held in reserve for a summary you never asked for.

/config → toggle "Auto-compact" off

There’s also a global flag — claude config set -g autoCompactEnabled false — but treat it as best-effort rather than a guarantee; the in-session toggle is the one that behaves predictably across versions. Either way, the goal is the same: nothing rewrites your history but you.

Now when the window fills, nothing happens automatically. That’s the point. You decide when to reset, and you carry the state across yourself — in a file.

The recipe: a plan file as external memory, run as a relay

Section titled “The recipe: a plan file as external memory, run as a relay”

The move is to stop treating one conversation as the unit of work and start treating one phase as the unit. State lives in a file. Each phase runs in a fresh session that reads the file, does one chunk, checks it off, and commits. Then you clear and hand the baton to the next fresh session. A relay, not a marathon.

It combines three primitives: plan mode to produce the plan without touching code, rules to make every fresh session start with your conventions already loaded, and configuration to disable the compaction that would otherwise corrupt the relay.

Step 1 — Plan first, in plan mode, into a file

Section titled “Step 1 — Plan first, in plan mode, into a file”

Don’t let the agent start editing. Put it in plan mode (the read-only, propose-don’t-touch mode covered in plan mode) and have it produce a phased plan as a real file in the repo. Phases, checkboxes, file paths. Concrete enough that a session with zero memory of this conversation could pick up any single phase and execute it.

# PLAN.md — Add org-scoped API keys
## Phase 1 — Schema & migration
- [ ] Add `api_keys` table (id, org_id, hash, last_used_at) — db/schema.ts
- [ ] Generate migration — db/migrations/
- [ ] Add `ApiKey` type — src/types/auth.ts
## Phase 2 — Issuance endpoint
- [ ] POST /orgs/:id/keys, returns key once — src/routes/keys.ts
- [ ] Store only the hash; never log the raw key — src/lib/keys.ts
- [ ] Unit tests for hashing + uniqueness — src/lib/keys.test.ts
## Phase 3 — Auth middleware
- [ ] Resolve key -> org on each request — src/middleware/apiKey.ts
- [ ] 401 on revoked/unknown — src/middleware/apiKey.ts
- [ ] Update `last_used_at` async — src/middleware/apiKey.ts

The file paths matter. They are how a fresh session orients in seconds instead of re-deriving your layout from scratch. The checkboxes matter too — they are the relay’s batons. An unchecked box is unambiguous: not done yet.

Step 2 — Make every fresh session start smart

Section titled “Step 2 — Make every fresh session start smart”

A relay only works if each runner starts at full speed. That’s what rules are for. Your AGENTS.md (or CLAUDE.md) loads automatically on every new session, so a fresh context isn’t a blank one — it already knows your stack, your conventions, and the relay protocol itself:

AGENTS.md
## Stack
- pnpm, not npm. Strict TypeScript — never use `any`.
- Tests live next to the file they test.
## Long builds
- For multi-phase work, the plan of record is PLAN.md.
- Execute ONE phase per session. Check off each item as you finish it.
- Commit at the end of each phase. Do not start the next phase.

Now “fresh session” doesn’t mean “amnesiac contractor.” It means a runner who’s read the playbook before stepping onto the track.

Step 3 — Run one phase per session, then clear and commit

Section titled “Step 3 — Run one phase per session, then clear and commit”

The execution loop is dull on purpose. Open a fresh session and give it exactly one instruction:

Complete Phase 1 of @PLAN.md and check off each item as you go.

The agent reads the plan, does Phase 1 in its sharp early-context zone, ticks the boxes, and stops. You review the diff. Then the ritual at every phase boundary:

Terminal window
git add -A && git commit -m "feat(auth): api key schema + migration (Phase 1)"
# then: clear the session entirely. New context. Reload the plan.

Clear means clear — not compact, not continue. The next phase starts from a clean window with the plan re-attached and the rules auto-loaded. The committed state is your checkpoint; if Phase 3 goes sideways, you reset to a green tree, not to a tangled half-conversation.

Step 4 — Restart before the wall, not at it

Section titled “Step 4 — Restart before the wall, not at it”

The trigger for clearing isn’t “the window is full.” It’s a context gauge passing somewhere around 40–50% — finish the item you’re on, tick it, clear, reload. Treat that number as a heuristic to tune, not a benchmark; the right line depends on the model and the messiness of the task. The principle is fixed even if the number isn’t: stay in the sharp early zone, and never let a session drift into its degraded tail.

Picture two ways to spend the identical token budget on the same feature.

Path A — the marathon. One session. You push through all three phases. By Phase 3 the window is around 150k tokens deep, thick with tool output, retries, and a compacted summary or two. The agent is in its degraded tail. It reintroduces a helper, slips an any into the middleware, and you spend the last hour cleaning up after the thing meant to be cleaning up after you.

Path B — the relay. Three fresh sessions, same total tokens spread across them. Every phase runs in clean early context. The edits are tighter because nothing is competing for the model’s attention but the current phase and the plan. Each commit is a safe checkpoint. The hard debugging in Phase 3 gets the model’s best reasoning instead of its worst.

Same model, same budget, same task. The only variable is whether each unit of work ran early in a window or late in one. That single variable is doing most of the work in how good the output is — and it’s a variable you control completely, for free.

Where the relay breaks — and where it’s overkill

Section titled “Where the relay breaks — and where it’s overkill”

The relay isn’t free, and pretending it is would be its own kind of dishonesty. Two failure modes are worth naming.

The first is a stale plan. The file is only external memory if it stays true. If Phase 1 reveals that the schema needs a column the plan never anticipated, and you don’t update PLAN.md before clearing, the next runner inherits a confident, wrong map — and a fresh session trusts the file precisely because it has no memory to contradict it. The discipline is: amend the plan as part of finishing a phase, not as an afterthought. The checkbox and the correction land in the same commit.

The second is phases that are too big. The whole point is that each phase fits comfortably inside the sharp early zone. A “phase” that’s really half the feature will itself drift into the degraded tail before its boxes are ticked, and you’re back to running a marathon — just one with a fancier name. If a phase consistently blows past your reset line before it’s done, that’s not a signal to push through. It’s a signal the plan needs splitting.

And sometimes the relay is simply the wrong tool. For a tight bug fix, a single exploratory spike, or any task that lives and dies inside one sharp session, the ceremony of plan-file-and-handoff is pure overhead — you’ll spend more tokens scaffolding the relay than the work itself costs. The relay earns its keep when the work is genuinely longer than one good session: multi-phase features, migrations, anything where you can already feel the second hour coming. Below that threshold, just do the work and clear when you’re done. Reach for the relay when the alternative is watching a strong model slowly lose the plot.

The next time you’re an hour into something big and the context gauge slides past 40%, don’t push on and don’t let it auto-compact. Finish the current item. Check the box. Commit. Clear the session. Reload the plan into a fresh window and keep going.

You’re not making the model smarter — you can’t. You’re refusing to spend its smartest tokens on stale context. The window is not your memory. The plan file is. Hand off the baton before the runner tires, and the relay will beat the marathon every time.