Dial model, thinking, and speed per task to control cost

The race is fixed, the copy tweak shipped hours ago, and if you scroll back through the afternoon you can see the two dials moving in opposite directions across a single session — light-and-shallow for the error string, capable-and-deep for the ledger bug. That’s not two settings you stumbled into; it’s the actual skill this chapter teaches: treating model and reasoning as a per-task spend, not a session-wide default you set once and forget. This last lesson is about making that a habit, and about the one case where you deliberately pay more.

Two independent dials

It helps to see clearly that you’ve been turning two separate knobs, and they don’t have to move together:

Model — which brain answers. Capable for hard reasoning, lighter and faster for mechanical work. Set with /model.
Effort / thinking — how hard that brain reasons before it acts. Up for ambiguity, down for the obvious. Set with /effort and the thinking toggle.

The cost map across the four corners is the whole point — and it’s better run than read. Here’s a lumpy day with both dials live on every task:

A lumpy day: four kinds of task, and two dials on each — which model answers, and how much effort it spends thinking. Everything starts where most people leave it: pinned to the expensive corner. Re-dial each task and watch what the day costs.

first runs495 unitsredo tax0 unitsvs the matched day1.9×

Everything ships — no failures, no redo tax — and the day still costs 1.9× what it should. That’s the quiet leak of a pinned dial: the mechanical work bills like hard work, five and three times over. Dial it down to the cheapest corner that ships it; the hard problems keep their budget.

Units are illustrative — one unit is roughly the light model running a small task at low effort. A capable model bills ~5× per token; high effort generates ~3× the tokens; the ratios are the point, not the prices. The redo tax counts an underpowered task’s failed attempts plus the escalation you do anyway — not the hour you spend reading confident wrong answers, which is the real bill.

The diagonal is where most people lose money. Running the capable model at high effort on a rename is paying top dollar for a task with one obvious answer. Running deep effort on a light model is spending thinking tokens a weaker brain can’t fully cash. You want the matched corners: cheap-and-shallow for mechanical work, expensive-and-deep for the genuinely hard problem — and you want to move between them as the work changes, which on a lumpy day like this one is often.

When speed is the thing you’re buying

There’s a third lever that isn’t about capability at all — it’s about latency. Some sessions are interactive in a way where waiting hurts: live debugging, rapid iteration where you’re firing a prompt every thirty seconds and the agent’s response time is the bottleneck. For those, Claude Code has fast mode, toggled with /fast:

/fast                # toggle fast mode on (or off again)

Fast mode runs the capable model markedly quicker at a higher per-token cost — same quality, lower latency, more money. It’s currently a preview feature and the exact tradeoff and availability are documented in fast-mode, so check there before you lean on it. The judgment is narrow and worth stating plainly:

Worth it — interactive, latency-sensitive work where your waiting time is the expensive resource. Rapid iteration, live debugging with a deadline.
Not worth it — long autonomous runs, CI, batch jobs, anything you walk away from. There, nobody’s waiting on latency, so you’d be paying the speed premium for nothing.

A practical note from the docs worth carrying: switching into fast mode mid-conversation re-prices the whole context at the faster rate, so if you know a session wants speed, turn it on at the start rather than flipping it on partway through. And fast mode stacks with the other dials — a lower effort level on top of fast mode is maximum speed for straightforward interactive work.

The discipline, stated once

Strip away the commands and this chapter is a single context-engineering move. You already learned to spend the context window on purpose in Sessions & context — watch it, fill it with signal, reclaim it before it bloats. Model and thinking are the same instinct aimed at the agent’s effort: spend the capable model and deep reasoning where the context is genuinely hard — the race no one could solve by pattern-matching — and pull both down the moment the work goes mechanical. The gap this whole site is about is the agent’s missing context; the cost lever is making sure you pay for closing that gap on the hard problems, not for over-powering the easy ones.

Do that by reflex and a lumpy day costs what it should: pennies on the copy tweak, real budget on the ledger bug, and nothing wasted on the diagonal between them.

Where this goes next

Everything so far has been one agent, with one model and one reasoning budget, working a task in front of you. That’s the base case — and it has a ceiling. When a job is genuinely big, or when one noisy step (reading a giant log, churning through a test suite) would flood the very context window you just learned to guard, the next move isn’t a bigger gear on a single agent. It’s more agents: pushing that noisy work into an isolated context of its own, and fanning a hard problem out across parallel workers that each carry their own window and report back a summary.

That’s the whole idea of Subagents — and it’s where the course goes next.