A Skill That Writes Your Skills

Watch a senior engineer crack a gnarly bug and you’ll swear you’re watching talent. The flaky integration test, green nine times, red the tenth. She doesn’t flail. She bisects the suite to isolate the offender, checks whether it shares a fixture with its neighbor, greps for shared mutable state, pins the system clock, runs it a hundred times in a loop, and inside ten minutes she’s holding the root cause: a test that leaks a global and only fails when the scheduler interleaves it with another.

That looked like intuition. It wasn’t. It was a procedure — a specific, ordered sequence of moves she’s run a hundred times until it went subconscious. And here’s the part that should bother you: next week a teammate hits the same class of flake and starts from zero. The fix got committed. The method walked out of the room in her head.

That’s the real loss. Not the bug — the method. The fix is worth nothing the second time; the method is worth everything every time. And almost nobody writes methods down.

Senior judgment is mostly procedure wearing a disguise

The consensus says expert judgment is ineffable — you either have the instinct or you don’t, and the only way to transfer it is years of osmosis. That’s half true. Taste is hard to transfer. But most of what we call senior judgment isn’t taste. It’s a decision procedure: when you see X, check Y before Z, and never skip the assumption-check at the start. That part is fully externalizable. You can write it down. You can hand it to a junior. And you can hand it to an agent.

This is the exact shape of the gap this whole site is about. An AI coding agent is broad and contextless — it knows a thousand debugging tactics in the abstract and has no idea which three your team reaches for, in what order, with what tripwires. You are narrow and deep — you know the order, the tripwires, the one fixture that leaks. The agent has the breadth; you have the procedure. Skills are how you give the agent your procedure.

A skill is a named, invokable bundle of instructions — a procedure the agent loads on demand. Not a paragraph buried in a config file it half-remembers. A discrete unit it reaches for when the trigger fires. The flaky-test hunt above is a perfect candidate. So is “cut a release,” “review this for our security rules,” “triage a prod incident.” Each is an afternoon of hard-won sequence you currently re-derive every time.

Write one skill by hand, so you know the shape

Before automating anything, encode the procedure once manually. Here’s the flaky-test hunt as a skill, in the house format:

---
name: hunt-flaky-test
description: >
  Isolate and root-cause a non-deterministic test failure.
  Use when a test passes locally but fails in CI, or fails
  intermittently with no code change. Do NOT use for a test
  that fails every run — that's a normal bug.
---

# Hunting a flaky test

## Step 0 — interrogate the assumption
Before touching anything, confirm the test is actually flaky:
re-run it 20x in a loop. If it fails every time, STOP — this
is a deterministic bug, not a flake. Say so and exit.

## Steps
1. Bisect the suite to find whether the flake appears only when
   the test runs alongside a neighbor (shared fixture or global
   state).
2. Grep the test and its fixtures for shared mutable state:
   module-level vars, singletons, env writes, temp files.
3. Pin all non-determinism: clock, RNG seed, timezone, locale.
4. Run the smallest reproducing set (the test alone, or with
   the neighbor from step 1) 100x. Reproduce the failure there
   before proposing any fix.
5. Only once reproduced: propose the minimal fix and re-run 100x
   to confirm green.

## Inputs needed
- The failing test's path
- CI logs showing the intermittent failures (if available)

Two things make this load-bearing, and most hand-written procedures skip both.

Step 0 is an assumption-interrogation gate. The agent’s first instinct is to start fixing. A senior’s first move is to verify the problem is the problem you think it is. Encoding “before you touch anything, prove this is actually a flake” is the single highest-leverage line in the file — it stops the broad-but-eager agent from confidently solving the wrong problem.

The description is the trigger, written for a reader who hasn’t loaded the body. It says exactly when to reach for this and, just as important, when not to. That precision is what makes the skill fire at the right moment rather than at the wrong one, or never.
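The Step 0 gate is simple enough to make mechanical. Here is a minimal Python sketch of it, assuming the test can be run as a shell command whose exit code signals pass or fail. The names `classify_runs`, `run_once`, and `step_zero` are hypothetical, and the "not-reproduced" outcome is an addition the skill above does not spell out.

```python
import subprocess
from typing import Callable, Sequence


def classify_runs(results: Sequence[bool]) -> str:
    """Classify a batch of outcomes, where True means the run passed."""
    if not results:
        raise ValueError("need at least one run")
    failures = list(results).count(False)
    if failures == 0:
        return "not-reproduced"   # assumption: never failed, so nothing to hunt yet
    if failures == len(results):
        return "deterministic"    # fails every run: a normal bug, not a flake
    return "flaky"                # mixed results: proceed to the numbered steps


def run_once(command: Sequence[str]) -> bool:
    """Run the test command once; exit code 0 counts as a pass."""
    return subprocess.run(list(command), capture_output=True).returncode == 0


def step_zero(
    command: Sequence[str],
    runs: int = 20,  # the skill's Step 0 asks for 20 re-runs
    runner: Callable[[Sequence[str]], bool] = run_once,
) -> str:
    """Re-run the test `runs` times and report what kind of failure it is."""
    return classify_runs([runner(command) for _ in range(runs)])
```

A call might look like `step_zero(["pytest", "path/to/test_file.py::test_name"])`, with a placeholder path. Anything other than "flaky" means the hunt stops before it starts.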

Now the move that scales it: a skill that writes skills

Writing one skill is fine. Writing thirty by hand is how a good idea dies on contact with a busy week. So don’t. Build one meta-skill — a skill-authoring skill — and let it manufacture the rest.

The skill-authoring skill does one thing: it interviews you about a procedure you just ran the hard way, then emits a correctly-formatted skill file. You invoke it the moment you finish solving something nasty, while the sequence is still hot in your head.

Get the obvious objection out of the way first, because Anthropic’s own skill best-practices doc makes much the same point: you do not need a “writing skills” skill to teach the agent the file format. The model already understands the SKILL.md shape natively; ask it to write one and it will produce well-structured YAML and a sensible body without help. So if format were the point, the meta-skill would be redundant. Format isn’t the point. The meta-skill earns its keep on the two things the model doesn’t know. The first is your house conventions: the Step 0 assumption-gate, the trigger / anti-trigger discipline, kebab-case verb naming, routing the file into the right directory. The second is your timing: capturing the procedure at the one moment it’s cheapest to extract, right after you solved something the hard way. That’s the gap exactly: the agent has the format; you have the conventions and the moment. The meta-skill is how you hand it both.

This isn’t a novel trick, either. Anthropic ships a real meta-skill, skill-creator, that creates, improves, and measures skills — and its process mirrors the example below almost step for step: capture intent, write the SKILL.md, spin up two or three eval cases, run the skill against a baseline in parallel, iterate, and tune the description until it triggers reliably. We’re claiming the lineage, not inventing it.

---
name: author-skill
description: >
  Capture a procedure you just performed as a new reusable skill.
  Use right after solving a non-trivial task you'll face again.
  Interviews for trigger, steps, and inputs, then writes a new
  skill file in the house format.
---

# Authoring a new skill

Interview the user, one question at a time. Do not write the
file until all five are answered.

1. **Trigger** — what situation should make an agent reach for
   this? Get the negative case too: when should it NOT fire?
2. **Assumption to interrogate** — what does a novice wrongly
   assume here that wastes an afternoon? This becomes Step 0.
3. **Steps** — the ordered sequence. Push for the *order* and
   the tripwires, not just the actions.
4. **Inputs** — what must the agent be handed to start?
5. **Name** — a short kebab-case verb phrase.

Then write the skill file to `.<agent>/skills/<name>/SKILL.md`.
In Claude Code that resolves to `.claude/skills/<name>/SKILL.md`
for a project, or `~/.claude/skills/<name>/SKILL.md` for one you
want everywhere. The file contains:
- YAML frontmatter: `name`, and a `description` that states
  both when to use AND when not to use it.
- A `## Step 0` assumption-interrogation gate.
- Numbered steps preserving order and tripwires.
- An `## Inputs needed` section.

Keep the frontmatter description tight: house rule, ~40 words.
(The platform's hard cap is 1024 characters; the name allows
only lowercase letters, numbers, and hyphens, max 64 characters.
~40 words is *our* discipline, well under the ceiling.) Push
detail DOWN into the body. The body is read only when the skill
runs.

Notice what just happened. The tacit knowledge in a senior’s head — the order, the tripwires, the “check this first” — gets pulled out by an interview and frozen into a file, in one command, at the one moment it’s easiest to extract: right after the win. The afternoon you spent root-causing becomes a capability anyone on the team — or any agent — invokes by name.
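The last mile of that interview, turning the answers into a file at the right path, is mechanical enough to sketch. What follows is a minimal Python illustration, not a description of how any agent does it internally. The `Interview` dataclass, `render_skill`, `skill_path`, and `write_skill` are hypothetical names, and the bare `## Step 0` heading is a simplification of the house format.

```python
import textwrap
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Interview:
    """Answers collected by the interview (hypothetical structure)."""
    name: str          # short kebab-case verb phrase
    description: str   # trigger plus negative case
    title: str         # human-readable heading for the body
    step_zero: str     # the assumption to interrogate first
    steps: list[str]   # ordered steps, tripwires included
    inputs: list[str]  # what the agent must be handed to start


def render_skill(a: Interview) -> str:
    """Render the answers as a skill file: frontmatter, Step 0, steps, inputs."""
    wrapped = textwrap.fill(a.description, width=60,
                            initial_indent="  ", subsequent_indent="  ")
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(a.steps, start=1))
    inputs = "\n".join(f"- {item}" for item in a.inputs)
    return (
        "---\n"
        f"name: {a.name}\n"
        "description: >\n"
        f"{wrapped}\n"
        "---\n\n"
        f"# {a.title}\n\n"
        "## Step 0\n"
        f"{a.step_zero}\n\n"
        "## Steps\n"
        f"{steps}\n\n"
        "## Inputs needed\n"
        f"{inputs}\n"
    )


def skill_path(agent_dir: Path, name: str) -> Path:
    """Where the file lands, e.g. Path('.claude') for a project-level skill."""
    return agent_dir / "skills" / name / "SKILL.md"


def write_skill(agent_dir: Path, a: Interview) -> Path:
    """Create the skill directory if needed and write SKILL.md into it."""
    path = skill_path(agent_dir, a.name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_skill(a), encoding="utf-8")
    return path
```

The template is the easy part. The value is in the interview that fills it, which is why the sketch takes the answers as given.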

Progressive disclosure is what keeps a fat library cheap

Here’s the objection, and it’s a good one: if you encode thirty procedures, don’t you bloat the agent’s context window with thirty walls of text on every turn? If skills cost context, a growing library taxes every interaction — the cure becomes the disease.

This is exactly why the format discloses itself in layers, and why the meta-skill enforces the discipline. The mechanic is progressive disclosure, and it’s officially three levels, not two:

1. The metadata (name plus description, around a hundred tokens) is always loaded. It’s the index entry. The agent reads thirty one-line descriptions to know what’s available. Cheap.
2. The body (the full procedure, the steps, the tripwires) is loaded only when the skill is invoked. Anthropic’s own guidance is to keep the body under 500 lines, call it 5k tokens. Until the trigger fires, it costs nothing.
3. Bundled scripts and resources (a validator, a reference file, a generator) sit alongside the skill and load or execute on demand. A 200-line linting script costs zero context until the procedure calls it, and even then only its output lands in the window, not its source.

So adding the thirty-first skill grows the always-on index by one line and costs nothing more until that skill is invoked. The library scales without taxing every turn. That’s the whole trick, and it’s why the meta-skill’s instruction “keep the description short, push detail down into the body” isn’t style. It’s the load-bearing economics. A skill with a 300-word description that should’ve been a 40-word index entry quietly poisons the budget the library was supposed to protect.
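The arithmetic is worth seeing once. This back-of-the-envelope Python sketch uses the round figures quoted above (about 100 tokens of metadata per skill, about 5k tokens per body). Both constants are rough assumptions rather than measurements, and the function names are made up for illustration.

```python
METADATA_TOKENS = 100  # assumed: name + description, always in context
BODY_TOKENS = 5_000    # assumed: rough ceiling for one body, loaded on invoke


def context_cost(n_skills: int, n_invoked: int = 0) -> int:
    """Tokens spent under progressive disclosure: index always, bodies on demand."""
    if not 0 <= n_invoked <= n_skills:
        raise ValueError("n_invoked must be between 0 and n_skills")
    return n_skills * METADATA_TOKENS + n_invoked * BODY_TOKENS


def eager_cost(n_skills: int) -> int:
    """Tokens spent if every skill's metadata and body were loaded on every turn."""
    return n_skills * (METADATA_TOKENS + BODY_TOKENS)


print(context_cost(30))      # 3000: thirty index entries, nothing invoked
print(context_cost(31))      # 3100: the thirty-first skill adds one entry
print(context_cost(30, 1))   # 8000: one body pulled in when its trigger fires
print(eager_cost(30))        # 153000: what the library would cost without layering
```

Under these assumptions the idle library costs about a fiftieth of the eager one, and the gap widens with every skill added.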

That third level is the sharpest form of the thesis: a skill can ship a 200-line validator and still cost nothing to index, because the agent reads the script’s result, never the script. Breadth in the index, depth on demand, and the heaviest depth pushed all the way out to a file that only runs when the moment arrives.

This is also where the agent’s contextless breadth stops being a liability. The agent doesn’t need to hold your thirty procedures in working memory. It needs a cheap index it can always see, and the deep procedure delivered exactly when the matching situation arrives.

The cost progressive disclosure doesn’t pay

Cheap to index is not the same as easy to choose. Progressive disclosure solves the token cost of a fat library; it does nothing for the selection cost. Thirty one-line descriptions load cheaply, but they’re also thirty things the agent has to discriminate between on every turn — and that description is the only signal it gets at the moment it chooses. The body, the steps, the tripwires sit on disk, invisible until after the pick is made. So the real failure mode of a growing library isn’t a blown context budget. It’s the agent firing hunt-flaky-test on a test that fails every single run, or reaching for cut-release when you asked it to tag a hotfix, because two descriptions blurred into one in the index.

This is why the meta-skill spends a whole interview question on the trigger and its negative case. A description that says only what a skill does, never when to skip it, is the one that mis-fires once the library gets crowded. The forty words that keep a skill cheap to index are the same forty words that keep it accurate to pick — one line of text doing double duty, and the reason “tighten the description” is the highest-leverage edit in the file.
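Both halves of that double duty can be checked mechanically, at least roughly. The Python sketch below is one hypothetical heuristic, not a feature of any agent platform. It flags descriptions that blow the forty-word budget or never state a negative case, and it uses plain word overlap (Jaccard similarity) as a crude stand-in for "two descriptions blurred into one." The cue phrases, stopword list, and 0.5 threshold are all assumptions to tune.

```python
import re
from itertools import combinations

MAX_WORDS = 40  # the house rule, not the platform cap
NEGATION_CUES = ("do not use", "don't use", "not for", "skip when")  # assumed phrasings
STOPWORDS = {"a", "an", "and", "the", "or", "to", "for", "of", "in", "it",
             "use", "when", "not", "do", "that", "this", "is"}


def words(text: str) -> list[str]:
    """Lowercase word tokens; hyphenated terms split into their parts."""
    return re.findall(r"[a-z0-9']+", text.lower())


def lint_description(description: str) -> list[str]:
    """Return house-rule problems with a single description."""
    problems = []
    count = len(words(description))
    if count > MAX_WORDS:
        problems.append(f"too long: {count} words (budget {MAX_WORDS})")
    if not any(cue in description.lower() for cue in NEGATION_CUES):
        problems.append("no negative case: say when NOT to use it")
    return problems


def overlap(a: str, b: str) -> float:
    """Jaccard similarity of the two descriptions' non-stopword vocabularies."""
    set_a = set(words(a)) - STOPWORDS
    set_b = set(words(b)) - STOPWORDS
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def blurry_pairs(descriptions: dict[str, str],
                 threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Pairs of skills whose descriptions share too much vocabulary."""
    flagged = []
    for first, second in combinations(sorted(descriptions), 2):
        score = overlap(descriptions[first], descriptions[second])
        if score >= threshold:
            flagged.append((first, second, score))
    return flagged
```

Word overlap will miss descriptions that mean the same thing in different words, so treat a clean report as necessary rather than sufficient. The real test is still whether the agent picks the right skill on real tasks.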

The rules pointer: telling the agent which skill to reach for

The descriptions get you most of the way there. You can make selection far more predictable with a few lines in your rules file, the AGENTS.md or CLAUDE.md that’s always in context, mapping problem classes to skills:

## Reaching for skills
- Intermittent / CI-only test failure → use the `hunt-flaky-test` skill.
- Cutting a release → use the `cut-release` skill.
- Reviewing for security → use the `security-review` skill.
- Just solved something non-trivial you'll hit again →
  use the `author-skill` skill to capture it.

That last line is the flywheel made explicit. The rules file doesn’t just route to existing skills — it routes to the skill that creates skills. The library is instructed to grow itself.

If you’d rather pull than have the agent push, wire the same procedures as slash commands — /hunt-flaky-test, /author-skill — so a human invokes them by name on demand. Skills are what the agent autonomously reaches for; slash commands are the same procedures with a human finger on the trigger. Most teams want both: the agent self-serves from the library when it recognizes the situation, and you fire a command directly when you already know which procedure you need.

The next time you solve something the hard way

Stop treating your debugging sessions as one-offs. The fix you commit is the cheap half of the work. The method you used — the order, the assumption you checked first, the tripwire you knew to avoid — is the asset, and right now it evaporates the moment you close the terminal.

The discipline is small and it compounds. Encode procedures as skills. Let progressive disclosure keep the library free to grow — index always on, bodies on demand. Bootstrap the whole thing with one skill that writes the others by interviewing you while the win is still warm. Point your rules file at the library so the agent reaches for the right procedure, and at the meta-skill so the library keeps filling itself.

The gap between your broad, contextless agent and your deep, narrow team closes one procedure at a time. So the next time you crack something hard, don’t just fix it. Capture it. Turn your afternoon into a command — and never solve it from zero again.