Half your agent's 'hallucinations' never reached the prompt

Your agent confidently told a user that the orders service retries failed webhooks three times. It retries zero times. The instinct in the room is unanimous: it hallucinated, so we harden the prompt. Add a line — “Only state facts you can verify from retrieved context. Do not guess.” Ship it. Feel better.

That line will do nothing, because the model didn’t guess. The retry-policy doc was never in its context window. You’re about to spend an afternoon engineering a prompt to suppress an invention that never happened.

A wrong answer is two different bugs wearing the same costume. One is a retrieval failure: the right chunk never made it into the context window, so the agent answered from priors. The other is a generation failure: the right chunk was sitting in context and the agent contradicted it, skimmed it, or blended it with something else. These look identical in the output — a confident, wrong sentence. They have nothing else in common. And here’s the question that should bother you before the fix: if you can’t see what the agent actually retrieved, how do you know which bug you have?

You cannot debug a context window you cannot see

The mainstream move — pile on anti-hallucination instructions — is reasonable if every wrong answer is a generation failure. Steel-manned: a capable model with the facts in front of it that still fabricates is a model that needs tighter constraints. That’s real. It’s just not most of what you’re seeing when an agent reaches your systems through an MCP server and gets things wrong.

When an agent queries your docs, your database schema, your incident history through MCP, the retrieval step is silent. The tool call returns something, the model folds it into the window, and you only ever see the prose at the end. The window — the actual evidence the model reasoned over — is invisible unless you log it. So the first move is not a prompt. It’s a probe.

Log the retrieved payload on every tool call, verbatim, before the model touches it:

def on_mcp_tool_result(tool_name, query, result):
    log.info("retrieval", extra={
        "tool": tool_name,
        "query": query,
        "chunks": [c["id"] for c in result.get("chunks", [])],
        "bytes": len(result.get("text", "")),
        "raw": result.get("text", ""),
    })
    return result

Now the wrong answer stops being a mystery. Either the retry-policy chunk is in that raw blob or it isn’t. That single fact splits your afternoon in two.

Score retrieval and faithfulness as two separate binaries

Don’t reach for one fuzzy “quality” number. Score each failed answer on two independent yes/no axes, because they point at opposite parts of your stack. This isn’t a homegrown heuristic — it’s canonical error analysis: localize the failed stage before you fix it. The RAG evaluation literature has formalized exactly this split, RAGChecker most explicitly: retriever-side metrics (claim recall, context precision) on one side, generator-side metrics (faithfulness, the term RAGAS coined) on the other. Two layers, two scorecards.

Recall: was the supporting fact retrieved at all? Look in the logged context. The fact is either present or absent. Binary.

Faithfulness: given what was retrieved, is the answer consistent with it? Read the answer against the logged chunks only. If the chunk says “retries: 0” and the answer says “three times,” that’s unfaithful. If the chunk was never there, treat faithfulness as not applicable — the frameworks would technically score answering-from-priors as a generation-side hallucination, but for routing it’s cleaner to stop at recall: a missing chunk is a wiring problem, and you can’t be unfaithful to evidence you never saw.

That ordering is the whole discipline. Faithfulness is undefined until recall passes. A flat scorer that lumps them together marks both the retry-policy miss and a genuine contradiction as “hallucination,” and sends you to the same useless fix for both. And a blended score doesn’t just mislabel — it goes blind at exactly the moment you need to see. In one memory-agent study, a token-level F1 scorer sat nearly flat (≈0.22 vs 0.17) across a 20-point swing in actual accuracy: the aggregate number stayed quiet while the real story — which stage was failing — moved underneath it. One fuzzy quality metric doesn’t just blur the diagnosis; it goes dark right when the stack is shifting under you.

This split has empirical teeth. The same work found that swapping retrieval methods moved end-to-end accuracy by roughly twenty points (about 57% to 77%), while changing how the agent wrote or generated moved it only three to eight — and retrieval-stage quality near-perfectly predicted downstream accuracy. The lever that matters most is the one a prompt edit can’t reach.

answer: "orders retries failed webhooks three times"
  recall(retry_policy_doc):  MISS   ← chunk absent from context
  faithfulness:              N/A    ← nothing to be faithful to
  → verdict: RETRIEVAL FAILURE

versus

answer: "orders retries failed webhooks three times"
  recall(retry_policy_doc):  HIT    ← chunk says "retries: 0"
  faithfulness:              FAIL   ← answer contradicts the chunk
  → verdict: GENERATION FAILURE

Same sentence. Opposite root cause. Opposite fix.

Each verdict routes to a different layer of the stack

This is where the split pays for itself, because the two verdicts don’t just have different fixes — they have fixes on different surfaces, owned by different parts of your config.

A recall miss is a wiring problem. No prompt instruction touches it. The fact never entered scope, so there is no sentence you can add to the system prompt that makes the model reason over text it doesn’t have. The fix lives in the MCP layer: the retry doc isn’t indexed, or the MCP server exposes a search tool whose query never matched it, or chunking split the policy across a boundary so the relevant line scored too low to return. You fix the index, the tool surface, or the query — never the prose.

A faithfulness miss is a behavioral problem, and that’s exactly what rules are for. The context was right there; the model still drifted. This is the legitimate home of the instruction everyone wanted to write on day one — but now you write it knowing it will actually land, because the evidence was present:

# AGENTS.md — answering from retrieved context

- Quote the exact value from retrieved docs. Never round, infer, or
  generalize a numeric or policy field (retry counts, timeouts, limits).
- If retrieved context does not contain the answer, say so and name the
  gap. Do not fall back to general knowledge for system-specific facts.

The second bullet does quiet double duty: with a faithfulness rule in place, when the agent still says “I couldn’t find the retry policy,” you’ve converted a silent recall miss into a loud one. The agent reports the gap instead of papering over it — which sends you straight back to the MCP wiring with a clear signal.

For the heavier evaluation, isolate it. Spin a separate subagent whose only job is to read the logged chunks and the final answer and emit the two binaries — no access to the live tools, no ability to “help” by re-retrieving. A grader that can fetch its own context will paper over the exact retrieval gaps you’re trying to measure. Clean room, two booleans, done.

Where the two binaries start to fray

Two yes/no axes route most wrong answers cleanly. Two real cases bend them — and knowing where the model breaks is what keeps you from trusting the scorecard past its range.

The first is the multi-hop question. “Does the EU tier inherit the orders retry policy?” needs two chunks — the tier-inheritance rule and the retry policy itself — and the answer is right only if both arrive. Recall stops being a single binary and becomes a set: which of the supporting facts landed, and which dropped. A half-retrieved answer scores as a recall miss, correctly, but the fix is finer-grained — you’re no longer asking “did retrieval work,” you’re asking “which hop failed.” So score recall per required claim, not per question, the moment your answers depend on more than one fact. That’s exactly RAGChecker’s claim-level recall, and it’s why the framework grades individual claims instead of whole responses.

The second case is nastier, because it passes every check. Recall hits — the retry-policy chunk is right there in the window. Faithfulness holds — the answer quotes it word for word. And the answer is still wrong, because the chunk itself is stale: the doc says “retries: 3,” the code was changed to zero last quarter, and nobody updated the page. Your scorecard reads clean on both axes while the user walks away misinformed. A faithfully-quoted stale doc is the most dangerous failure in the set — it survives every automated grader and arrives wearing the same confidence as a correct answer, because by the pipeline’s own standard it is correct.

No prompt and no index fixes this one. The bug sits upstream of both, in whatever is supposed to keep the corpus honest about the system it describes. But the two-binary scorecard is still what corners it: when recall and faithfulness both pass and the answer is wrong anyway, you’ve ruled out the wiring and the model and localized the fault to the document itself. That’s a third surface — source-of-truth hygiene, owned by whoever owns the docs — and you only reach it by eliminating the other two first.

When the split isn’t worth the wiring

If your agent has no retrieval layer — it answers from the model’s parameters alone, no MCP, no docs, no tools in the loop — there is no recall axis to score, and every wrong answer genuinely is a generation problem. The whole discipline assumes a retrieval step to isolate; with nothing retrieved, you’re back to prompt and model, and that’s fine. Likewise, below a handful of wrong answers a week, the logging harness costs more than the triage it buys you — read the transcripts by hand. The split earns its keep the moment retrieval goes silent and wrong answers are frequent enough that you’re guessing at root cause. For any agent reaching real systems through MCP, that’s most of them.

The fix you reach for first is usually the fix for the bug you don’t have

Teams stall on “hallucinations” because every wrong answer looks like a generation problem from the outside — confident prose, wrong fact, obvious fix. But confidence is a property of the model, not a signal about where the fact went missing. Until you’ve logged the window and scored recall separately from faithfulness, you’re tuning the prompt for a class of bug that, half the time, your prompt cannot reach.

This is the context-engineering gap in miniature: when the agent isn’t lying it’s contextless — it answered the question without the page that holds the answer, because nothing in your wiring put that page in front of it. Log the retrieval. Score two binaries. Route recall misses to the index and faithfulness misses to the rules. Which bucket dominates is system-dependent — in some corpora reasoning and generation are the bigger failure share — but the retrieval miss is the under-diagnosed one. It wears the “hallucination” costume and gets the wrong fix more often than the reflex assumes. Sometimes the model invents. More often than you’d guess, you simply never handed it the page.