Context Engineering

Treating the context window as a finite, costly resource — not an inbox.

From prompt engineering to context engineering

Prompt engineering optimizes the wording of instructions. Context engineering optimizes everything else the model sees at inference time: system prompt, tool definitions, retrieved documents, message history, and intermediate reasoning. As agents move from single-turn classification tasks to multi-turn autonomous loops, the dominant failure mode shifts from "the instructions were unclear" to "the model was given too much, or the wrong things, to reason over."

Context rot

Context windows are not free real estate. As token count grows, a model's ability to accurately recall and act on any individual piece of information decreases — a phenomenon often called context rot. This means doubling the context window does not double an agent's effective working memory; past a certain density, additional tokens add noise faster than they add signal. The practical implication: an agent with a smaller, carefully curated context will frequently outperform one with a larger, unfiltered one.

Context pollution and confusion

Two specific failure patterns show up repeatedly in production agents. Context pollution occurs when stale information lingers in context and keeps influencing later reasoning steps after it has stopped being relevant, for example a tool result from three steps ago that the model keeps referencing. Context confusion occurs when too many overlapping options (tools, retrieved chunks, instructions) make it genuinely ambiguous which one applies, so the model picks arbitrarily rather than correctly. Both are largely architectural problems rather than model limitations, and both are addressed by actively curating what stays in context instead of letting it accumulate.

Four core strategies

Offloading: move static or rarely-needed content out of the live context and into external storage (files, a memory tool, a database), pulling it back in only when actually needed.

Retrieval: rather than front-loading all potentially relevant information, let the agent pull it dynamically via tool calls at the moment it recognizes a need. This produces more targeted, better-timed retrieval than static injection.

Isolation: give sub-agents their own clean, narrow context windows for focused sub-tasks, so the complexity of one part of a job doesn't pollute the reasoning for another.

Compression: periodically distill a long history into a high-fidelity summary that preserves what matters and discards what doesn't, so the agent can continue a long-running task without dragging the full transcript forward indefinitely.
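
To make the first two strategies concrete, here is a minimal sketch of offloading paired with on-demand retrieval. The OffloadStore class, its method names, and the handle format are hypothetical, and a plain dictionary stands in for real external storage such as files or a database.

```python
from dataclasses import dataclass, field


@dataclass
class OffloadStore:
    """Hypothetical stand-in for external storage (files, database, memory tool)."""

    _items: dict[str, str] = field(default_factory=dict)

    def offload(self, key: str, content: str) -> str:
        """Store the full content outside the context and return a short handle."""
        self._items[key] = content
        return f"[offloaded:{key} chars={len(content)}]"

    def retrieve(self, key: str) -> str:
        """Bring the content back only when the agent explicitly asks for it."""
        return self._items[key]


store = OffloadStore()
context: list[str] = ["system: you are a deployment assistant"]

# Offloading: the context keeps a short handle instead of the full log.
big_log = "line\n" * 5000
context.append(store.offload("build_log", big_log))

# Retrieval: the full content re-enters only on explicit demand.
restored = store.retrieve("build_log")
```

In a real agent the retrieve call would typically be exposed as a tool, so the model decides when the full content is worth its token cost.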

A practical heuristic

Before adding anything to context, ask: does this token increase the likelihood of the desired outcome more than it increases noise? If a human engineer reading the same context would struggle to find the relevant piece of information, the model will struggle too — context engineering is fundamentally an information design problem, not a token-counting exercise.

Example: compacted context summary
{
  "compacted_context": {
    "task_id": "deploy-api-v2",
    "decisions_made": [
      "Use blue-green deployment",
      "Rollback if error rate exceeds 2%"
    ],
    "open_questions": [
      "Which region should receive traffic first?"
    ],
    "failed_approaches": [
      "Direct schema migration caused lock timeout"
    ],
    "dropped": {
      "raw_tool_outputs": true,
      "superseded_reasoning": true
    }
  }
}

Part II — The U-curve of attention

The "Lost in the Middle" paper (Liu et al., 2023) showed empirically that LLMs tend to use information at the beginning and end of the context more reliably than information in the middle, so critical details buried mid-context are often missed. This is not a minor quirk: it is one concrete mechanism behind context rot. Doubling context length does not double effective memory; it shifts the balance toward noise unless you engineer placement deliberately.

Operational rule: place non-negotiable instructions and the current task objective at the beginning and end of the working context. Never sandwich identity constraints between fifty tool-result blobs. If a human engineer cannot skim the context and find the active rule in under ten seconds, the model will likely struggle too.
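
One way to apply this rule is to assemble the context so that the rules and objective open the prompt and are restated at the close, with bulky material confined to the middle. The sketch below is illustrative only: assemble_context and its section labels are hypothetical names, not part of any framework.

```python
def assemble_context(rules: str, objective: str, middle_blocks: list[str]) -> str:
    """Put rules and objective first, bulk material in the middle, then restate both last."""
    head = f"RULES:\n{rules}\n\nOBJECTIVE:\n{objective}"
    tail = f"REMINDER OF RULES:\n{rules}\n\nCURRENT OBJECTIVE:\n{objective}"
    return "\n\n".join([head, *middle_blocks, tail])


prompt = assemble_context(
    rules="Never deploy without a passing health check.",
    objective="Roll out deploy-api-v2 to the first region.",
    # Fifty tool results sit in the middle, where recall is weakest.
    middle_blocks=[f"tool_result_{i}: ok" for i in range(50)],
)
```

Restating the rules costs a few tokens per turn; whether that trade is worthwhile depends on how long the middle section grows.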

Part II — Static vs dynamic context zones

Anthropic's prompt caching guidance formalizes what production teams discovered empirically: static content belongs at the front (system prompt, tool schemas, skill index summaries), dynamic content at the end (latest user message, fresh retrieval, current task state). The cacheable prefix can be large if it is stable; the volatile suffix should stay as small as possible while remaining sufficient for the task.
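
The static/dynamic split can be checked mechanically. The following sketch does not call any caching API; it only builds a prompt with the stable parts first and fingerprints that prefix, so a harness could detect when something volatile has leaked into the part that is supposed to stay unchanged. The function name and the sample strings are assumptions for illustration.

```python
import hashlib


def build_prompt(static_parts: list[str], dynamic_parts: list[str]) -> tuple[str, str]:
    """Return the full prompt plus a fingerprint of the static prefix."""
    prefix = "\n\n".join(static_parts)
    suffix = "\n\n".join(dynamic_parts)
    fingerprint = hashlib.sha256(prefix.encode("utf-8")).hexdigest()
    return f"{prefix}\n\n{suffix}", fingerprint


STATIC = ["system prompt text", "tool schemas", "skill index summary"]

turn_1, fp_1 = build_prompt(STATIC, ["user: start the deploy"])
turn_2, fp_2 = build_prompt(
    STATIC, ["fresh retrieval: region list", "user: which region first?"]
)

# The prefix is identical across turns, so it remains a candidate for caching.
prefix_is_stable = fp_1 == fp_2
```

If the fingerprint changes between turns, something that belongs in the dynamic suffix (a timestamp, a per-request ID) has probably been placed in the prefix.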

Partition your context budget explicitly: reserve 40–60% for stable identity and tool contracts, 20–30% for retrieved or memory-backed facts, and the remainder for recent turns and the active user request. When any zone overflows, compact or offload that zone — never compress by deleting constraints silently.
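
As a rough illustration of explicit budgeting, the sketch below splits a window into three zones and reports which zones exceed their share. The zone names, the default shares (chosen from inside the ranges above), and the 200,000-token window are assumptions, not recommendations.

```python
def zone_budgets(
    window_tokens: int, static_share: float = 0.5, retrieved_share: float = 0.25
) -> dict[str, int]:
    """Split the window into static, retrieved, and recent zones."""
    if static_share <= 0 or retrieved_share <= 0 or static_share + retrieved_share >= 1:
        raise ValueError("shares must be positive and leave room for recent turns")
    static = int(window_tokens * static_share)
    retrieved = int(window_tokens * retrieved_share)
    return {
        "static": static,
        "retrieved": retrieved,
        "recent": window_tokens - static - retrieved,  # the remainder
    }


def overflowing_zones(usage: dict[str, int], budgets: dict[str, int]) -> list[str]:
    """Name every zone whose current usage exceeds its budget."""
    return [zone for zone, limit in budgets.items() if usage.get(zone, 0) > limit]


budgets = zone_budgets(200_000)
over = overflowing_zones(
    {"static": 90_000, "retrieved": 64_000, "recent": 12_000}, budgets
)
```

Here only the retrieved zone is over budget, so only that zone would be compacted or offloaded; the static constraints are left untouched.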

Part II — Compaction that preserves signal

Compaction is not summarization for its own sake — it is lossy compression with a whitelist. Always preserve: decisions made, open questions, failed approaches (so the agent does not retry them), active entity IDs, and constraint violations encountered. Always drop: raw tool stdout, superseded reasoning, duplicate retrieval chunks, and conversational filler.
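
The whitelist idea can be sketched in a few lines. This example assumes each history item has already been tagged with a "kind" field, which is a hypothetical convention; in practice that tagging step is the hard part and may itself need a model call.

```python
PRESERVE = (
    "decisions",
    "open_questions",
    "failed_approaches",
    "entity_ids",
    "constraint_violations",
)


def compact(history: list[dict[str, str]]) -> tuple[dict[str, list[str]], int]:
    """Keep whitelisted kinds (deduplicated, in order); count everything else as dropped."""
    summary: dict[str, list[str]] = {kind: [] for kind in PRESERVE}
    dropped = 0
    for item in history:
        kind, text = item["kind"], item["text"]
        if kind in summary:
            if text not in summary[kind]:
                summary[kind].append(text)
        else:
            dropped += 1  # raw stdout, superseded reasoning, filler, and so on
    return summary, dropped


history = [
    {"kind": "decisions", "text": "Use blue-green deployment"},
    {"kind": "raw_tool_stdout", "text": "4000 lines of build output"},
    {"kind": "failed_approaches", "text": "Direct schema migration caused lock timeout"},
    {"kind": "decisions", "text": "Use blue-green deployment"},
    {"kind": "conversational_filler", "text": "Sounds good!"},
]
summary, dropped = compact(history)
```

Anything not on the whitelist is dropped by default, which is the point of the design: new kinds of noise do not need to be anticipated.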

Run compaction at predictable boundaries: after multi-step tool chains, before a sub-agent's results are merged back into the parent context, and when the token count crosses a harness-defined threshold. Avoid compacting on every turn, which erases useful local context.
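
A harness might encode these boundaries as a single predicate. The sketch below is one possible shape, with hypothetical names and illustrative default values; the actual thresholds would depend on the model and workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompactionTriggers:
    token_threshold: int = 120_000      # illustrative value
    after_tool_chain_length: int = 5    # illustrative value
    before_subagent_merge: bool = True


def should_compact(
    triggers: CompactionTriggers,
    token_count: int,
    tool_chain_length: int,
    chain_finished: bool,
    subagent_merge_pending: bool,
) -> bool:
    """Return True only at a boundary, never simply because a turn ended."""
    if token_count >= triggers.token_threshold:
        return True
    if chain_finished and tool_chain_length >= triggers.after_tool_chain_length:
        return True
    if triggers.before_subagent_merge and subagent_merge_pending:
        return True
    return False


triggers = CompactionTriggers()
```

Note that a long tool chain triggers compaction only once it has finished; compacting mid-chain would discard the local context the agent is still using.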

Part II — Sub-agent isolation as context hygiene

When a sub-agent's full working trace is merged into the orchestrator's context, you import its noise along with its answer. The orchestrator should receive condensed results — a structured summary plus any artifacts the parent truly needs — not the sub-agent's entire message history.

Case study: A research sub-agent retrieved forty documentation chunks; the orchestrator's next turn hallucinated API versions because chunk metadata drowned the user's actual question. Fix: sub-agent returns { summary, citations, confidence, artifacts[] } capped at 800 tokens; full chunks stay in sub-agent storage only.
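
The return contract from the case study could look roughly like this. The field names follow the shape given above; everything else is an assumption, including the whitespace-based token estimate, which is a crude stand-in for a real tokenizer, and the choice to trim the summary rather than the citations.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SubAgentResult:
    summary: str
    citations: list[str]
    confidence: float
    artifacts: list[str] = field(default_factory=list)


def approx_tokens(text: str) -> int:
    """Crude estimate: one token per whitespace-separated word (an assumption)."""
    return len(text.split())


def cap_result(result: SubAgentResult, max_tokens: int = 800) -> SubAgentResult:
    """Trim the summary so summary + citations + artifacts fit within the cap."""
    overhead = sum(approx_tokens(x) for x in result.citations + result.artifacts)
    room = max(max_tokens - overhead, 0)
    words = result.summary.split()
    if len(words) <= room:
        return result
    return replace(result, summary=" ".join(words[:room]))


raw = SubAgentResult(
    summary="word " * 1000,
    citations=["doc-1", "doc-2"],
    confidence=0.7,
    artifacts=["report.md"],
)
capped = cap_result(raw)
```

Blind truncation is the simplest possible cap; a production system would more likely ask the sub-agent to write within the limit, and use a check like this only as a backstop.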

Example compaction policy (content categories, triggers, placement)
{
  "compaction_policy": {
    "preserve": ["decisions", "open_questions", "failed_approaches", "entity_ids", "constraint_violations"],
    "drop": ["raw_tool_stdout", "superseded_reasoning", "duplicate_retrieval_chunks", "conversational_filler"],
    "triggers": {
      "token_threshold": 120000,
      "after_tool_chain_length": 5,
      "before_subagent_merge": true
    },
    "placement": {
      "static_prefix": ["system", "tool_schemas", "skill_index"],
      "dynamic_suffix": ["user_message", "fresh_retrieval", "task_state"]
    }
  }
}

Key takeaway

Context is not a static database; it is the model's volatile working memory, comparable to RAM. The biggest mistake novice engineers make is context pollution (clogging the context with low-value material), which degrades the model's attention to what matters; the "Lost in the Middle" effect is the best-known symptom.