Retrieval-Augmented Generation

Letting an agent fetch knowledge on demand instead of carrying it all in the prompt.

Why retrieval exists

No system prompt, however long, can contain everything an agent might need to know. RAG solves this by storing knowledge externally — in a vector database, a search index, or a structured store — and retrieving only the relevant slice at the moment it's needed. This keeps the context window lean while still giving the agent access to a knowledge base that may be orders of magnitude larger than anything that could fit in a single prompt.

From pipeline to tool call

Early RAG implementations treated retrieval as a fixed pipeline step: embed the query, search the vector store, inject the top-k results into the prompt, generate. This works for simple lookup tasks but breaks down for multi-step reasoning, where the right query isn't known until partway through the reasoning process. The more robust pattern treats retrieval as a tool the agent calls explicitly, as many times as needed, with queries it formulates itself based on what it has already learned — turning retrieval from a static pre-processing step into an active, iterative part of the reasoning loop.
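The difference between the two patterns can be sketched as a small loop. This is a minimal illustration, not a reference implementation: `run_agent`, `toy_policy`, `toy_search`, and the `DOCS` table are hypothetical names, and `toy_policy` stands in for the LLM that would normally decide whether to search again or answer.

```python
from typing import Callable

def run_agent(question: str,
              next_step: Callable[[str, list[str]], dict],
              search: Callable[[str], list[str]],
              max_calls: int = 4) -> tuple[str, list[str]]:
    """Let the policy interleave retrieval and reasoning within a step budget."""
    evidence: list[str] = []
    queries: list[str] = []
    for _ in range(max_calls):
        step = next_step(question, evidence)
        if step["action"] == "answer":
            return step["text"], queries
        # The query is formulated from what has been learned so far.
        queries.append(step["query"])
        evidence.extend(search(step["query"]))
    return "insufficient evidence within the retrieval budget", queries

# Hypothetical stand-ins for a vector store and an LLM policy.
DOCS = {
    "webhook X retry policy": ["Webhook X follows the retry-standard policy."],
    "retry-standard backoff": ["retry-standard uses exponential backoff, max 5 attempts."],
}

def toy_search(query: str) -> list[str]:
    return DOCS.get(query, [])

def toy_policy(question: str, evidence: list[str]) -> dict:
    if not evidence:
        return {"action": "search", "query": "webhook X retry policy"}
    if not any("backoff" in e for e in evidence):
        # The second query only becomes knowable after reading the first result.
        return {"action": "search", "query": "retry-standard backoff"}
    return {"action": "answer", "text": " ".join(evidence)}
```

A fixed pipeline would have issued only the first query; the second one depends on a term ("retry-standard") that appears in the first result.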

Chunking strategy

How a document is split before embedding directly determines retrieval quality. Fixed-size chunking is simple and predictable but frequently splits a coherent idea across two chunks, weakening both. Recursive chunking respects document structure — paragraphs, sections, headers — and tends to produce more semantically coherent pieces. Semantic chunking goes further, using the content itself (rather than fixed boundaries) to decide where one idea ends and another begins, at the cost of more preprocessing complexity. The right choice depends on document type: dense technical documentation usually benefits from structure-aware chunking, while conversational or unstructured text may tolerate simpler fixed-size approaches.
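To make the contrast concrete, here is a minimal sketch of fixed-size and recursive chunking in plain Python. The function names, the separator order, and the character-based size limit are assumptions chosen for illustration; production splitters usually count tokens rather than characters.

```python
def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut every `size` characters, optionally overlapping consecutive chunks."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def recursive_chunks(text: str, max_chars: int,
                     separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator present, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: list[str] = []
        current = ""
        for piece in text.split(sep):
            candidate = current + sep + piece if current else piece
            if len(candidate) <= max_chars:
                current = candidate  # keep packing pieces into this chunk
                continue
            if current:
                chunks.append(current)
            if len(piece) > max_chars:
                # The piece alone is too big: retry with finer separators.
                chunks.extend(recursive_chunks(piece, max_chars, separators[i + 1:]))
                current = ""
            else:
                current = piece
        if current:
            chunks.append(current)
        return chunks
    # No separator applies: fall back to a hard cut.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

text = "Retries use backoff.\n\nAuth uses tokens. Tokens expire hourly."
fixed = fixed_size_chunks(text, 25)         # first cut lands inside the word "Auth"
structured = recursive_chunks(text, 40)     # one chunk per paragraph
```

On this sample the fixed-size splitter cuts mid-word at the paragraph boundary, while the recursive splitter keeps each paragraph whole because both fit under the limit.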

Embedding model selection

Not all embedding models are interchangeable. Key tradeoffs include dimensionality (higher dimensions can capture more nuance but cost more to store and search), domain fit (a general-purpose embedding model may underperform a domain-tuned one on specialized technical or legal text), latency and cost per embedding call, and whether the model supports the languages your content actually uses. Benchmark against a representative sample of your own queries and documents — published leaderboards rarely reflect performance on your specific corpus.
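One way to run such a benchmark is to measure recall@k over a small set of labeled queries. The sketch below assumes each query has exactly one known relevant document; `recall_at_k`, `vocab_embed`, and the sample data are hypothetical, and `vocab_embed` is a toy word-count function standing in for a real embedding model.

```python
import math
from typing import Callable, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def recall_at_k(embed: Callable[[str], Sequence[float]],
                docs: dict[str, str],
                labeled_queries: dict[str, str],
                k: int = 3) -> float:
    """Fraction of queries whose labeled document ID appears in the top k results."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in labeled_queries.items():
        query_vec = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(labeled_queries)

# Toy embedding: counts of a fixed vocabulary (assumption, for illustration only).
VOCAB = ["retry", "backoff", "webhook", "encryption", "rest", "keys"]

def vocab_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

docs = {"a": "retry backoff webhook", "b": "encryption at rest keys"}
labeled_queries = {"webhook retry": "a", "encryption keys": "b"}
score = recall_at_k(vocab_embed, docs, labeled_queries, k=1)
```

Swapping `vocab_embed` for each candidate model's embedding call and comparing the scores on the same labeled set gives a like-for-like comparison on your own corpus.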

Reranking and hybrid search

Vector similarity alone often surfaces results that are topically related but not actually useful for the query. Reranking passes the top candidates from an initial retrieval through a second, more precise relevance model, and usually improves the quality of what reaches the agent's context. Hybrid search combines vector similarity with traditional keyword search, which helps when queries contain exact terms (product codes, proper nouns, specific identifiers) that embeddings alone tend to blur.
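A common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), which the text above does not prescribe; it is shown here as one reasonable option. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The function name and the sample document IDs are hypothetical; k = 60 is the conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; IDs ranked well in multiple lists rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["d2", "d1", "d3"]    # from embedding similarity
keyword_hits = ["d1", "d4", "d2"]   # from exact-term search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
# fused == ["d1", "d2", "d4", "d3"]
```

The fused list is a natural input to a reranker: documents found by both retrievers come first, and an exact-match hit such as `d4` is not lost just because its embedding scored poorly.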

Retrieval tool definition

{
  "name": "knowledge_retrieval",
  "description": "Search the knowledge base on demand during reasoning.",
  "parameters": {
    "operation": {
      "type": "string",
      "enum": ["keyword_search", "semantic_search", "read_chunk"]
    },
    "query": { "type": "string" },
    "chunk_id": { "type": "string" }
  }
}
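On the harness side, a definition like this needs a dispatcher that validates each call and routes it by operation. The following is a minimal sketch under stated assumptions: `CHUNKS` is a hypothetical in-memory corpus, the response shapes are invented for illustration, and `semantic_search` uses word overlap as a stand-in for a real embedding lookup.

```python
from typing import Any

# Hypothetical in-memory corpus standing in for a real index.
CHUNKS: dict[str, str] = {
    "c1": "Webhook X retries failed deliveries with exponential backoff.",
    "c2": "Customer data is encrypted at rest with AES-256.",
}

def keyword_search(query: str) -> list[str]:
    """Return IDs of chunks containing the query as a case-insensitive substring."""
    q = query.lower()
    return [cid for cid, text in CHUNKS.items() if q in text.lower()]

def semantic_search(query: str) -> list[str]:
    """Stand-in for embedding search: rank chunks by number of shared words."""
    words = set(query.lower().split())
    scored = [(len(words & set(text.lower().split())), cid) for cid, text in CHUNKS.items()]
    return [cid for score, cid in sorted(scored, key=lambda s: (-s[0], s[1])) if score > 0]

def handle_call(args: dict[str, Any]) -> dict[str, Any]:
    """Validate one knowledge_retrieval call and route it to a handler."""
    op = args.get("operation")
    if op in ("keyword_search", "semantic_search"):
        query = args.get("query")
        if not query:
            return {"error": f"{op} requires a non-empty 'query'"}
        search = keyword_search if op == "keyword_search" else semantic_search
        return {"chunk_ids": search(query)}
    if op == "read_chunk":
        chunk_id = args.get("chunk_id")
        if chunk_id not in CHUNKS:
            return {"error": f"unknown chunk_id: {chunk_id!r}"}
        return {"chunk_id": chunk_id, "text": CHUNKS[chunk_id]}
    return {"error": f"unknown operation: {op!r}"}
```

Returning errors as data rather than raising lets the agent read the message and correct its next call, for example by supplying the `chunk_id` it forgot.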

Part II — When vector RAG fails

Vector RAG excels at local questions: "What is the retry policy for webhook X?" It fails at global sensemaking: "What are the main themes across our entire integration catalog?" or "How have security practices evolved across all runbooks?" These are query-focused summarization problems over the full corpus, and top-k similarity to a single query embedding cannot see the forest for the trees.

Recognize the failure mode early: if the correct answer requires synthesizing evidence distributed across hundreds of chunks with no single chunk scoring high on similarity, you need graph-level or community-level retrieval — not a larger k.

Part II — GraphRAG indexing pipeline

Microsoft's GraphRAG builds a knowledge graph from unstructured text: an LLM extracts entities and relationships, and the resulting graph is clustered into communities using the hierarchical Leiden community-detection algorithm. Each community receives a summary generated bottom-up: leaf communities first, then higher levels that incorporate their children's summaries. The result is a hierarchy of summaries that describes the corpus at multiple granularities before any user asks a question.
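The bottom-up summarization step can be sketched as a post-order traversal of the community hierarchy. This is an illustration of the idea only, not GraphRAG's actual API: the `Community` class and `summarize_bottom_up` are hypothetical, and `join_summaries` stands in for the LLM call that would write each community report.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Community:
    name: str
    texts: list[str] = field(default_factory=list)            # text units attached directly
    children: list["Community"] = field(default_factory=list)
    summary: str = ""

def summarize_bottom_up(node: Community,
                        summarize: Callable[[list[str]], str]) -> str:
    """Summarize children first, then fold their summaries into the parent's."""
    child_summaries = [summarize_bottom_up(child, summarize) for child in node.children]
    node.summary = summarize(node.texts + child_summaries)
    return node.summary

# Toy summarizer (assumption): a real pipeline would prompt an LLM here.
def join_summaries(parts: list[str]) -> str:
    return " | ".join(parts)

auth = Community("auth", texts=["Tokens expire hourly."])
storage = Community("storage", texts=["Data is encrypted at rest."])
security = Community("security", children=[auth, storage])
summarize_bottom_up(security, join_summaries)
# security.summary == "Tokens expire hourly. | Data is encrypted at rest."
```

Because every node stores its own summary, a query can later read the hierarchy at whichever level matches the breadth of the question.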

Indexing is expensive and belongs offline. Query-time cost depends on search mode: global search runs map-reduce over community reports; local search expands around specific entities; DRIFT combines a global primer with local refinement.

Part II — Three search modes

Local search — start from entities relevant to the query, pull neighboring graph nodes and associated text units. Best for specific factual lookups tied to named concepts.

Global search — select a community hierarchy level, retrieve all community reports at that level, generate partial answers in parallel (map), then synthesize (reduce). Best for thematic and comparative questions over the whole dataset.

DRIFT search — begin with global community context to frame the question, then branch into local searches for evidence. Best when the question has both a global frame ("overall risk posture") and local proof points ("this specific control").
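The map-reduce shape of global search can be sketched in a few lines. This is a simplified illustration under stated assumptions, not GraphRAG's implementation: `global_search`, `map_fn`, and `reduce_fn` are hypothetical names, and the two toy functions stand in for the LLM calls that would produce partial and final answers.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def global_search(query: str,
                  reports_by_level: dict[int, list[str]],
                  level: int,
                  map_fn: Callable[[str, str], str],
                  reduce_fn: Callable[[str, list[str]], str]) -> str:
    """Answer from every community report at one level, then synthesize."""
    reports = reports_by_level[level]
    with ThreadPoolExecutor() as pool:
        # Map: one partial answer per report, computed in parallel, order preserved.
        partials = list(pool.map(lambda report: map_fn(query, report), reports))
    useful = [p for p in partials if p]   # drop reports with nothing to contribute
    # Reduce: combine the surviving partial answers into one response.
    return reduce_fn(query, useful)

reports_by_level = {
    0: ["Webhook X retries with backoff.", "Connector Y stores keys in a vault."],
    1: ["Security: most integrations encrypt data at rest.",
        "Billing: invoices are issued monthly."],
}

def toy_map(query: str, report: str) -> str:
    return report if "encrypt" in report else ""

def toy_reduce(query: str, partials: list[str]) -> str:
    return " ".join(partials)

answer = global_search("Which themes relate to encryption?", reports_by_level, 1, toy_map, toy_reduce)
```

The cost of a global query scales with the number of reports at the chosen level, which is why picking the level is part of the query design.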

Part II — Designing retrieval tools for agents

Agentic RAG treats retrieval as a tool the model invokes during reasoning — not a fixed pre-step. LlamaIndex's agent patterns show the minimum viable surface: distinct operations for keyword search, semantic search, and reading a specific chunk by ID. The agent formulates queries after partial reasoning, evaluates whether results are sufficient, and may call again with refined queries.

Case study: A compliance agent answering "Which integrations lack encryption at rest?" failed with vanilla top-k RAG because the answer required scanning policy mentions across unrelated product docs. GraphRAG global search at mid-community level surfaced cross-cutting themes; the agent then used local search to cite specific integrations. Accuracy improved from 42% to 88% on a 50-question eval set.

Multi-hop retrieval tool — agentic pattern

{
  "name": "knowledge_retrieval",
  "parameters": {
    "operation": { "type": "string", "enum": ["keyword_search", "semantic_search", "read_chunk", "global_community_search"] },
    "query": { "type": "string" },
    "chunk_id": { "type": "string" },
    "community_level": { "type": "integer", "description": "0=leaf detail, higher=broader themes" }
  },
  "harness_rules": {
    "max_calls_per_turn": 4,
    "rerank_top_k": 8,
    "inject_max_tokens": 2000
  }
}
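One possible reading of the harness rules above, sketched in Python: cap the number of retrieval calls per turn, keep only the top reranked chunks, and stop injecting once a token budget is spent. The `Harness` class is hypothetical, and counting whitespace-separated words is a crude assumption standing in for a real tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Harness:
    max_calls_per_turn: int = 4
    rerank_top_k: int = 8
    inject_max_tokens: int = 2000
    calls_this_turn: int = 0

    def start_turn(self) -> None:
        """Reset the call counter at the start of each agent turn."""
        self.calls_this_turn = 0

    def admit(self, ranked_chunks: list[str]) -> list[str]:
        """Count one retrieval call and return the chunks allowed into context."""
        if self.calls_this_turn >= self.max_calls_per_turn:
            raise RuntimeError("retrieval budget for this turn is exhausted")
        self.calls_this_turn += 1
        kept: list[str] = []
        used = 0
        for chunk in ranked_chunks[: self.rerank_top_k]:
            cost = len(chunk.split())  # crude token estimate (assumption)
            if used + cost > self.inject_max_tokens:
                break                  # budget spent: drop this and all lower-ranked chunks
            kept.append(chunk)
            used += cost
        return kept

harness = Harness(max_calls_per_turn=2, rerank_top_k=2, inject_max_tokens=5)
admitted = harness.admit(["a b c", "d e f", "g"])
# admitted == ["a b c"]: "d e f" would exceed the 5-token budget, "g" is outside the top 2
```

Enforcing these limits in the harness rather than in the prompt means a model that over-retrieves degrades gracefully instead of flooding its own context.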