Retrieval-Augmented Generation
Letting an agent fetch knowledge on demand instead of carrying it all in the prompt.
Why retrieval exists
No system prompt, however long, can contain everything an agent might need to know. RAG solves this by storing knowledge externally — in a vector database, a search index, or a structured store — and retrieving only the relevant slice at the moment it's needed. This keeps the context window lean while still giving the agent access to a knowledge base that may be orders of magnitude larger than anything that could fit in a single prompt.
From pipeline to tool call
Early RAG implementations treated retrieval as a fixed pipeline step: embed the query, search the vector store, inject the top-k results into the prompt, generate. This works for simple lookup tasks but breaks down for multi-step reasoning, where the right query isn't known until partway through the reasoning process. The more robust pattern treats retrieval as a tool the agent calls explicitly, as many times as needed, with queries it formulates itself based on what it has already learned — turning retrieval from a static pre-processing step into an active, iterative part of the reasoning loop.
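To make the difference concrete, here is a minimal sketch of retrieval as a tool inside a reasoning loop. Everything in it is an assumption for illustration: the two-document corpus, the naive keyword `search` function, and `scripted_model`, which stands in for a real LLM so the example runs without one. The point is the control flow: the second query only becomes knowable after the first result has been read.

```python
from typing import Callable

# Toy corpus standing in for an external knowledge base (hypothetical data).
DOCS = {
    "d1": "Webhook X retries failed deliveries using the shared retry policy.",
    "d2": "The shared retry policy allows five attempts with exponential backoff.",
}

def search(query: str) -> list[str]:
    """Naive keyword retrieval: return documents sharing at least one query word."""
    words = set(query.lower().split())
    return [
        text for text in DOCS.values()
        if words & set(text.lower().rstrip(".").split())
    ]

def scripted_model(question: str, notes: list[str]) -> tuple[str, str]:
    """Stand-in for an LLM: picks the next action from what it has read so far."""
    seen = " ".join(notes)
    if not notes:
        return ("search", "webhook x")
    if "attempts" not in seen:
        # The first result pointed at a shared policy, so follow up on it.
        return ("search", "attempts backoff")
    return ("answer", "Webhook X retries up to five times with exponential backoff.")

def run_agent(
    question: str,
    model: Callable[[str, list[str]], tuple[str, str]],
    max_calls: int = 4,
) -> tuple[str, list[str]]:
    """Loop until the model answers or the retrieval budget runs out."""
    notes: list[str] = []
    queries: list[str] = []
    for _ in range(max_calls):
        action, payload = model(question, notes)
        if action == "answer":
            return payload, queries
        queries.append(payload)
        notes.extend(search(payload))
    return "No answer within the call budget.", queries

answer, queries = run_agent("How many times does webhook X retry?", scripted_model)
print(queries)
print(answer)
```

A fixed pipeline would have issued only the first query and generated from a single chunk that never mentions the number of attempts.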
Chunking strategy
How a document is split before embedding directly determines retrieval quality. Fixed-size chunking is simple and predictable but frequently splits a coherent idea across two chunks, weakening both. Recursive chunking respects document structure — paragraphs, sections, headers — and tends to produce more semantically coherent pieces. Semantic chunking goes further, using the content itself (rather than fixed boundaries) to decide where one idea ends and another begins, at the cost of more preprocessing complexity. The right choice depends on document type: dense technical documentation usually benefits from structure-aware chunking, while conversational or unstructured text may tolerate simpler fixed-size approaches.
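The contrast between fixed-size and structure-aware splitting can be sketched in a few lines. This is a simplified illustration, not a production splitter: the function names, the separator order (paragraph break, sentence end, space), and the character-based length limit are all assumptions made for the example.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, regardless of what is being cut."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def recursive_chunks(
    text: str,
    max_len: int,
    seps: tuple[str, ...] = ("\n\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator first; use finer ones only for oversize pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not seps:
        # No structural boundary left to try: fall back to a hard cut.
        return fixed_size_chunks(text, max_len)
    head, rest = seps[0], seps[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(head):
        candidate = piece if not current else current + head + piece
        if len(candidate) <= max_len:
            current = candidate  # keep packing pieces into the same chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, max_len, rest))
    if current:
        chunks.append(current)
    return chunks

doc = "Retries use exponential backoff.\n\nWebhooks are signed with HMAC."
fixed = fixed_size_chunks(doc, 40)
structured = recursive_chunks(doc, 40)
print(fixed)
print(structured)
```

With a 40-character limit, the fixed-size splitter cuts through the word "Webhooks", while the recursive splitter keeps each paragraph whole.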
Embedding model selection
Not all embedding models are interchangeable. Key tradeoffs include dimensionality (higher dimensions can capture more nuance but cost more to store and search), domain fit (a general-purpose embedding model may underperform a domain-tuned one on specialized technical or legal text), latency and cost per embedding call, and whether the model supports the languages your content actually uses. Benchmark against a representative sample of your own queries and documents — published leaderboards rarely reflect performance on your specific corpus.
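One way to run such a benchmark is to measure recall@k over a small set of labeled query/document pairs. The sketch below assumes an `embed` callable per candidate model; `word_model` and `long_word_model` are toy stand-ins invented for this example (sparse bag-of-words vectors), not real embedding models, and the documents and labels are made up.

```python
import math
from collections import Counter
from typing import Callable

Vector = dict[str, float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(term, 0.0) for term, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def recall_at_k(
    embed: Callable[[str], Vector],
    docs: dict[str, str],
    labeled: list[tuple[str, str]],
    k: int,
) -> float:
    """Fraction of queries whose known-relevant document appears in the top k."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, relevant_id in labeled:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        hits += relevant_id in ranked[:k]
    return hits / len(labeled)

# Toy stand-ins for two candidate embedding models (assumptions for the sketch).
def word_model(text: str) -> Vector:
    return Counter(text.lower().split())

def long_word_model(text: str) -> Vector:
    # Deliberately lossy: ignores every word of 8 characters or fewer.
    return Counter(w for w in text.lower().split() if len(w) > 8)

docs = {
    "a": "refund policy for enterprise plans",
    "b": "rotating api keys safely",
}
labeled = [("how do refunds work for enterprise", "a"), ("api keys rotation", "b")]

scores = {
    "word_model": recall_at_k(word_model, docs, labeled, k=1),
    "long_word_model": recall_at_k(long_word_model, docs, labeled, k=1),
}
print(scores)
```

In practice `embed` would wrap a call to each candidate model, and the labeled pairs would come from real user queries matched to the documents that answered them.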
Reranking and hybrid search
Vector similarity alone often surfaces results that are topically related but not actually useful for the query. Reranking — passing the top candidates from an initial retrieval through a second, more precise relevance model — consistently improves the quality of what actually reaches the agent's context. Hybrid search, combining vector similarity with traditional keyword search, helps when queries contain exact terms (product codes, proper nouns, specific identifiers) that embeddings alone tend to blur.
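A common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. RRF is a choice made for this sketch; the text above does not prescribe a particular fusion method. The document IDs are invented, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document scores the sum of 1 / (k + rank) over lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results for a query containing the exact product code "SKU-4417".
vector_hits = ["faq-general", "sku-4417-manual", "pricing"]
keyword_hits = ["sku-4417-manual", "changelog"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

The manual that both retrievers agree on rises to the top even though vector search alone ranked a generic FAQ first. A reranker would then rescore only the top few fused candidates before anything is injected into the context.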
{
  "name": "knowledge_retrieval",
  "description": "Search the knowledge base on demand during reasoning.",
  "parameters": {
    "operation": { "type": "string", "enum": ["keyword_search", "semantic_search", "read_chunk"] },
    "query": { "type": "string" },
    "chunk_id": { "type": "string" }
  }
}
Part II — When vector RAG fails
Vector RAG excels at local questions: "What is the retry policy for webhook X?" It fails at global sensemaking: "What are the main themes across our entire integration catalog?" or "How have security practices evolved across all runbooks?" These are query-focused summarization problems over the full corpus: top-k similarity to a single query embedding cannot see the forest for the trees.
Recognize the failure mode early: if the correct answer requires synthesizing evidence distributed across hundreds of chunks with no single chunk scoring high on similarity, you need graph-level or community-level retrieval — not a larger k.
Part II — GraphRAG indexing pipeline
Microsoft's GraphRAG builds a knowledge graph from unstructured text: an LLM extracts entities and relationships, which are then clustered into communities using hierarchical Leiden community detection. Each community receives a bottom-up summary: leaf communities first, then higher levels that incorporate their children's summaries. The result is a tree of summaries that describes the corpus at multiple granularities before any user asks a question.
Indexing is expensive and belongs offline. Query-time cost depends on search mode: global search runs map-reduce over community reports; local search expands around specific entities; DRIFT combines a global primer with local refinement.
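The bottom-up summarization step can be sketched as a post-order walk over the community tree. This is an illustration of the data flow only, not the GraphRAG library's API: the `Community` class, its field names, and `stub_summarize` (which joins strings where a real pipeline would call an LLM) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Community:
    name: str
    entity_notes: list[str] = field(default_factory=list)  # evidence attached at this level
    children: list["Community"] = field(default_factory=list)
    summary: str = ""

def stub_summarize(parts: list[str]) -> str:
    """Placeholder for an LLM call: joins the inputs so the data flow stays visible."""
    return " | ".join(parts)

def summarize_bottom_up(
    node: Community,
    summarize: Callable[[list[str]], str] = stub_summarize,
) -> str:
    """Summarize children first, then build this node's summary from theirs."""
    child_summaries = [summarize_bottom_up(child, summarize) for child in node.children]
    node.summary = summarize(node.entity_notes + child_summaries)
    return node.summary

# Hypothetical two-level hierarchy.
auth = Community("auth", entity_notes=["OAuth tokens rotate daily"])
storage = Community("storage", entity_notes=["Buckets encrypt at rest"])
root = Community("security", children=[auth, storage])

summarize_bottom_up(root)
print(auth.summary)
print(root.summary)
```

Because every node stores its own summary, the whole tree can be computed once offline and read at any level at query time.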
Part II — Three search modes
Local search — start from entities relevant to the query, pull neighboring graph nodes and associated text units. Best for specific factual lookups tied to named concepts.
Global search — select a community hierarchy level, retrieve all community reports at that level, generate partial answers in parallel (map), then synthesize (reduce). Best for thematic and comparative questions over the whole dataset.
DRIFT search — begin with global community context to frame the question, then branch into local searches for evidence. Best when the question has both a global frame ("overall risk posture") and local proof points ("this specific control").
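Of the three modes, global search has the most distinctive shape: select one level, map over every report at that level in parallel, then reduce. The sketch below shows that shape under stated assumptions. The report data is invented, `keyword_map` and `join_reduce` are stubs where a real system would call an LLM, and the level numbering follows the tool schema later in this section (0 = leaf detail), which may differ from the numbering your GraphRAG version uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical precomputed community reports.
REPORTS = [
    {"level": 0, "text": "Webhook X retry policy"},
    {"level": 1, "text": "Payments integrations encrypt data at rest"},
    {"level": 1, "text": "Billing integrations support invoicing"},
    {"level": 1, "text": "Messaging integrations lack encryption at rest"},
]

def keyword_map(query: str, report: str) -> str:
    """Map stub: keep a report if it mentions any query word longer than 3 letters."""
    keywords = [w for w in query.lower().split() if len(w) > 3]
    return report if any(w in report.lower() for w in keywords) else ""

def join_reduce(query: str, partials: list[str]) -> str:
    """Reduce stub: concatenate the partial answers that survived the map step."""
    return "; ".join(partials)

def global_search(
    query: str,
    reports: list[dict],
    level: int,
    map_fn: Callable[[str, str], str] = keyword_map,
    reduce_fn: Callable[[str, list[str]], str] = join_reduce,
) -> str:
    selected = [r["text"] for r in reports if r["level"] == level]
    # Map: one partial answer per community report, computed in parallel.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda text: map_fn(query, text), selected))
    # Reduce: synthesize only the non-empty partial answers.
    return reduce_fn(query, [p for p in partials if p])

result = global_search("encryption at rest", REPORTS, level=1)
print(result)
```

The query touches every report at the chosen level, which is why cost grows with the number of communities rather than with k.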
Part II — Designing retrieval tools for agents
Agentic RAG treats retrieval as a tool the model invokes during reasoning — not a fixed pre-step. LlamaIndex's agent patterns show the minimum viable surface: distinct operations for keyword search, semantic search, and reading a specific chunk by ID. The agent formulates queries after partial reasoning, evaluates whether results are sufficient, and may call again with refined queries.
Case study: A compliance agent answering "Which integrations lack encryption at rest?" failed with vanilla top-k RAG because the answer required scanning policy mentions across unrelated product docs. GraphRAG global search at mid-community level surfaced cross-cutting themes; the agent then used local search to cite specific integrations. Accuracy improved from 42% to 88% on a 50-question eval set.
{
  "name": "knowledge_retrieval",
  "parameters": {
    "operation": { "enum": ["keyword_search", "semantic_search", "read_chunk", "global_community_search"] },
    "query": { "type": "string" },
    "chunk_id": { "type": "string" },
    "community_level": { "type": "integer", "description": "0=leaf detail, higher=broader themes" }
  },
  "harness_rules": { "max_calls_per_turn": 4, "rerank_top_k": 8, "inject_max_tokens": 2000 }
}
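The `harness_rules` in the schema imply enforcement code that sits between the model and the retrieval backends. Here is one possible sketch. The `RetrievalHarness` class is hypothetical, backends are assumed to return results already ranked best-first, `rerank_top_k` is interpreted as "keep this many results", and whitespace-separated words stand in for real tokenizer counts.

```python
from typing import Callable

HARNESS_RULES = {"max_calls_per_turn": 4, "rerank_top_k": 8, "inject_max_tokens": 2000}

class RetrievalHarness:
    """Dispatches tool calls and enforces per-turn and per-call limits."""

    def __init__(self, backends: dict[str, Callable[..., list[str]]], rules: dict = HARNESS_RULES):
        self.backends = backends
        self.rules = rules
        self.calls_this_turn = 0

    def start_turn(self) -> None:
        self.calls_this_turn = 0

    def call(self, operation: str, **args) -> str:
        if self.calls_this_turn >= self.rules["max_calls_per_turn"]:
            return "ERROR: retrieval budget for this turn is exhausted."
        if operation not in self.backends:
            return f"ERROR: unknown operation {operation!r}."
        self.calls_this_turn += 1
        # Assumption: the backend returns results ranked best-first.
        results = self.backends[operation](**args)[: self.rules["rerank_top_k"]]
        return "\n".join(self._fit_budget(results))

    def _fit_budget(self, results: list[str]) -> list[str]:
        """Keep whole results until the injection budget would be exceeded."""
        kept, used = [], 0
        for text in results:
            cost = len(text.split())  # rough proxy for a token count
            if used + cost > self.rules["inject_max_tokens"]:
                break
            kept.append(text)
            used += cost
        return kept

# Tiny limits so the effect of each rule is visible.
backends = {"keyword_search": lambda query: [f"hit for {query}"] * 20}
harness = RetrievalHarness(
    backends, {"max_calls_per_turn": 2, "rerank_top_k": 3, "inject_max_tokens": 7}
)
first = harness.call("keyword_search", query="sku")
second = harness.call("keyword_search", query="sku")
third = harness.call("keyword_search", query="sku")
print(first)
print(third)
```

Returning an error string instead of raising lets the model see that its budget is spent and answer from what it already has.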
Further reading
These readings trace the transition from passive RAG (where retrieval runs before the LLM call via a fixed semantic search pipeline) to agentic RAG, where the model decides whether it needs to search, what to search for, and whether the result is satisfactory before responding.
- GraphRAG: Unlocking LLM Discovery (Microsoft Research) — The current vanguard of RAG. Instead of just vectorizing chunks of text, GraphRAG builds hierarchical knowledge graphs, solving the problem of global questions that vector RAGs fail to answer.
- LlamaIndex: Agentic RAG Concepts — An in-depth guide on how to model data retrieval as tools that the agent invokes interactively (e.g., search_docs, summarize_table), rather than as chunks forcibly pushed into the prompt.