What Is RAG (Retrieval-Augmented Generation)?

RAG (Retrieval-Augmented Generation) is a technique that combines large language model text generation with retrieval from an external knowledge base. The model fetches relevant documents at query time and uses them as context for generation, allowing it to access specific information not present in its training data.

Frequently Asked Questions

Why is RAG useful?

RAG allows AI to answer questions about specific documents, internal company knowledge, recent information, or proprietary data without needing to retrain the model. It also reduces hallucinations because the model can cite specific source documents. It is the standard pattern for "chat with your documents" applications.

How does RAG work?

Three steps: (1) Retrieval — given a user query, the system searches a vector database for the most relevant documents; (2) Augmentation — the retrieved documents are added to the model context along with the user query; (3) Generation — the model produces an answer grounded in the retrieved context.
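The three steps can be sketched in a few lines of Python. This is a toy illustration, not a real system: the retriever is a keyword-overlap scorer standing in for a vector database, and the generation step is a stub where a real LLM API call would go.

```python
# Toy sketch of the three RAG steps: retrieve -> augment -> generate.
# The retriever and the "model" are stand-ins, not real APIs.

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Step 1: rank documents by word overlap with the query (toy scorer)."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query: str, retrieved: list[str]) -> str:
    """Step 2: place the retrieved documents into the prompt as context."""
    context = "\n".join(f"- {doc}" for doc in retrieved)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Answer using only the context above."
    )

def generate(prompt: str) -> str:
    """Step 3: call the LLM (stubbed here; swap in a real API client)."""
    return f"[model answer grounded in a prompt of {len(prompt)} chars]"

docs = [
    "The refund policy allows returns within 30 days.",
    "Shipping is free on orders over $50.",
    "Support is available Monday through Friday.",
]
query = "What is the refund policy?"
prompt = augment(query, retrieve(query, docs))
answer = generate(prompt)
```

In production the `retrieve` step is an embedding lookup in a vector database and `generate` is a chat-completion call, but the data flow is exactly this shape.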

What is a vector database?

A vector database stores documents as numerical vectors (embeddings) that represent semantic meaning. Queries are also embedded and compared against stored vectors to find similar documents. Examples: Pinecone, Weaviate, Chroma, Qdrant.
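At its core, a vector database reduces to "embed everything, then compare by similarity". The sketch below uses hand-made 3-dimensional vectors and cosine similarity; real systems use learned embedding models with hundreds or thousands of dimensions and approximate nearest-neighbor indexes, so treat the numbers here as placeholders.

```python
# Minimal vector search: compare a query embedding against stored
# document embeddings by cosine similarity. The 3-d vectors are
# hand-made toys; real embeddings come from an embedding model.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Stored documents with precomputed embeddings.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping rates": [0.1, 0.8, 0.2],
    "support hours": [0.0, 0.2, 0.9],
}

# Pretend this is the embedding of the query "how do returns work?".
query_embedding = [0.85, 0.15, 0.05]
best = max(index, key=lambda doc: cosine_similarity(query_embedding, index[doc]))
print(best)  # -> refund policy
```

Note that "how do returns work?" shares no keywords with "refund policy"; the match works because the embeddings encode meaning, which is exactly what keyword search cannot do.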

When should I NOT use RAG?

When the data is small enough to fit directly in the model's context window. Modern long-context models (e.g., Gemini 2.5 Pro with a 1M-token context window) can handle documents of 200-1,000 pages directly. For small knowledge bases, direct context inclusion typically outperforms RAG. RAG matters when you have hundreds of thousands of documents.

Is RAG still relevant with long-context models?

Yes, but the threshold has shifted. RAG is now most useful for: (1) very large knowledge bases (millions of documents), (2) frequently updated data, (3) cost-sensitive applications where putting everything in context is too expensive, (4) hybrid systems that need both retrieval and generation.

What is the difference between RAG and fine-tuning?

Fine-tuning permanently changes model weights based on training data — best for teaching the model a domain or style. RAG provides knowledge dynamically at query time without changing the model — best for factual lookup. They are complementary; production systems often use both.
