What is RAG

The LLM Knowledge Problem

LLMs have two fundamental knowledge flaws:

Knowledge cutoff: A model's knowledge stops at its training data cutoff date. A model trained through January 2024 knows nothing about February 2024. Ask about recent news and it'll either admit ignorance or confidently fabricate an answer.

Hallucination: Models confidently produce plausible but completely incorrect information. They're not "looking up" answers — they're "generating" the most probable text. Those are fundamentally different things.

RAG: Giving LLMs a Search Engine

RAG (Retrieval-Augmented Generation) is a simple idea:

Before the model answers, retrieve relevant information from an external knowledge base, stuff that information into the prompt, and have the model answer based on it.

An analogy: if an LLM is a knowledgeable expert who might misremember details, RAG is checking the reference materials before they answer.

How RAG Works

User asks a question
   ↓
Convert question to vector (Embedding)
   ↓
Search vector database for most relevant document chunks
   ↓
Send retrieved chunks + original question to LLM
   ↓
LLM generates answer based on provided context

A concrete example with a company knowledge base:

User: "What's our refund policy?"

1. Convert question to vector
2. Find most relevant docs in knowledge base:
   - "refund-policy.md" → "Customers can request a full refund within 30 days..."
   - "support-process.md" → "Refund review takes 3-5 business days..."

3. Send docs + question to LLM:
   "Answer the user's question based on these references.
    References: [refund policy docs...]
    Question: What's our refund policy?"

4. LLM generates based on docs:
   "Per company policy, customers can request a full refund within 30 days
    of purchase, with a review period of 3-5 business days."

RAG vs Alternatives

|                           | RAG                            | Fine-tuning                  | Long Context              |
|---------------------------|--------------------------------|------------------------------|---------------------------|
| Knowledge updates         | Real-time (just update the KB) | Requires retraining          | Real-time (put in context)|
| Implementation difficulty | Medium                         | High                         | Low                       |
| Cost                      | Vector DB + retrieval          | Training costs               | High token costs per query|
| Knowledge volume          | Very large (millions of docs)  | Limited (training data size) | Limited by context window |
| Source attribution        | Yes                            | No                           | Yes                       |
| Best for                  | KB Q&A, document search        | Style/format customization   | Small document analysis   |

When to Use RAG

  • Knowledge base is large (exceeds context window)
  • Knowledge needs frequent updates
  • You need source attribution (traceability)
  • Answering questions based on private data

When NOT to Use RAG

  • Questions don't need external knowledge (pure reasoning, code generation)
  • Document volume is small enough to fit in context directly
  • You need the model to change its behavior, not its knowledge (use fine-tuning)

Core Components of RAG

A RAG system has several key parts:

1. Document Ingestion

Process raw documents (PDFs, web pages, Markdown, etc.) into a retrievable format:

  • Parse documents
  • Split into chunks (Chunking)
  • Convert to vectors (Embedding)
  • Store in vector database
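The chunking step above can be sketched in a few lines. This is a minimal fixed-size splitter with character overlap; the function name and parameters are illustrative (production pipelines usually split on sentence or section boundaries instead of raw character counts):

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    Overlap keeps a sentence that straddles a boundary retrievable
    from at least one of the two neighboring chunks.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap parameter is the key design choice: too small and boundary sentences get cut in half; too large and the index fills with near-duplicates.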

2. Retrieval

Find the most relevant document chunks for a user query:

  • Convert query to vector
  • Search for similar vectors in the database
  • Optionally supplement with keyword search
  • Sort and filter results
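At its core, vector search is "compute a similarity score against every stored vector, return the top k." A brute-force sketch (real vector databases use approximate nearest-neighbor indexes to avoid scanning everything; the `search` signature here is hypothetical):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec: list[float], index: list[tuple[str, list[float]]], top_k: int = 3):
    """index: list of (chunk_text, vector) pairs. Returns (score, text) pairs."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Cosine similarity is the most common metric for text embeddings because it ignores vector magnitude and compares direction only.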

3. Generation

Send retrieved information and the user question to an LLM:

  • Design appropriate prompt templates
  • Handle "no relevant information found" cases
  • Have the model cite sources
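The three bullets above can live in a single prompt-builder. This is one possible template, not a canonical one; the chunk dict shape (`source`/`text` keys) is an assumption:

```python
REFUSAL = "I couldn't find relevant information in the knowledge base."

def build_prompt(query: str, chunks: list[dict]) -> str:
    """Build a grounded prompt; chunks are {"source": str, "text": str} dicts."""
    # Number each reference so the model can cite it as [1], [2], ...
    references = "\n\n".join(
        f'[{i + 1}] (source: {c["source"]})\n{c["text"]}'
        for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the references below.\n"
        f'If they do not contain the answer, reply exactly: "{REFUSAL}"\n'
        "Cite the references you used by their [number].\n\n"
        f"References:\n{references}\n\n"
        f"Question: {query}"
    )
```

Spelling out the refusal string matters: without an explicit escape hatch, models tend to answer from parametric memory when retrieval comes back empty or irrelevant.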

Subsequent chapters will dive deep into each component.

A Minimal RAG Example

Core RAG logic in pseudocode:

# 1. Preparation: process documents
documents = load_documents("./knowledge_base/")
chunks = split_into_chunks(documents, chunk_size=500)
embeddings = embed(chunks)
vector_db.store(chunks, embeddings)

# 2. Query phase
query = "What's our refund policy?"
query_embedding = embed(query)
relevant_chunks = vector_db.search(query_embedding, top_k=3)

# 3. Generation phase
prompt = f"""Answer the question based on the references below.
If the references don't contain relevant info, say "I'm not sure."

References:
{format_chunks(relevant_chunks)}

Question: {query}"""

answer = llm.generate(prompt)

That's the entire core logic of RAG. Of course, each step has many details and optimization opportunities — that's what the following chapters cover.
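To make the pseudocode above runnable without any external services, here is a toy end-to-end version. It substitutes word-overlap (Jaccard) similarity for real embeddings and stops short of the actual LLM call; both substitutions are placeholders for the real components, and the knowledge-base strings are made up for the example:

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a crude stand-in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

knowledge_base = [
    "Customers can request a full refund within 30 days of purchase.",
    "Refund review takes 3-5 business days.",
    "Shipping is free on orders over $50.",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k docs most similar to the query."""
    return sorted(docs, key=lambda d: jaccard(query, d), reverse=True)[:top_k]

query = "What's our refund policy?"
context = retrieve(query, knowledge_base)
prompt = (
    "Answer based on these references:\n"
    + "\n".join(context)
    + f"\n\nQuestion: {query}"
)
# In a real system the final step would be: answer = llm.generate(prompt)
```

Even this crude lexical retriever pulls the two refund documents ahead of the shipping one, which is the whole point: retrieval narrows millions of documents down to the few the model actually needs.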

Key Takeaways

  1. RAG = Retrieval + Generation. Find relevant information first, then have the LLM answer based on it. Solves knowledge cutoff and hallucination problems.
  2. RAG suits scenarios with large, frequently updated knowledge bases that need source attribution. For small document volumes, just use long context.
  3. Three core components: document ingestion, retrieval, generation. Each affects the final quality.
  4. RAG and fine-tuning solve different problems. RAG supplements knowledge, fine-tuning changes behavior. They can be combined.