RAG Architecture Explained from Scratch (No Frameworks, Diagram & Code)

How to Explain RAG Architecture in an Interview (With Diagram & Code)

Last month, three different companies asked me the same interview question: "How do we stop our AI from making things up?" My answer was always the same three letters: R-A-G. Candidates who could clearly explain RAG architecture received offers almost 50% faster than those who couldn't.

Introduction: Why Do Interviewers Ask About RAG?

In Generative AI interviews today, one question comes up again and again:

"Can you explain RAG architecture, and why do we need it?"

If a candidate talks only about LLMs, interviewers quickly push further:

  • How do you handle hallucinations?
  • What about outdated knowledge?
  • How do you use enterprise or private data safely?

This is where Retrieval-Augmented Generation (RAG) becomes essential.

RAG directly addresses all of these problems—and that's why interviewers care so much about it.

Learn RAG Fundamentals — Manual Implementation Without Frameworks

What Is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation (RAG) is a technique that combines external knowledge sources with Large Language Models (LLMs) to produce more accurate, grounded, and reliable responses.

Instead of relying only on what the model learned during training, RAG allows the model to look up relevant information at query time.

Simple Definition

RAG is an architecture where an LLM first retrieves relevant information from external data sources and then uses that information to generate a response.

This one sentence alone is often enough to pass the first round of GenAI interviews.

Problem with Normal LLMs (Why RAG Is Needed)

Traditional LLMs work only on pre-trained knowledge, which leads to several limitations:

Issue                      Normal LLM
Hallucinations             Yes
Real-time / updated data   No
Private PDFs & documents   No
Enterprise data usage      Risky

👉 RAG addresses all of these limitations by grounding responses in retrieved data.

RAG Architecture 

RAG Architecture Diagram

[Diagram: Data Sources → Embedding Model → Vector Database → Retriever → LLM (Generator)]

The diagram shows a standard, production-grade RAG pipeline. Let's break it down step by step.

Core Components of RAG

  1. Data Source (PDFs, Docs, URLs)
  2. Embedding Model
  3. Vector Database
  4. Retriever
  5. LLM (Generator)

RAG Has Two Mandatory Phases

Phase 1: Indexing (Offline / One-Time)

  • Load documents from various sources
  • Split documents into chunks
  • Convert chunks to embeddings
  • Store embeddings in vector database

Phase 2: Retrieval + Generation (Runtime)

  • User query comes in
  • Convert query to embedding
  • Retrieve relevant chunks from vector DB
  • Generate response using LLM with retrieved context
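
Before going deeper, here is a minimal, self-contained sketch of both phases in one place. It is a toy illustration only (the model name and sample strings are placeholders, and the generation step is merely indicated); each step is covered properly below.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Phase 1: Indexing (offline) - in a real system the chunks come from your documents
chunks = ["Q4 revenue was $250 million.", "Refunds are allowed within 30 days."]
index = model.encode(chunks, normalize_embeddings=True)

# Phase 2: Retrieval + generation (runtime) - generation itself is omitted here
query_vec = model.encode("What was the Q4 revenue?", normalize_embeddings=True)
best = int(np.argmax(index @ query_vec))   # cosine similarity on normalized vectors
print(chunks[best])                        # this chunk would be passed to the LLM as context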

Phase 1: Indexing (Offline)

We'll go through all four steps in detail without using a framework; building them by hand also makes it clear why the industry typically relies on frameworks for this work.

  1. Load documents
  2. Split into chunks
  3. Convert chunks → embeddings
  4. Store embeddings in Vector DB

Without Phase-1, RAG does NOT exist

1️⃣ Load Documents (PDF, TXT, Multiple Files)

Convert raw files (PDFs, text files, etc.) into:

  • Clean, usable text
  • Structured metadata

This step is the foundation of the RAG indexing phase.

If document loading is weak, everything downstream (chunking, retrieval, generation) breaks.

Challenges Interviewers Expect You to Mention

When loading documents for RAG, interviewers look for awareness of real-world issues:

  • Encoding issues (UTF-8 vs other encodings)
  • Page separation (especially in PDFs)
  • Metadata preservation (source, page number, department, year)
  • Multiple file formats (PDF, TXT, DOC, HTML)

Mentioning these signals production-level thinking, not tutorial-level knowledge.

Example: Loading PDF and TXT Files (Manual Approach)

Below is a framework-free, manual implementation for loading PDF and TXT documents while preserving metadata.

import os
from typing import List, Dict

import PyPDF2

def load_documents(folder_path: str) -> List[Dict]:
    """
    Load PDF and TXT documents.
    Returns list of dicts with text + metadata
    """
    documents = []

    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)

        if file_name.endswith(".txt"):
            with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                text = f.read()

            documents.append({
                "text": text,
                "metadata": {
                    "source": file_name,
                    "type": "txt"
                }
            })

        elif file_name.endswith(".pdf"):
            with open(file_path, "rb") as f:
                reader = PyPDF2.PdfReader(f)

                for page_num, page in enumerate(reader.pages):
                    text = page.extract_text() or ""

                    documents.append({
                        "text": text,
                        "metadata": {
                            "source": file_name,
                            "page": page_num,
                            "type": "pdf"
                        }
                    })

    return documents

Usage & Output

docs = load_documents("./documents")

print(len(docs))
print(docs[0]["metadata"])

Output:

12
{'source': 'company_policy.pdf', 'page': 0, 'type': 'pdf'}

Each page becomes a separate document, which significantly improves:

  • Chunking accuracy
  • Retrieval precision
  • Answer traceability

Why Metadata Is NOT Optional

Metadata is critical in real-world RAG systems. Without metadata:

  • No filtering by department, year, or source
  • No document-level traceability
  • No explainability for answers
  • Hard to debug incorrect responses

With metadata, you can:

  • Filter documents (department = finance)
  • Restrict search (year = 2024)
  • Show sources in answers (Based on company_policy.pdf, page 3)
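
For example, a metadata filter is just a pre-selection step before similarity search. A small sketch (the department and year fields are illustrative assumptions; the loader above only stores source, page, and type):

# Keep only finance documents from 2024 before running similarity search
finance_2024 = [
    doc for doc in docs
    if doc["metadata"].get("department") == "finance"
    and doc["metadata"].get("year") == 2024
]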

👉 Enterprise RAG without metadata is incomplete and unsafe.

2️⃣ Split Text Into Chunks (Critical Step)

Chunking is one of the most important and most misunderstood steps in Retrieval-Augmented Generation. If chunking is wrong:

  • Retrieval quality drops
  • Answers become incomplete
  • Hallucinations increase

That's why interviewers pay close attention to how you chunk data.

🤔 Why Chunking Exists:

LLMs cannot process entire documents at once. They:

  • Have context window limits
  • Perform better on complete semantic units
  • Fail when information is randomly split

Chunking solves this by breaking documents into retrievable, meaningful units.

Naive Chunking (The Wrong Way)

Naive chunking splits text into fixed-size pieces without preserving meaning or structure.

def naive_chunk(text, size=500):
    return [text[i:i+size] for i in range(0, len(text), size)]

This approach is easy, but problematic.
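
A quick demo makes the failure visible (the string and chunk size are illustrative):

text = "The refund policy allows returns within 30 days. Items must be unused."
print(naive_chunk(text, size=40))
# The fixed-size slices split the first sentence across two chunks,
# so neither chunk carries the complete rule.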

Why It's the "Wrong Way" in RAG

1. Context Gets Broken: A chunk may

  • Start in the middle of a sentence
  • End before an important conclusion

This confuses embeddings and reduces retrieval quality.

2. Important Information Is Split: Key ideas often span:

  • Multiple sentences
  • Entire paragraphs

Naive chunking separates them, producing partial answers.

3. No Logical Boundaries: Sections like

  • Definitions
  • Policies
  • Procedures

get mixed together randomly, making retrieval unreliable.

4. Poor Retrieval Results: The vector database may return

  • Half answers
  • Irrelevant fragments
  • Chunks missing critical context

This directly leads to hallucinations or vague responses.

Why Interviewers Call It "Naive":

Interviewers use this term because naive chunking shows lack of document understanding, ignores how LLMs reason, and is usually the first beginner mistake.

👉 Knowing why it fails is more important than knowing how to code it

When Naive Chunking Is Acceptable (Rare Cases):

Naive chunking may be acceptable only for:

  • Very short documents
  • Highly structured logs
  • Quick prototypes (never production)

Smart Chunking (The Right Way)

Smart chunking splits documents along semantic and structural boundaries, uses overlap, and ensures each chunk is meaningful on its own.

Goal of Smart Chunking:

Create chunks that:

  • Preserve meaning
  • Respect document structure
  • Are easy to retrieve
  • Fit within model limits

Each chunk should stand alone during retrieval.

What Smart Chunking Means:

Smart chunking focuses on:

  • Semantic boundaries (sentences, paragraphs, sections)
  • Logical completeness (a chunk answers something useful)
  • Controlled size (model-aware)
  • Overlap (to preserve context)

Key Principles of Smart Chunking

1. Chunk by Meaning, Not Just Size

Chunks should represent a complete thought, such as:

  • A policy rule
  • A definition
  • A procedure step
  • A paragraph or section

Avoid breaking sentences or ideas mid-way.

2. Respect Document Structure:

Use natural boundaries like:

  • Paragraph breaks
  • Headings
  • Page boundaries (especially for PDFs)

This significantly improves retrieval coherence.

3. Use Overlap to Preserve Context:

Overlap helps when:

  • Information spans chunk boundaries
  • Follow-up sentences depend on earlier ones

Best practice: 10–20% of chunk size

4. Keep Chunk Size Model-Aware

There is no single perfect chunk size. Common industry practice:

300–800 tokens per chunk

Chunk size depends on:

  • Embedding model limits
  • LLM context window
  • Retrieval strategy (top-K, reranking, hybrid search)

👉 Chunking must be model-aware, not arbitrary.
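
To keep chunks model-aware, measure them in tokens rather than characters. A small sketch assuming the tiktoken package (the encoding name is an assumption and should match your embedding model or LLM):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by many recent OpenAI models

def token_count(text: str) -> int:
    return len(enc.encode(text))

print(token_count("Q4 revenue for the financial year was reported as $250 million."))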

Example: Smart Chunking by Sentence Boundaries with Overlap

import re

def smart_chunk(
    text: str,
    chunk_size: int = 500,
    overlap: int = 100
):
    # Split on sentence boundaries (after ., ! or ?) so chunks keep complete thoughts
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += " " + sentence
        else:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())

            # Overlap handling: carry the last `overlap` characters
            # of the previous chunk into the next one
            current_chunk = current_chunk[-overlap:] + " " + sentence

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks

Apply Chunking to All Documents

all_chunks = []

for doc in docs:
    chunks = smart_chunk(doc["text"])
    
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "chunk_text": chunk,
            "metadata": {
                **doc["metadata"],
                "chunk_id": i
            }
        })

print(len(all_chunks))

Why This Works Better in RAG

Smart chunking is not just a preprocessing step—it directly determines retrieval quality, answer accuracy, and system reliability. Here's why it works better in real RAG systems.

1. Better Retrieval

When chunks contain complete ideas:

  • Each chunk represents a clear semantic unit
  • Embeddings capture the true meaning, not fragments
  • Similarity search becomes more accurate

As a result:

  • The vector database retrieves relevant chunks
  • Fewer irrelevant or partial matches are returned

👉 Good chunks = meaningful embeddings = better retrieval

2. Better Answers

With smart chunking:

  • Retrieved context is coherent and self-contained
  • LLMs don't need to "guess" missing information
  • Answers are more grounded in facts

This leads to:

  • Fewer hallucinations
  • More precise and complete responses
  • Better alignment with source documents

👉 LLMs perform best when context is logically complete

3. Easier Debugging (Very Important in Production)

Smart chunks make systems observable and debuggable.

You can:

  • Inspect a retrieved chunk
  • Clearly understand why it matched the query
  • Trace poor answers back to specific chunking issues

This is critical for:

  • Production troubleshooting
  • Retrieval tuning
  • Explaining system behavior to stakeholders

👉 If you can't explain why a chunk was retrieved, your RAG system is fragile.

3️⃣ Convert Chunks → Embeddings (From Scratch)

If RAG is the system, then embeddings are its foundation.

In fact, most RAG interview failures happen not because candidates don't know vector databases or LLMs — but because they don't truly understand embeddings.

Simple Definition

An embedding model converts text into a numerical vector that captures the semantic meaning of the text.

One-liner (memorize this)

Embeddings allow machines to compare meaning, not just words.

If you can confidently explain this, you've already cleared a major interview hurdle.

What Is an Embedding?

An embedding is:

  • A numerical vector
  • That represents the semantic meaning of text

In RAG systems:

  • Documents are converted into embeddings
  • User queries are converted into embeddings
  • Both use the same embedding model

This enables semantic search, not keyword matching.

🤔 Why Do We Even Need Embeddings?

Computers do not understand text the way humans do.

They understand numbers. So we convert:

Text → Numbers (Vectors)

Only then can a system:

  • Compare similarity
  • Search documents
  • Retrieve relevant knowledge

Without embeddings, RAG cannot exist.

Intuition: A Human Analogy

Think of embeddings like GPS coordinates for meaning.

Text                  Position (Vector Space)
What is Q4 revenue?   (1.2, 0.9, …)
Q4 sales amount       (1.1, 0.88, …)
How to cook pasta?    (-2.4, 3.1, …)

👉 Similar meaning → vectors are close together

What Does an Embedding Look Like?

An embedding is simply a list of numbers:

[0.021, -0.87, 0.44, 0.19, ..., 0.33]

Typical Embedding Sizes

  • 384 dimensions
  • 768 dimensions
  • 1536 dimensions

Higher dimensions usually mean:

  • More expressive meaning
  • Higher cost and memory usage
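
You can check a model's output dimensionality directly. A small sketch using sentence-transformers (the same library used in the examples below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())   # 384 for this model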

What Does an Embedding Model Do Internally?

At a high level, an embedding model:

  1. Takes text as input
  2. Tokenizes text
  3. Passes tokens through a neural network
  4. Produces a dense vector

Example: Text → Embedding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

text = "What is the Q4 revenue?"
embedding = model.encode(text)

print(len(embedding))
print(embedding[:8])

Output:

384
[ 0.012, -0.087, 0.443, 0.193, -0.021, 0.054, 0.311, -0.102 ]

This vector represents the meaning, not the words.

Why Similar Questions Have Similar Embeddings

sentences = [
    "What is Q4 revenue?",
    "Tell me the revenue of fourth quarter",
    "How to cook biryani?"
]

embeddings = model.encode(sentences, normalize_embeddings=True)

from numpy import dot

def similarity(a, b):
    return dot(a, b)

print(similarity(embeddings[0], embeddings[1]))
print(similarity(embeddings[0], embeddings[2]))

👉 High similarity = similar meaning. This is the core magic behind RAG retrieval.

Why Normalization Is Critical

Without normalization:

  • Vector magnitudes differ
  • Similarity scores become unreliable

With normalization:

embedding = model.encode(
    text,
    normalize_embeddings=True
)

  • ✔ Enables cosine similarity via a simple dot product
  • ✔ Expected by many vector databases and retrieval pipelines
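
A quick sanity check, reusing model from above: a normalized embedding has a length of roughly 1.0, which is exactly what lets a plain dot product stand in for cosine similarity.

import numpy as np

emb = model.encode("What is the Q4 revenue?", normalize_embeddings=True)
print(np.linalg.norm(emb))   # approximately 1.0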

Embedding Model vs LLM

Aspect          Embedding Model            LLM
Purpose         Meaning representation     Text generation
Output          Vector (numbers)           Text
Used in         Search, RAG, clustering    Chat, reasoning
Cost            Low                        High
Deterministic   Yes                        No

👉 RAG uses both: embeddings → retrieve, LLM → generate

Where Embeddings Are Used in RAG

Embeddings appear in both phases of RAG: during indexing, every chunk is converted into a vector, and at query time, the user's question is converted with the same model so it can be compared against the stored chunk vectors.

Common Embedding Models

Model              Dimensions   Use Case
all-MiniLM-L6-v2   384          Fast, low-cost
BGE-small          384          Search-optimized
BGE-large          1024         Enterprise RAG
OpenAI Ada         1536         High accuracy

4️⃣ Store Embeddings in Vector DB (Manual – Core RAG Fundamental)

After converting chunks into embeddings, the next critical step in RAG is storing those embeddings in a way that allows fast and accurate similarity search.

This is what a Vector Database does.

Important:

Even if you use FAISS, Pinecone, or Chroma in production, interviewers want you to understand this step from scratch.

What Is a Vector Database (Conceptually)?

A vector database stores:

  • Embeddings (numerical vectors)
  • Metadata (text, source, page, chunk id)
  • A way to compare vectors and return the most similar ones

At its core, a vector DB does three things:

  1. Store vectors
  2. Compare vectors
  3. Rank results

We will implement all three manually.

Why Similarity Search Is Needed

When a user asks a question:

  1. The question is converted into an embedding
  2. That embedding is compared with all stored embeddings
  3. The most similar chunks are retrieved

This is the retrieval part of RAG.

Cosine Similarity (Manual)

Why Cosine Similarity?

Cosine similarity measures direction, not magnitude.

It answers: "How similar is the meaning of these two texts?"

Mathematical Idea (Simple)

  • Value ranges from -1 to 1
  • Higher value = more similar meaning

Because we normalized embeddings earlier, cosine similarity becomes very simple.
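
For reference, the general formula is cos(a, b) = (a · b) / (||a|| · ||b||). Here is a sketch that works even for unnormalized vectors; because our embeddings are normalized, the denominator is 1, which is why the manual implementation below can use the dot product alone.

import numpy as np

def cosine_similarity_general(a, b):
    # Full formula: handles vectors of any magnitude
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))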

Manual Implementation

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b)

Why This Works

  • a and b are normalized
  • Dot product = cosine similarity
  • Fast and reliable

Interview Tip

"Cosine similarity assumes normalized vectors."

Building a Simple Vector Store (From Scratch)

Now let's build a minimal vector database to store embeddings and metadata.

Step 1: Define the Vector Store Class

class SimpleVectorDB:
    def __init__(self):
        self.embeddings = []   # Stores vectors
        self.metadata = []     # Stores chunk info

Attribute    Purpose
embeddings   Numerical meaning (the vectors)
metadata     Text, source, chunk id

👉 Embeddings alone are useless without metadata.

Step 2: Add Embeddings to the Store

    def add(self, embedding, meta):
        self.embeddings.append(embedding)
        self.metadata.append(meta)

What Happens Here

  • Each embedding is stored at the same index as its metadata
  • Index alignment is critical
  • This mimics how real vector DBs associate vectors with documents

📌 Production Insight

Losing metadata = losing explainability.

Searching the Vector Database (Core Retrieval Logic)

This is the heart of RAG retrieval.

Step 3: Search Method

    def search(self, query_embedding, top_k=3):
        scores = []

        for i, emb in enumerate(self.embeddings):
            score = cosine_similarity(query_embedding, emb)
            scores.append((score, i))

What's Happening

  • We compare the query embedding with every stored embedding
  • This is a linear scan (O(n))
  • Fine for learning, not for large-scale production

Step 4: Sort by Similarity

        scores.sort(reverse=True)

  • Highest similarity score first
  • Most relevant chunks appear at the top

Step 5: Return Top-K Results

        results = []

        for score, idx in scores[:top_k]:
            results.append({
                "score": float(score),
                "data": self.metadata[idx]
            })

        return results

Output Structure

Each result contains:

  • Similarity score
  • Original chunk + metadata

This is what is passed to the LLM in RAG

✅ Store All Embeddings (Indexing Phase)
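
One missing piece first: the earlier snippets only embedded individual demo sentences, so we need one embedding per chunk. A minimal sketch, reusing model and all_chunks from above, so that embeddings[i] lines up with all_chunks[i]:

embeddings = model.encode(
    [c["chunk_text"] for c in all_chunks],
    normalize_embeddings=True   # unit-length vectors, ready for cosine similarity
)

With the chunk embeddings ready, build the store: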
vector_db = SimpleVectorDB()
for emb, chunk in zip(embeddings, all_chunks):
    vector_db.add(emb, chunk)

What This Represents in RAG

This is the offline indexing phase.

  • Done once (or periodically)
  • Not during user queries
  • Real systems persist this to disk or cloud
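
Real systems also replace the O(n) linear scan with an optimized index. A minimal sketch using FAISS (an assumption: faiss-cpu is installed, and embeddings is the 2-D array of normalized chunk vectors built above):

import numpy as np
import faiss   # pip install faiss-cpu

vectors = np.asarray(embeddings, dtype="float32")
index = faiss.IndexFlatIP(vectors.shape[1])   # inner product equals cosine on normalized vectors
index.add(vectors)

q = model.encode(["What is the Q4 revenue?"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(q, 3)              # top-3 most similar chunks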

End-to-End Retrieval Example

query = "What is the Q4 revenue?"

query_embedding = model.encode(
    query,
    normalize_embeddings=True
)

results = vector_db.search(query_embedding, top_k=2)

for r in results:
    print("Score:", r["score"])
    print("Text:", r["data"]["chunk_text"][:150])
Sample Output
Score: 0.84
Text: Q4 revenue for the financial year was reported as $250 million...

Score: 0.79
Text: The company achieved strong growth in the fourth quarter...
👉
These chunks are now sent to the LLM.
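
The article's focus is retrieval, but for completeness here is a hedged sketch of the generation step: build a prompt from the retrieved chunks and ask the LLM to answer only from that context. The OpenAI client and model name are assumptions; any chat-completion API works the same way.

from openai import OpenAI   # pip install openai

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

context = "\n\n".join(r["data"]["chunk_text"] for r in results)

prompt = (
    "Answer the question using ONLY the context below.\n"
    "If the answer is not in the context, say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)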

🚀 You're Now RAG Interview-Ready!

Remember: When they ask about AI hallucinations, outdated knowledge, or private data...
You now have the complete RAG answer.

Happy Coding! 👨‍💻👩‍💻

Go build something amazing with your new RAG knowledge!
