RAG Architecture Explained from Scratch (No Frameworks, Diagram & Code)

How to Explain RAG Architecture in an Interview (With Diagram & Code)

Last month, three different companies asked me the same interview question: "How do we stop our AI from making things up?" My answer was always the same three letters: R-A-G. Candidates who could clearly explain RAG architecture received offers almost 50% faster than those who couldn't.

Introduction: Why Do Interviewers Ask About RAG?

In Generative AI interviews today, one question comes up again and again:

"Can you explain RAG architecture, and why do we need it?"

If a candidate talks only about LLMs, interviewers quickly push further:

  • How do you handle hallucinations?
  • What about outdated knowledge?
  • How do you use enterprise or private data safely?

This is where Retrieval-Augmented Generation (RAG) becomes essential.

RAG directly addresses all of these problems—and that's why interviewers care so much about it.

Learn RAG Fundamentals — Manual Implementation Without Frameworks

What Is RAG (Retrieval-Augmented Generation)?

Retrieval-Augmented Generation (RAG) is a technique that combines external knowledge sources with Large Language Models (LLMs) to produce more accurate, grounded, and reliable responses.

Instead of relying only on what the model learned during training, RAG allows the model to look up relevant information at query time.

Simple Definition

RAG is an architecture where an LLM first retrieves relevant information from external data sources and then uses that information to generate a response.

This one sentence alone is often enough to pass the first round of GenAI interviews.

Problem with Normal LLMs (Why RAG Is Needed)

Traditional LLMs work only on pre-trained knowledge, which leads to several limitations:

Issue                      Normal LLM
Hallucinations             Yes
Real-time / updated data   No
Private PDFs & documents   No
Enterprise data usage      Risky

👉 RAG addresses all of these limitations by grounding responses in retrieved data.

RAG Architecture 

RAG Architecture Diagram

[Diagram: Data Sources → Embedding Model → Vector Database → Retriever → LLM (Generator)]

The diagram shows a standard, production-grade RAG pipeline. Let's break it down step by step.

Core Components of RAG

  1. Data Source (PDFs, Docs, URLs)
  2. Embedding Model
  3. Vector Database
  4. Retriever
  5. LLM (Generator)

RAG Has Two Mandatory Phases

Phase 1: Indexing (Offline / One-Time)

  • Load documents from various sources
  • Split documents into chunks
  • Convert chunks to embeddings
  • Store embeddings in vector database

Phase 2: Retrieval + Generation (Runtime)

  • User query comes in
  • Convert query to embedding
  • Retrieve relevant chunks from vector DB
  • Generate response using LLM with retrieved context
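
Before going deeper, here is a minimal, self-contained sketch of both phases in one place. It is a toy illustration only (the model name and sample strings are placeholders, and the generation step is merely indicated); each step is covered properly below.

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Phase 1: Indexing (offline) - in a real system the chunks come from your documents
chunks = ["Q4 revenue was $250 million.", "Refunds are allowed within 30 days."]
index = model.encode(chunks, normalize_embeddings=True)

# Phase 2: Retrieval + generation (runtime) - generation itself is omitted here
query_vec = model.encode("What was the Q4 revenue?", normalize_embeddings=True)
best = int(np.argmax(index @ query_vec))   # cosine similarity on normalized vectors
print(chunks[best])                        # this chunk would be passed to the LLM as context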

Phase 1: Indexing (Offline)

We'll go through all four steps in detail without using a framework; building them by hand also makes it clear why the industry typically relies on frameworks for this work.

  1. Load documents
  2. Split into chunks
  3. Convert chunks → embeddings
  4. Store embeddings in Vector DB

Without Phase-1, RAG does NOT exist

1️⃣ Load Documents (PDF, TXT, Multiple Files)

Convert raw files (PDFs, text files, etc.) into:

  • Clean, usable text
  • Structured metadata

This step is the foundation of the RAG indexing phase.

If document loading is weak, everything downstream (chunking, retrieval, generation) breaks.

Challenges Interviewers Expect You to Mention

When loading documents for RAG, interviewers look for awareness of real-world issues:

  • Encoding issues (UTF-8 vs other encodings)
  • Page separation (especially in PDFs)
  • Metadata preservation (source, page number, department, year)
  • Multiple file formats (PDF, TXT, DOC, HTML)

Mentioning these signals production-level thinking, not tutorial-level knowledge.

Example: Loading PDF and TXT Files (Manual Approach)

Below is a framework-free, manual implementation for loading PDF and TXT documents while preserving metadata.

import os
from typing import List, Dict

import PyPDF2

def load_documents(folder_path: str) -> List[Dict]:
    """
    Load PDF and TXT documents.
    Returns list of dicts with text + metadata
    """
    documents = []

    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)

        if file_name.endswith(".txt"):
            with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                text = f.read()

            documents.append({
                "text": text,
                "metadata": {
                    "source": file_name,
                    "type": "txt"
                }
            })

        elif file_name.endswith(".pdf"):
            with open(file_path, "rb") as f:
                reader = PyPDF2.PdfReader(f)

                for page_num, page in enumerate(reader.pages):
                    text = page.extract_text() or ""

                    documents.append({
                        "text": text,
                        "metadata": {
                            "source": file_name,
                            "page": page_num,
                            "type": "pdf"
                        }
                    })

    return documents

Usage & Output

docs = load_documents("./documents")

print(len(docs))
print(docs[0]["metadata"])

Output:

12
{'source': 'company_policy.pdf', 'page': 0, 'type': 'pdf'}

Each page becomes a separate document, which significantly improves:

  • Chunking accuracy
  • Retrieval precision
  • Answer traceability

Why Metadata Is NOT Optional

Metadata is critical in real-world RAG systems. Without metadata:

  • No filtering by department, year, or source
  • No document-level traceability
  • No explainability for answers
  • Hard to debug incorrect responses

With metadata, you can:

  • Filter documents (department = finance)
  • Restrict search (year = 2024)
  • Show sources in answers (Based on company_policy.pdf, page 3)
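
For example, a metadata filter is just a pre-selection step before similarity search. A small sketch (the department and year fields are illustrative assumptions; the loader above only stores source, page, and type):

# Keep only finance documents from 2024 before running similarity search
finance_2024 = [
    doc for doc in docs
    if doc["metadata"].get("department") == "finance"
    and doc["metadata"].get("year") == 2024
]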

👉 Enterprise RAG without metadata is incomplete and unsafe.

2️⃣ Split Text Into Chunks (Critical Step)

Chunking is one of the most important and most misunderstood steps in Retrieval-Augmented Generation. If chunking is wrong:

  • Retrieval quality drops
  • Answers become incomplete
  • Hallucinations increase

That's why interviewers pay close attention to how you chunk data.

🤔 Why Chunking Exists:

LLMs cannot process entire documents at once. They:

  • Have context window limits
  • Perform better on complete semantic units
  • Fail when information is randomly split

Chunking solves this by breaking documents into retrievable, meaningful units.

Naive Chunking (The Wrong Way)

Naive chunking splits text into fixed-size pieces without preserving meaning or structure.

def naive_chunk(text, size=500):
    return [text[i:i+size] for i in range(0, len(text), size)]

This approach is easy, but problematic.
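
A quick demo makes the failure visible (the string and chunk size are illustrative):

text = "The refund policy allows returns within 30 days. Items must be unused."
print(naive_chunk(text, size=40))
# The fixed-size slices split the first sentence across two chunks,
# so neither chunk carries the complete rule.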

Why It's the "Wrong Way" in RAG

1. Context Gets Broken: A chunk may

  • Start in the middle of a sentence
  • End before an important conclusion

This confuses embeddings and reduces retrieval quality.

2. Important Information Is Split: Key ideas often span:

  • Multiple sentences
  • Entire paragraphs

Naive chunking separates them, producing partial answers.

3. No Logical Boundaries: Sections like

  • Definitions
  • Policies
  • Procedures

get mixed together randomly, making retrieval unreliable.

4. Poor Retrieval Results: The vector database may return

  • Half answers
  • Irrelevant fragments
  • Chunks missing critical context

This directly leads to hallucinations or vague responses.

Why Interviewers Call It "Naive":

Interviewers use this term because naive chunking shows lack of document understanding, ignores how LLMs reason, and is usually the first beginner mistake.

👉 Knowing why it fails is more important than knowing how to code it

When Naive Chunking Is Acceptable (Rare Cases):

Naive chunking may be acceptable only for:

  • Very short documents
  • Highly structured logs
  • Quick prototypes (never production)

Smart Chunking (The Right Way)

Smart chunking splits documents along semantic and structural boundaries, uses overlap, and ensures each chunk is meaningful on its own.

Goal of Smart Chunking:

Create chunks that:

  • Preserve meaning
  • Respect document structure
  • Are easy to retrieve
  • Fit within model limits

Each chunk should stand alone during retrieval.

What Smart Chunking Means:

Smart chunking focuses on:

  • Semantic boundaries (sentences, paragraphs, sections)
  • Logical completeness (a chunk answers something useful)
  • Controlled size (model-aware)
  • Overlap (to preserve context)

Key Principles of Smart Chunking

1. Chunk by Meaning, Not Just Size

Chunks should represent a complete thought, such as:

  • A policy rule
  • A definition
  • A procedure step
  • A paragraph or section

Avoid breaking sentences or ideas mid-way.

2. Respect Document Structure:

Use natural boundaries like:

  • Paragraph breaks
  • Headings
  • Page boundaries (especially for PDFs)

This significantly improves retrieval coherence.

3. Use Overlap to Preserve Context:

Overlap helps when:

  • Information spans chunk boundaries
  • Follow-up sentences depend on earlier ones

Best practice: 10–20% of chunk size

4. Keep Chunk Size Model-Aware

There is no single perfect chunk size. Common industry practice:

300–800 tokens per chunk

Chunk size depends on:

  • Embedding model limits
  • LLM context window
  • Retrieval strategy (top-K, reranking, hybrid search)

👉 Chunking must be model-aware, not arbitrary.
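
To keep chunks model-aware, measure them in tokens rather than characters. A small sketch assuming the tiktoken package (the encoding name is an assumption and should match your embedding model or LLM):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # tokenizer used by many recent OpenAI models

def token_count(text: str) -> int:
    return len(enc.encode(text))

print(token_count("Q4 revenue for the financial year was reported as $250 million."))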

Example: Smart Chunking by Sentence Boundaries with Overlap

import re

def smart_chunk(
    text: str,
    chunk_size: int = 500,
    overlap: int = 100
):
    # Split on sentence boundaries (after ., ! or ?) so chunks keep complete thoughts
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += " " + sentence
        else:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())

            # Overlap handling: carry the last `overlap` characters
            # of the previous chunk into the next one
            current_chunk = current_chunk[-overlap:] + " " + sentence

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks

Apply Chunking to All Documents

all_chunks = []

for doc in docs:
    chunks = smart_chunk(doc["text"])
    
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "chunk_text": chunk,
            "metadata": {
                **doc["metadata"],
                "chunk_id": i
            }
        })

print(len(all_chunks))

Why This Works Better in RAG

Smart chunking is not just a preprocessing step—it directly determines retrieval quality, answer accuracy, and system reliability. Here's why it works better in real RAG systems.

1. Better Retrieval

When chunks contain complete ideas:

  • Each chunk represents a clear semantic unit
  • Embeddings capture the true meaning, not fragments
  • Similarity search becomes more accurate

As a result:

  • The vector database retrieves relevant chunks
  • Fewer irrelevant or partial matches are returned

👉 Good chunks = meaningful embeddings = better retrieval

2. Better Answers

With smart chunking:

  • Retrieved context is coherent and self-contained
  • LLMs don't need to "guess" missing information
  • Answers are more grounded in facts

This leads to:

  • Fewer hallucinations
  • More precise and complete responses
  • Better alignment with source documents

👉 LLMs perform best when context is logically complete

3. Easier Debugging (Very Important in Production)

Smart chunks make systems observable and debuggable.

You can:

  • Inspect a retrieved chunk
  • Clearly understand why it matched the query
  • Trace poor answers back to specific chunking issues

This is critical for:

  • Production troubleshooting
  • Retrieval tuning
  • Explaining system behavior to stakeholders

👉 If you can't explain why a chunk was retrieved, your RAG system is fragile.

3️⃣ Convert Chunks → Embeddings (From Scratch)

If RAG is the system, then embeddings are its foundation.

In fact, most RAG interview failures happen not because candidates don't know vector databases or LLMs — but because they don't truly understand embeddings.

Simple Definition

An embedding model converts text into a numerical vector that captures the semantic meaning of the text.

One-liner (memorize this)

Embeddings allow machines to compare meaning, not just words.

If you can confidently explain this, you've already cleared a major interview hurdle.

What Is an Embedding?

An embedding is:

  • A numerical vector
  • That represents the semantic meaning of text

In RAG systems:

  • Documents are converted into embeddings
  • User queries are converted into embeddings
  • Both use the same embedding model

This enables semantic search, not keyword matching.

🤔 Why Do We Even Need Embeddings?

Computers do not understand text the way humans do.

They understand numbers. So we convert:

Text → Numbers (Vectors)

Only then can a system:

  • Compare similarity
  • Search documents
  • Retrieve relevant knowledge

Without embeddings, RAG cannot exist.

Intuition: A Human Analogy

Think of embeddings like GPS coordinates for meaning.

Text                  Position (Vector Space)
What is Q4 revenue?   (1.2, 0.9, …)
Q4 sales amount       (1.1, 0.88, …)
How to cook pasta?    (-2.4, 3.1, …)

👉 Similar meaning → vectors are close together

What Does an Embedding Look Like?

An embedding is simply a list of numbers:

[0.021, -0.87, 0.44, 0.19, ..., 0.33]

Typical Embedding Sizes

  • 384 dimensions
  • 768 dimensions
  • 1536 dimensions

Higher dimensions usually mean:

  • More expressive meaning
  • Higher cost and memory usage
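
You can check a model's output dimensionality directly. A small sketch using sentence-transformers (the same library used in the examples below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())   # 384 for this model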

What Does an Embedding Model Do Internally?

At a high level, an embedding model:

  1. Takes text as input
  2. Tokenizes text
  3. Passes tokens through a neural network
  4. Produces a dense vector

Example: Text → Embedding

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

text = "What is the Q4 revenue?"
embedding = model.encode(text)

print(len(embedding))
print(embedding[:8])

Output:

384
[ 0.012, -0.087, 0.443, 0.193, -0.021, 0.054, 0.311, -0.102 ]

This vector represents the meaning, not the words.

Why Similar Questions Have Similar Embeddings

sentences = [
    "What is Q4 revenue?",
    "Tell me the revenue of fourth quarter",
    "How to cook biryani?"
]

embeddings = model.encode(sentences, normalize_embeddings=True)

from numpy import dot

def similarity(a, b):
    return dot(a, b)

print(similarity(embeddings[0], embeddings[1]))
print(similarity(embeddings[0], embeddings[2]))

👉 High similarity = similar meaning. This is the core magic behind RAG retrieval.

Why Normalization Is Critical

Without normalization:

  • Vector magnitudes differ
  • Similarity scores become unreliable

With normalization:

embedding = model.encode(
    text,
    normalize_embeddings=True
)

  • ✔ Enables cosine similarity via a simple dot product
  • ✔ Expected by many vector databases and retrieval pipelines
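
A quick sanity check, reusing model from above: a normalized embedding has a length of roughly 1.0, which is exactly what lets a plain dot product stand in for cosine similarity.

import numpy as np

emb = model.encode("What is the Q4 revenue?", normalize_embeddings=True)
print(np.linalg.norm(emb))   # approximately 1.0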

Embedding Model vs LLM

Aspect          Embedding Model            LLM
Purpose         Meaning representation     Text generation
Output          Vector (numbers)           Text
Used in         Search, RAG, clustering    Chat, reasoning
Cost            Low                        High
Deterministic   Yes                        No

👉 RAG uses both: embeddings → retrieve, LLM → generate

Where Embeddings Are Used in RAG

Embeddings appear in both phases of RAG: during indexing, every chunk is converted into a vector, and at query time, the user's question is converted with the same model so it can be compared against the stored chunk vectors.

Common Embedding Models

Model              Dimensions   Use Case
all-MiniLM-L6-v2   384          Fast, low-cost
BGE-small          384          Search-optimized
BGE-large          1024         Enterprise RAG
OpenAI Ada         1536         High accuracy

4️⃣ Store Embeddings in Vector DB (Manual – Core RAG Fundamental)

After converting chunks into embeddings, the next critical step in RAG is storing those embeddings in a way that allows fast and accurate similarity search.

This is what a Vector Database does.

Important:

Even if you use FAISS, Pinecone, or Chroma in production, interviewers want you to understand this step from scratch.

What Is a Vector Database (Conceptually)?

A vector database stores:

  • Embeddings (numerical vectors)
  • Metadata (text, source, page, chunk id)
  • A way to compare vectors and return the most similar ones

At its core, a vector DB does three things:

  1. Store vectors
  2. Compare vectors
  3. Rank results

We will implement all three manually.

Why Similarity Search Is Needed

When a user asks a question:

  1. The question is converted into an embedding
  2. That embedding is compared with all stored embeddings
  3. The most similar chunks are retrieved

This is the retrieval part of RAG.

Cosine Similarity (Manual)

Why Cosine Similarity?

Cosine similarity measures direction, not magnitude.

It answers: "How similar is the meaning of these two texts?"

Mathematical Idea (Simple)

  • Value ranges from -1 to 1
  • Higher value = more similar meaning

Because we normalized embeddings earlier, cosine similarity becomes very simple.
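
For reference, the general formula is cos(a, b) = (a · b) / (||a|| · ||b||). Here is a sketch that works even for unnormalized vectors; because our embeddings are normalized, the denominator is 1, which is why the manual implementation below can use the dot product alone.

import numpy as np

def cosine_similarity_general(a, b):
    # Full formula: handles vectors of any magnitude
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))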

Manual Implementation

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b)

Why This Works

  • a and b are normalized
  • Dot product = cosine similarity
  • Fast and reliable

Interview Tip

"Cosine similarity assumes normalized vectors."

Building a Simple Vector Store (From Scratch)

Now let's build a minimal vector database to store embeddings and metadata.

Step 1: Define the Vector Store Class

class SimpleVectorDB:
    def __init__(self):
        self.embeddings = []   # Stores vectors
        self.metadata = []     # Stores chunk info

Attribute    Purpose
embeddings   Numerical meaning (the vectors)
metadata     Text, source, chunk id

👉 Embeddings alone are useless without metadata.

Step 2: Add Embeddings to the Store

    def add(self, embedding, meta):
        self.embeddings.append(embedding)
        self.metadata.append(meta)

What Happens Here

  • Each embedding is stored at the same index as its metadata
  • Index alignment is critical
  • This mimics how real vector DBs associate vectors with documents

📌 Production Insight

Losing metadata = losing explainability.

Searching the Vector Database (Core Retrieval Logic)

This is the heart of RAG retrieval.

Step 3: Search Method

    def search(self, query_embedding, top_k=3):
        scores = []

        for i, emb in enumerate(self.embeddings):
            score = cosine_similarity(query_embedding, emb)
            scores.append((score, i))

What's Happening

  • We compare the query embedding with every stored embedding
  • This is a linear scan (O(n))
  • Fine for learning, not for large-scale production

Step 4: Sort by Similarity

        scores.sort(reverse=True)

  • Highest similarity score first
  • Most relevant chunks appear at the top

Step 5: Return Top-K Results

        results = []

        for score, idx in scores[:top_k]:
            results.append({
                "score": float(score),
                "data": self.metadata[idx]
            })

        return results

Output Structure

Each result contains:

  • Similarity score
  • Original chunk + metadata

This is what is passed to the LLM in RAG

✅ Store All Embeddings (Indexing Phase)
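
One missing piece first: the earlier snippets only embedded individual demo sentences, so we need one embedding per chunk. A minimal sketch, reusing model and all_chunks from above, so that embeddings[i] lines up with all_chunks[i]:

embeddings = model.encode(
    [c["chunk_text"] for c in all_chunks],
    normalize_embeddings=True   # unit-length vectors, ready for cosine similarity
)

With the chunk embeddings ready, build the store: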
vector_db = SimpleVectorDB()
for emb, chunk in zip(embeddings, all_chunks):
    vector_db.add(emb, chunk)

What This Represents in RAG

This is the offline indexing phase.

  • Done once (or periodically)
  • Not during user queries
  • Real systems persist this to disk or cloud
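
Real systems also replace the O(n) linear scan with an optimized index. A minimal sketch using FAISS (an assumption: faiss-cpu is installed, and embeddings is the 2-D array of normalized chunk vectors built above):

import numpy as np
import faiss   # pip install faiss-cpu

vectors = np.asarray(embeddings, dtype="float32")
index = faiss.IndexFlatIP(vectors.shape[1])   # inner product equals cosine on normalized vectors
index.add(vectors)

q = model.encode(["What is the Q4 revenue?"], normalize_embeddings=True).astype("float32")
scores, ids = index.search(q, 3)              # top-3 most similar chunks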

End-to-End Retrieval Example

query = "What is the Q4 revenue?"

query_embedding = model.encode(
    query,
    normalize_embeddings=True
)

results = vector_db.search(query_embedding, top_k=2)

for r in results:
    print("Score:", r["score"])
    print("Text:", r["data"]["chunk_text"][:150])
Sample Output
Score: 0.84
Text: Q4 revenue for the financial year was reported as $250 million...

Score: 0.79
Text: The company achieved strong growth in the fourth quarter...
👉
These chunks are now sent to the LLM.
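
The article's focus is retrieval, but for completeness here is a hedged sketch of the generation step: build a prompt from the retrieved chunks and ask the LLM to answer only from that context. The OpenAI client and model name are assumptions; any chat-completion API works the same way.

from openai import OpenAI   # pip install openai

client = OpenAI()   # assumes OPENAI_API_KEY is set in the environment

context = "\n\n".join(r["data"]["chunk_text"] for r in results)

prompt = (
    "Answer the question using ONLY the context below.\n"
    "If the answer is not in the context, say you don't know.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",   # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)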

🚀 You're Now RAG Interview-Ready!

Remember: When they ask about AI hallucinations, outdated knowledge, or private data...
You now have the complete RAG answer.

Happy Coding! 👨‍💻👩‍💻

Go build something amazing with your new RAG knowledge!
