How to Explain RAG Architecture in Interview (With Diagram & Code)
Last month, three different companies asked me the same interview question: "How do we stop our AI from making things up?" My answer was always the same three letters: R-A-G. Candidates who could clearly explain RAG architecture received offers almost 50% faster than those who couldn't.
Introduction: Why Do Interviewers Ask About RAG?
In Generative AI interviews today, one question comes up again and again:
"Can you explain RAG architecture, and why do we need it?"
If a candidate talks only about LLMs, interviewers quickly push further:
- How do you handle hallucinations?
- What about outdated knowledge?
- How do you use enterprise or private data safely?
This is where Retrieval-Augmented Generation (RAG) becomes essential.
RAG directly addresses all of these problems—and that's why interviewers care so much about it.
Learn RAG Fundamentals — Manual Implementation Without Frameworks
What Is RAG (Retrieval-Augmented Generation)?
Retrieval-Augmented Generation (RAG) is a technique that combines external knowledge sources with Large Language Models (LLMs) to produce more accurate, grounded, and reliable responses.
Instead of relying only on what the model learned during training, RAG allows the model to look up relevant information at query time.
Simple Definition
RAG is an architecture where an LLM first retrieves relevant information from external data sources and then uses that information to generate a response.
This one sentence alone is often enough to pass the first round of GenAI interviews.
Problem with Normal LLMs (Why RAG Is Needed)
Traditional LLMs work only on pre-trained knowledge, which leads to several limitations:
| Issue | Normal LLM |
|---|---|
| Hallucinations | Yes |
| Real-time / updated data | No |
| Private PDFs & documents | No |
| Enterprise data usage | Risky |
RAG Architecture
RAG Architecture Diagram
The image represents a standard, production-grade RAG pipeline. Let's break it down step by step.
Core Components of RAG
- Data Source (PDFs, Docs, URLs)
- Embedding Model
- Vector Database
- Retriever
- LLM (Generator)
RAG Has Two Mandatory Phases
Phase 1: Indexing (Offline / One-Time)
- Load documents from various sources
- Split documents into chunks
- Convert chunks to embeddings
- Store embeddings in vector database
Phase 2: Retrieval + Generation (Runtime)
- User query comes in
- Convert query to embedding
- Retrieve relevant chunks from vector DB
- Generate response using LLM with retrieved context
Phase 1: Indexing (Offline)
We'll go through all four steps in detail without using a framework, and along the way you'll see exactly what RAG frameworks automate for you in production.
- Load documents
- Split into chunks
- Convert chunks → embeddings
- Store embeddings in Vector DB
Without Phase-1, RAG does NOT exist
1️⃣ Load Documents (PDF, TXT, Multiple Files)
Convert raw files (PDFs, text files, etc.) into:
- Clean, usable text
- Structured metadata
This step is the foundation of the RAG indexing phase.
If document loading is weak, everything downstream (chunking, retrieval, generation) breaks.
Challenges Interviewers Expect You to Mention
When loading documents for RAG, interviewers look for awareness of real-world issues:
- Encoding issues (UTF-8 vs other encodings)
- Page separation (especially in PDFs)
- Metadata preservation (source, page number, department, year)
- Multiple file formats (PDF, TXT, DOC, HTML)
Mentioning these signals production-level thinking, not tutorial-level knowledge.
Example: Loading PDF and TXT Files (Manual Approach)
Below is a framework-free, manual implementation for loading PDF and TXT documents while preserving metadata.
import os
from typing import List, Dict

import PyPDF2


def load_documents(folder_path: str) -> List[Dict]:
    """
    Load PDF and TXT documents.
    Returns a list of dicts with text + metadata.
    """
    documents = []

    for file_name in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file_name)

        if file_name.endswith(".txt"):
            with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                text = f.read()
            documents.append({
                "text": text,
                "metadata": {
                    "source": file_name,
                    "type": "txt"
                }
            })

        elif file_name.endswith(".pdf"):
            with open(file_path, "rb") as f:
                reader = PyPDF2.PdfReader(f)
                for page_num, page in enumerate(reader.pages):
                    text = page.extract_text() or ""
                    documents.append({
                        "text": text,
                        "metadata": {
                            "source": file_name,
                            "page": page_num,
                            "type": "pdf"
                        }
                    })

    return documents
Usage & Output
docs = load_documents("./documents")
print(len(docs))
print(docs[0]["metadata"])
Output:
12
{'source': 'company_policy.pdf', 'page': 0, 'type': 'pdf'}
Each page becomes a separate document, which significantly improves:
- Chunking accuracy
- Retrieval precision
- Answer traceability
Why Metadata Is NOT Optional
Metadata is critical in real-world RAG systems. Without metadata:
- No filtering by department, year, or source
- No document-level traceability
- No explainability for answers
- Hard to debug incorrect responses
With metadata, you can:
- Filter documents (department = finance)
- Restrict search (year = 2024)
- Show sources in answers (Based on company_policy.pdf, page 3)
👉 Enterprise RAG without metadata is incomplete and unsafe.
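As a quick illustration, here is a minimal, hypothetical metadata filter on top of the documents returned by load_documents above. The department and year fields are assumptions for the example; they would come from your own loading logic, not from the code earlier in this article.

```python
from typing import List, Dict

def filter_documents(documents: List[Dict], **criteria) -> List[Dict]:
    """Keep only documents whose metadata matches every given key/value pair."""
    return [
        doc for doc in documents
        if all(doc["metadata"].get(key) == value for key, value in criteria.items())
    ]

# Hypothetical usage: restrict retrieval to finance documents from 2024
finance_docs = filter_documents(docs, department="finance", year=2024)
```

Production vector databases expose the same idea as metadata filters on the search call itself, so you rarely write this loop by hand, but the concept is identical.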
2️⃣ Split Text Into Chunks (Critical Step)
Chunking is one of the most important and most misunderstood steps in Retrieval-Augmented Generation. If chunking is wrong:
- Retrieval quality drops
- Answers become incomplete
- Hallucinations increase
That's why interviewers pay close attention to how you chunk data.
🤔 Why Chunking Exists:
LLMs cannot process entire documents at once. They:
- Have context window limits
- Perform better on complete semantic units
- Fail when information is randomly split
Chunking solves this by breaking documents into retrievable, meaningful units.
Naive Chunking (The Wrong Way)
Naive chunking splits text into fixed-size pieces without preserving meaning or structure.
def naive_chunk(text, size=500):
    return [text[i:i+size] for i in range(0, len(text), size)]
This approach is easy, but problematic.
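To see the problem concretely, here is a tiny demo with a made-up sentence and an artificially small chunk size:

```python
text = "RAG retrieval quality depends on chunking."

print(naive_chunk(text, size=15))
# ['RAG retrieval q', 'uality depends ', 'on chunking.']
# "quality" is split across two chunks, so neither chunk carries the full idea
```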
Why It's the "Wrong Way" in RAG
1. Context Gets Broken: A chunk may
- Start in the middle of a sentence
- End before an important conclusion
This confuses embeddings and reduces retrieval quality.
2. Important Information Is Split: Key ideas often span:
- Multiple sentences
- Entire paragraphs
Naive chunking separates them, producing partial answers.
3. No Logical Boundaries: Sections like
- Definitions
- Policies
- Procedures
get mixed together randomly, making retrieval unreliable.
4. Poor Retrieval Results: The vector database may return
- Half answers
- Irrelevant fragments
- Chunks missing critical context
This directly leads to hallucinations or vague responses.
Why Interviewers Call It "Naive":
Interviewers use this term because naive chunking shows a lack of document understanding, ignores how LLMs consume context, and is usually the first beginner mistake.
👉 Knowing why it fails is more important than knowing how to code it.
When Naive Chunking Is Acceptable (Rare Cases):
Naive chunking may be acceptable only for:
- Very short documents
- Highly structured logs
- Quick prototypes (never production)
Smart Chunking (The Right Way)
Smart chunking splits documents along semantic and structural boundaries, uses overlap, and ensures each chunk is meaningful on its own.
Goal of Smart Chunking:
Create chunks that:
- Preserve meaning
- Respect document structure
- Are easy to retrieve
- Fit within model limits
Each chunk should stand alone during retrieval.
What Smart Chunking Means:
Smart chunking focuses on:
- Semantic boundaries (sentences, paragraphs, sections)
- Logical completeness (a chunk answers something useful)
- Controlled size (model-aware)
- Overlap (to preserve context)
Key Principles of Smart Chunking
1. Chunk by Meaning, Not Just Size
Chunks should represent a complete thought, such as:
- A policy rule
- A definition
- A procedure step
- A paragraph or section
Avoid breaking sentences or ideas mid-way.
2. Respect Document Structure:
Use natural boundaries like:
- Paragraph breaks
- Headings
- Page boundaries (especially for PDFs)
This significantly improves retrieval coherence.
3. Use Overlap to Preserve Context:
Overlap helps when:
- Information spans chunk boundaries
- Follow-up sentences depend on earlier ones
Best practice: 10–20% of chunk size
4. Keep Chunk Size Model-Aware
There is no single perfect chunk size. Common industry practice:
300–800 tokens per chunk
Chunk size depends on:
- Embedding model limits
- LLM context window
- Retrieval strategy (top-K, reranking, hybrid search)
👉 Chunking must be model-aware, not arbitrary.
Example: Smart Chunking by Sentence with Overlap
import re


def smart_chunk(
    text: str,
    chunk_size: int = 500,
    overlap: int = 100
):
    # split on sentence boundaries so chunks don't cut ideas mid-sentence
    sentences = re.split(r'(?<=[.!?])\s+', text)

    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) <= chunk_size:
            current_chunk += " " + sentence
        else:
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
            # carry the tail of the previous chunk forward as overlap
            current_chunk = current_chunk[-overlap:] + " " + sentence

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks
Apply Chunking to All Documents
all_chunks = []

for doc in docs:
    chunks = smart_chunk(doc["text"])
    for i, chunk in enumerate(chunks):
        all_chunks.append({
            "chunk_text": chunk,
            "metadata": {
                **doc["metadata"],
                "chunk_id": i
            }
        })

print(len(all_chunks))
Why This Works Better in RAG
Smart chunking is not just a preprocessing step—it directly determines retrieval quality, answer accuracy, and system reliability. Here's why it works better in real RAG systems.
1. Better Retrieval
When chunks contain complete ideas:
- Each chunk represents a clear semantic unit
- Embeddings capture the true meaning, not fragments
- Similarity search becomes more accurate
As a result:
- The vector database retrieves relevant chunks
- Fewer irrelevant or partial matches are returned
👉 Good chunks = meaningful embeddings = better retrieval
2. Better Answers
With smart chunking:
- Retrieved context is coherent and self-contained
- LLMs don't need to "guess" missing information
- Answers are more grounded in facts
This leads to:
- Fewer hallucinations
- More precise and complete responses
- Better alignment with source documents
👉 LLMs perform best when context is logically complete
3. Easier Debugging (Very Important in Production)
Smart chunks make systems observable and debuggable.
You can:
- Inspect a retrieved chunk
- Clearly understand why it matched the query
- Trace poor answers back to specific chunking issues
This is critical for:
- Production troubleshooting
- Retrieval tuning
- Explaining system behavior to stakeholders
👉 If you can't explain why a chunk was retrieved, your RAG system is fragile.
3️⃣ Convert Chunks → Embeddings (From Scratch)
If RAG is the system, then embeddings are its foundation.
In fact, most RAG interview failures happen not because candidates don't know vector databases or LLMs — but because they don't truly understand embeddings.
Simple Definition
An embedding model converts text into a numerical vector that captures the semantic meaning of the text.
One-liner (memorize this)
Embeddings allow machines to compare meaning, not just words.
If you can confidently explain this, you've already cleared a major interview hurdle.
What Is an Embedding?
An embedding is:
- A numerical vector
- That represents the semantic meaning of text
In RAG systems:
- Documents are converted into embeddings
- User queries are converted into embeddings
- Both use the same embedding model
This enables semantic search, not keyword matching.
🤔 Why Do We Even Need Embeddings?
Computers do not understand text the way humans do.
They understand numbers. So we convert:
Text → Numbers (Vectors)
Only then can a system:
- Compare similarity
- Search documents
- Retrieve relevant knowledge
Without embeddings, RAG cannot exist.
Intuition: A Human Analogy
Think of embeddings like GPS coordinates for meaning.
| Text | Meaning Position (Vector Space) |
|---|---|
| What is Q4 revenue? | (1.2, 0.9, …) |
| Q4 sales amount | (1.1, 0.88, …) |
| How to cook pasta? | (-2.4, 3.1, …) |
What Does an Embedding Look Like?
An embedding is simply a list of numbers:
[0.021, -0.87, 0.44, 0.19, ..., 0.33]
Typical Embedding Sizes
- 384 dimensions
- 768 dimensions
- 1536 dimensions
Higher dimensions usually mean:
- More expressive meaning
- Higher cost and memory usage (see the quick estimate below)
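For intuition, here is a rough back-of-the-envelope estimate of raw storage cost, assuming float32 vectors (4 bytes per dimension) and one million chunks:

```python
num_chunks = 1_000_000
bytes_per_float = 4  # float32

for dims in (384, 768, 1536):
    gb = num_chunks * dims * bytes_per_float / (1024 ** 3)
    print(f"{dims} dims -> ~{gb:.1f} GB of raw vectors")

# 384 dims  -> ~1.4 GB
# 768 dims  -> ~2.9 GB
# 1536 dims -> ~5.7 GB (before any index overhead)
```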
What Does an Embedding Model Do Internally?
At a high level, an embedding model:
- Takes text as input
- Tokenizes text
- Passes tokens through a neural network
- Produces a dense vector
Example: Text → Embedding
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
text = "What is the Q4 revenue?"
embedding = model.encode(text)
print(len(embedding))
print(embedding[:8])
Output:
384
[ 0.012, -0.087, 0.443, 0.193, -0.021, 0.054, 0.311, -0.102 ]
This vector represents the meaning, not the words.
Why Similar Questions Have Similar Embeddings
from numpy import dot

sentences = [
    "What is Q4 revenue?",
    "Tell me the revenue of fourth quarter",
    "How to cook biryani?"
]

embeddings = model.encode(sentences, normalize_embeddings=True)

def similarity(a, b):
    return dot(a, b)  # with normalized vectors, the dot product is the cosine similarity

# the two revenue questions should score much higher than the cooking question
print(similarity(embeddings[0], embeddings[1]))
print(similarity(embeddings[0], embeddings[2]))
Why Normalization Is Critical
Without normalization:
- Vector magnitudes differ
- Similarity scores become unreliable

With normalization:

embedding = model.encode(text, normalize_embeddings=True)

- ✔ Enables cosine similarity with a plain dot product
- ✔ Matches what most vector databases expect for cosine search
Embedding Model vs LLM
| Aspect | Embedding Model | LLM |
|---|---|---|
| Purpose | Meaning representation | Text generation |
| Output | Vector (numbers) | Text |
| Used in | Search, RAG, clustering | Chat, reasoning |
| Cost | Low | High |
| Deterministic | Yes | No |
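The "Deterministic" row is worth being able to defend. With the sentence-transformers model loaded earlier, you can verify it directly:

```python
import numpy as np

# encoding the same text twice produces the same vector (no sampling involved)
v1 = model.encode("What is the Q4 revenue?")
v2 = model.encode("What is the Q4 revenue?")
print(np.allclose(v1, v2))  # True
```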
Where Embeddings Are Used in RAG
- Indexing: every chunk is converted to an embedding and stored in the vector database
- Retrieval: the user query is converted to an embedding with the same model and compared against the stored vectors
Common Embedding Models
| Model | Dimensions | Use Case |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast, low-cost |
| BGE-small | 384 | Search-optimized |
| BGE-large | 1024 | Enterprise RAG |
| OpenAI Ada | 1536 | High accuracy |
4️⃣ Store Embeddings in Vector DB (Manual – Core RAG Fundamental)
After converting chunks into embeddings, the next critical step in RAG is storing those embeddings in a way that allows fast and accurate similarity search.
This is what a Vector Database does.
Important:
Even if you use FAISS, Pinecone, or Chroma in production, interviewers want you to understand this step from scratch.
What Is a Vector Database (Conceptually)?
A vector database stores:
- Embeddings (numerical vectors)
- Metadata (text, source, page, chunk id)
- A way to compare vectors and return the most similar ones
At its core, a vector DB does three things:
- Store vectors
- Compare vectors
- Rank results
We will implement all three manually.
Why Similarity Search Is Needed
When a user asks a question:
- The question is converted into an embedding
- That embedding is compared with all stored embeddings
- The most similar chunks are retrieved
This is the retrieval part of RAG.
Cosine Similarity (Manual)
Why Cosine Similarity?
Cosine similarity measures direction, not magnitude.
It answers: "How similar is the meaning of these two texts?"
Mathematical Idea (Simple)
- cosine_similarity(a, b) = (a · b) / (||a|| × ||b||)
- Value ranges from -1 to 1
- Higher value = more similar meaning
Because we normalized embeddings earlier, cosine similarity becomes very simple.
Manual Implementation
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b)  # a and b are already normalized
Why This Works
- a and b are normalized
- Dot product = cosine similarity
- Fast and reliable
Interview Tip
"Cosine similarity assumes normalized vectors."
Building a Simple Vector Store (From Scratch)
Now let's build a minimal vector database to store embeddings and metadata.
Step 1: Define the Vector Store Class
class SimpleVectorDB:
    def __init__(self):
        self.embeddings = []  # Stores vectors
        self.metadata = []    # Stores chunk info
| Attribute | Purpose |
|---|---|
| embeddings | Numerical meaning |
| metadata | Text, source, chunk id |
Step 2: Add Embeddings to the Store
def add(self, embedding, meta):
    self.embeddings.append(embedding)
    self.metadata.append(meta)
What Happens Here
- Each embedding is stored at the same index as its metadata
- Index alignment is critical
- This mimics how real vector DBs associate vectors with documents
📌 Production Insight
Losing metadata = losing explainability.
Searching the Vector Database (Core Retrieval Logic)
This is the heart of RAG retrieval.
Step 3: Search Method
def search(self, query_embedding, top_k=3):
    scores = []
    for i, emb in enumerate(self.embeddings):
        score = cosine_similarity(query_embedding, emb)
        scores.append((score, i))
What's Happening
- We compare the query embedding with every stored embedding
- This is a linear scan (O(n))
- Fine for learning, not for large-scale production
Step 4: Sort by Similarity
scores.sort(reverse=True)
- Highest similarity score first
- Most relevant chunks appear at the top
Step 5: Return Top-K Results
results = []
for score, idx in scores[:top_k]:
    results.append({
        "score": float(score),
        "data": self.metadata[idx]
    })
return results
Output Structure
Each result contains:
- Similarity score
- Original chunk + metadata
This is the context that gets passed to the LLM in the generation step of RAG.
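At scale, the linear scan above is replaced by an optimized index. For reference, here is roughly what the same idea looks like in FAISS (one of the libraries mentioned earlier), using random placeholder vectors just to show the shape of the API. With normalized vectors, an inner-product index behaves like cosine similarity. Treat this as a sketch, not production code.

```python
import numpy as np
import faiss  # pip install faiss-cpu

dim = 384                                  # must match the embedding model
index = faiss.IndexFlatIP(dim)             # exact inner-product search

# placeholder data: 1,000 random, normalized float32 vectors
vectors = np.random.rand(1000, dim).astype("float32")
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
index.add(vectors)

query = vectors[:1]                        # pretend the first vector is the query
scores, ids = index.search(query, 3)       # top-3 most similar rows
```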
✅ Store All Embeddings (Indexing Phase)
# embed every chunk with the same model used for queries
chunk_embeddings = model.encode([c["chunk_text"] for c in all_chunks], normalize_embeddings=True)

vector_db = SimpleVectorDB()
for emb, chunk in zip(chunk_embeddings, all_chunks):
    vector_db.add(emb, chunk)
What This Represents in RAG
This is the offline indexing phase.
- Done once (or periodically)
- Not during user queries
- Real systems persist this index to disk or cloud storage (a minimal sketch follows below)
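A minimal way to persist the simple store between runs might look like the sketch below, using a NumPy file for the vectors and JSON for the metadata. Real systems would use a proper vector database or managed service instead.

```python
import json
import numpy as np

# save once, after indexing
np.save("embeddings.npy", np.asarray(vector_db.embeddings, dtype="float32"))
with open("metadata.json", "w", encoding="utf-8") as f:
    json.dump(vector_db.metadata, f)

# reload at startup instead of re-indexing every time
restored = SimpleVectorDB()
restored.embeddings = list(np.load("embeddings.npy"))
with open("metadata.json", "r", encoding="utf-8") as f:
    restored.metadata = json.load(f)
```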
End-to-End Retrieval Example
query = "What is the Q4 revenue?"
query_embedding = model.encode(
query,
normalize_embeddings=True
)
results = vector_db.search(query_embedding, top_k=2)
for r in results:
print("Score:", r["score"])
print("Text:", r["data"]["chunk_text"][:150])
Sample Output
Score: 0.84
Text: Q4 revenue for the financial year was reported as $250 million...
Score: 0.79
Text: The company achieved strong growth in the fourth quarter...
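The final step of Phase 2 is generation: the retrieved chunks become the context of a prompt. The exact LLM call depends on your provider, so call_llm below is only a placeholder; the prompt construction is the part interviewers care about.

```python
def build_prompt(question: str, results: list) -> str:
    # stitch retrieved chunks (with their sources) into a grounded prompt
    context = "\n\n".join(
        f"[{r['data']['metadata']['source']}] {r['data']['chunk_text']}"
        for r in results
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt(query, results)
# answer = call_llm(prompt)  # placeholder: OpenAI, Anthropic, a local model, etc.
```

That explicit "use only the context" instruction is what grounds the answer and reduces hallucinations, closing the loop with the problems listed at the start of the article.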
🚀 You're Now RAG Interview-Ready!
Remember: When they ask about AI hallucinations, outdated knowledge, or private data...
You now have the complete RAG answer.
Happy Coding! 👨💻👩💻
Go build something amazing with your new RAG knowledge!
