RAG & LLM System

Problem

Design a Retrieval-Augmented Generation (RAG) system: an LLM-powered Q&A service that grounds its answers in a private knowledge base instead of relying only on the model's training data.


Why It Matters for You

Direct relevance: the AWS Bedrock LLM chat application at HCLTech. The Flask + Bedrock app you built IS a RAG system; use it as an interview war story.


Functional Requirements

  • Ingest documents from various sources (PDF, web, DB)
  • Answer user questions using an LLM grounded in ingested docs
  • Keep knowledge base up to date (re-indexing)

Non-Functional Requirements

  • Low latency response (< 3s end-to-end)
  • Accurate retrieval (relevance)
  • Scalable to large corpora

High-Level Design

Documents → Chunker → Embedder → Vector DB (Pinecone/Weaviate/pgvector)
                                          ↓
User Query → Embedder → Similarity Search → Top-K Chunks
                                          ↓
                             LLM (Bedrock/GPT) + Chunks → Answer
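The retrieval half of the pipeline above can be sketched in a few lines. This is a toy, self-contained version: the bag-of-words `embed` function stands in for a real embedding model (e.g. Bedrock Titan or ada-002), and the in-memory list stands in for a vector DB. The chunk texts and the prompt template are illustrative placeholders, not part of any real API.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'. A real system would call an
    embedding model here and get back a dense float vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Similarity search: rank all chunks against the query embedding
    and return the top-k. A vector DB does this with an ANN index."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "Our refund policy allows returns within 30 days.",
    "The API rate limit is 100 requests per minute.",
    "Support is available Monday through Friday.",
]
top = retrieve("how many requests per minute can I make", chunks, k=1)

# The retrieved chunks are stuffed into the LLM prompt as grounding context.
prompt = f"Answer using only this context:\n{top[0]}\n\nQuestion: ..."
```

The same shape holds at scale: swap `embed` for an embedding API call and `retrieve` for a vector DB query; the prompt-assembly step is unchanged.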

Key Components

Component           Options
Embedding Model     OpenAI ada-002, HuggingFace
Vector DB           Pinecone, Weaviate, pgvector, FAISS
LLM                 AWS Bedrock (Claude), GPT-4, Llama
Orchestration       LangChain, LlamaIndex
Chunking Strategy   Fixed size, recursive, semantic
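The simplest of the chunking strategies listed, fixed size with overlap, can be sketched as below. The `size`/`overlap` values are arbitrary illustrations (real systems often chunk by tokens, not characters), and the sketch assumes `size > overlap`.

```python
def chunk_text(text, size=200, overlap=50):
    """Fixed-size character chunking with overlap, so content that
    straddles a boundary still appears whole in at least one chunk."""
    chunks = []
    step = size - overlap  # assumes size > overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last chunk already reaches the end of the text
    return chunks
```

Recursive and semantic chunking refine this: recursive splitters try paragraph, then sentence, then character boundaries; semantic chunkers split where embedding similarity between adjacent spans drops.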

Key Tradeoffs

  • Chunk size — small = precise retrieval, large = more context
  • Retrieval — dense (vector similarity) vs sparse (BM25) vs hybrid
  • Re-ranking — add a cross-encoder after retrieval for better accuracy
  • Caching — cache embeddings + common query results

Notes

<!-- Add as you study Gaurav Sen W8 — Sat May 9 -->