RAG & LLM System
Problem
Design a Retrieval-Augmented Generation (RAG) system — an LLM-powered Q&A service that grounds its answers in a private knowledge base.
Why It Matters for You
Direct relevance: the AWS Bedrock LLM chat application at HCLTech. The Flask + Bedrock app you built IS a RAG system; use it as an interview war story.
Functional Requirements
- Ingest documents from various sources (PDF, web, DB)
- Answer user questions using an LLM grounded in ingested docs
- Keep knowledge base up to date (re-indexing)
Non-Functional Requirements
- Low-latency responses (< 3 s end-to-end)
- Accurate retrieval (relevance)
- Scalable to large corpora
High-Level Design
```
Documents  → Chunker → Embedder → Vector DB (Pinecone/Weaviate/pgvector)
                                       ↓
User Query → Embedder → Similarity Search → Top-K Chunks
                                       ↓
                      LLM (Bedrock/GPT) + Chunks → Answer
```
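The query-time half of the pipeline above can be sketched end-to-end. This is a minimal illustration, not production code: the `embed` function below is a toy bag-of-words stand-in for a real embedding model (Bedrock Titan, OpenAI, sentence-transformers), and the vector DB is a plain in-memory list.

```python
import re
import numpy as np

# Toy stand-in for a real embedding model: a normalized bag-of-words
# vector over a tiny fixed vocabulary. Illustration only.
VOCAB = ["flask", "bedrock", "rag", "vector", "chunks", "llm", "index", "query", "context"]

def embed(text: str) -> np.ndarray:
    words = re.findall(r"[a-z]+", text.lower())
    vec = np.array([words.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Cosine similarity = dot product of unit vectors; take top-k chunks.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: float(embed(c) @ q), reverse=True)
    return ranked[:k]

chunks = [
    "The Flask app calls Bedrock to answer questions.",
    "Chunk documents before building the vector index.",
    "The LLM receives the top-k chunks as grounding context.",
]
top = retrieve("how does the llm use chunks", chunks)
# Stuff the retrieved chunks into the prompt, then call the LLM.
prompt = "Answer using only this context:\n" + "\n".join(top) + "\nQ: ..."
```

In the real system, `embed` is an API call, the similarity search runs inside the vector DB's ANN index, and `prompt` is sent to Bedrock/GPT.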
Key Components
| Component | Options |
|---|---|
| Embedding Model | OpenAI (text-embedding-3, ada-002), HuggingFace sentence-transformers |
| Vector DB | Pinecone, Weaviate, pgvector, FAISS |
| LLM | AWS Bedrock (Claude), GPT-4, Llama |
| Orchestration | LangChain, LlamaIndex |
| Chunking Strategy | Fixed size, recursive, semantic |
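The simplest chunking strategy from the table, fixed size with overlap, fits in a few lines. The overlap ensures text split at a chunk boundary still appears whole in at least one chunk; sizes here are character counts, chosen arbitrarily for illustration.

```python
def chunk_fixed(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Fixed-size chunking with overlap. Each chunk starts `size - overlap`
    characters after the previous one, so neighbors share `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Recursive chunking (split on headings, then paragraphs, then sentences) and semantic chunking (split where embedding similarity drops) follow the same interface but respect document structure better.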
Key Tradeoffs
- Chunk size — small chunks give more precise retrieval; large chunks give the LLM more context per hit
- Retrieval — dense (vector similarity) vs sparse (BM25) vs hybrid
- Re-ranking — add a cross-encoder after retrieval for better accuracy
- Caching — cache embeddings + common query results
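For the dense-vs-sparse-vs-hybrid tradeoff, a common way to combine the two result lists without tuning score scales is Reciprocal Rank Fusion (RRF). A minimal sketch, assuming each retriever already returns a ranked list of document IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    across the ranked lists it appears in. k=60 is the commonly used
    constant; it dampens the influence of any single list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # ranked output of vector similarity search
sparse = ["d1", "d4", "d3"]  # ranked output of BM25
fused = rrf([dense, sparse])
```

Because RRF uses only ranks, not raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.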