# System Design Basics

## HLD Process Flow
```mermaid
flowchart TD
Requirements["Requirements <br/> Functional & Non-Functional"]
HLD["High-Level Design <br/> Architecture & Components"]
DataModel["Data Model & Flow <br/> Entities & Interactions"]
NFR["Non-Functional Requirements <br/> Quality Attributes"]
subgraph Quality Pillars
Scalability["Scalability <br/> Handle Growth"]
Reliability["Reliability <br/> Correct Under Failure"]
Availability["Availability <br/> Minimize Downtime"]
Maintainability["Maintainability <br/> Easy to Change"]
end
Requirements --> HLD
HLD --> DataModel
DataModel --> NFR
NFR --> Scalability
NFR --> Reliability
NFR --> Availability
NFR --> Maintainability
Scalability -. drives .-> HLD
Reliability -. drives .-> HLD
Availability -. drives .-> HLD
Maintainability -. drives .-> HLD
```
## Evaluation Criteria

Any system design solution should be evaluated on three criteria:
- Simplicity — Is it as simple as it needs to be?
- Fidelity — Does it cover all functional and non-functional requirements?
- Cost Effectiveness — Is the infra spend justified?
A software architect's core job is to solve the business problem, not to build the most sophisticated system.
## Interview Framework
- Clarify requirements — functional (what it does) + non-functional (scale, latency, availability)
- Estimate scale — QPS, storage, bandwidth
- High-level design — components, data flow, APIs
- Deep dive — bottlenecks, tradeoffs, failure handling
- Wrap up — summarise decisions, open questions
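The "estimate scale" step can be sketched as a few lines of arithmetic. The numbers below (10M DAU, 20 requests per user per day, 2 KB per write, 10% write ratio, peak ≈ 2× average) are illustrative assumptions, not figures from any real system:

```python
# Back-of-envelope scale estimation (step 2 of the framework).
SECONDS_PER_DAY = 86_400

def estimate(dau: int, requests_per_user: int,
             bytes_per_write: int, write_ratio: float) -> dict:
    """Rough QPS and daily storage growth from daily-active-user counts."""
    total_requests = dau * requests_per_user
    avg_qps = total_requests / SECONDS_PER_DAY
    peak_qps = avg_qps * 2  # common rule of thumb: peak is ~2x average
    writes_per_day = total_requests * write_ratio
    storage_per_day = writes_per_day * bytes_per_write
    return {
        "avg_qps": round(avg_qps),
        "peak_qps": round(peak_qps),
        "storage_gb_per_day": round(storage_per_day / 1e9, 1),
    }

print(estimate(dau=10_000_000, requests_per_user=20,
               bytes_per_write=2_000, write_ratio=0.1))
# {'avg_qps': 2315, 'peak_qps': 4630, 'storage_gb_per_day': 40.0}
```

Exact numbers matter less than showing the interviewer you can reason from users to QPS and storage.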
## Key Vocabulary
| Term | Definition |
|---|---|
| API | Application Programming Interface — how systems talk to each other |
| API Contract | Defines request format and expected response shape |
| SLA | Service Level Agreement — uptime/latency guarantee |
| SLO | Service Level Objective — internal target (e.g., 99.9% uptime) |
| Throughput | Requests per second the system can handle |
| Latency | Time from request to response |
| Availability | % of time the system is operational (99.9% = ~8.8 hrs downtime/yr) |
| Consistency | All nodes see the same data at the same time |
| Partition Tolerance | System continues despite network splits (CAP theorem) |
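The availability percentages in the table convert to downtime budgets with simple arithmetic; a quick sketch for sanity-checking "nines" figures:

```python
# Converting an availability SLO into allowed downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.0, 99.9, 99.99, 99.999):
    print(f"{nines}% -> {downtime_minutes_per_year(nines):.1f} min/yr")
# 99.9% works out to ~525.6 min (~8.8 hrs); 99.999% to ~5.26 min.
```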
## Quality Pillars

### Scalability — Design for Growth
- Ability to handle increasing load by adding resources.
- Vertical scaling — bigger machine (more CPU/RAM). Simple but hits physical limits.
- Horizontal scaling — more machines. How internet giants operate, but requires designing for distribution from the start.
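Once you scale horizontally, fleet size becomes a calculation. A minimal sketch, assuming an illustrative per-server benchmark of 500 QPS and 30% spare headroom:

```python
# Sizing a horizontally scaled fleet for a target peak load.
import math

def servers_needed(peak_qps: float, qps_per_server: float,
                   headroom: float = 0.3) -> int:
    """Identical servers required, keeping `headroom` spare capacity."""
    usable = qps_per_server * (1 - headroom)
    return math.ceil(peak_qps / usable)

print(servers_needed(peak_qps=5_000, qps_per_server=500))  # 15
```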
### Reliability — Embrace Failure
- System continues to work correctly even when components fail — and they always will (drives, networks, power).
- Achieved via replication, redundancy, and graceful degradation.
- Goal: prevent failures from cascading into total collapse, not prevent all failures.
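Graceful degradation in code form: when a non-critical dependency fails, serve a precomputed fallback instead of propagating the error. `fetch_recommendations` is a hypothetical downstream call used only for illustration:

```python
# Graceful degradation: a stale/generic answer beats a cascading error.
FALLBACK = ["popular-item-1", "popular-item-2"]  # precomputed defaults

def fetch_recommendations(user_id: str) -> list[str]:
    # Stand-in for a real service call; here it always fails.
    raise TimeoutError("recommendation service unreachable")

def recommendations_with_fallback(user_id: str) -> list[str]:
    try:
        return fetch_recommendations(user_id)
    except (TimeoutError, ConnectionError):
        # Degrade: for this feature, generic results beat a 500 error.
        return FALLBACK

print(recommendations_with_fallback("u42"))  # falls back to the defaults
```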
### Availability — Minimize Downtime
- % of time the system is operational. "Five nines" (99.999%) = 5.26 min downtime/year.
- Requires: no single points of failure, load balancers to route around unhealthy instances, fast recovery.
- Trades off with consistency — during a partition, you may serve stale data rather than return errors.
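"Route around unhealthy instances" can be shown with a toy round-robin balancer. Health state here is a simple flag; real load balancers probe a health-check endpoint:

```python
# Minimal round-robin load balancer that skips unhealthy instances.
from itertools import cycle

class LoadBalancer:
    def __init__(self, servers):
        self.health = {s: True for s in servers}
        self._ring = cycle(servers)

    def mark_down(self, server):
        self.health[server] = False

    def next_server(self):
        # At most one full pass; with >=1 healthy server this terminates.
        for _ in range(len(self.health)):
            s = next(self._ring)
            if self.health[s]:
                return s
        raise RuntimeError("no healthy servers")

lb = LoadBalancer(["app1", "app2", "app3"])
lb.mark_down("app2")
print([lb.next_server() for _ in range(4)])
# ['app1', 'app3', 'app1', 'app3'] — app2 is routed around
```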
### Maintainability — Design for Change
- How easily engineers can understand, modify, and debug the system over time.
- Requires: clear interfaces, good observability (logging, metrics, tracing), modular architecture.
- Poor maintainability is why companies do expensive rewrites.
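One concrete observability habit: structured (JSON) logs carrying a request ID, so events can be correlated across services when debugging. Field names below are illustrative:

```python
# Structured logging sketch: machine-parseable events with request context.
import json
import time
import uuid

def log_event(event: str, **fields) -> str:
    """Emit one JSON log line; generates a request_id if none is given."""
    record = {
        "ts": time.time(),
        "event": event,
        "request_id": fields.pop("request_id", str(uuid.uuid4())),
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line

log_event("cache_miss", key="user:42", latency_ms=12.5, request_id="req-1")
```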
## CAP Theorem
A distributed system can guarantee at most 2 of 3:
- Consistency — every read gets the latest write
- Availability — every request gets a response
- Partition Tolerance — works despite network failures
In practice: P is non-negotiable → choose CP (banks) or AP (social media)
```mermaid
flowchart TD
subgraph P["Network Partition Occurs"]
Node1["Node 1 - Data v2"]
Split["❌ Network Split <br/> Nodes can't sync"]
Node2["Node 2 - Data v1"]
Client["Client Request"]
Node1 -.-> Split
Split -.-> Node2
end
subgraph B["Choice B: Prioritize Availability"]
CB_Client["Client"]
CB_Node2["Node 2"]
CB_Result["✓ High Availability <br/> ❌ Eventual Consistency <br/> Ex: Social media feeds"]
CB_Client -->|Read request| CB_Node2
CB_Node2 -->|Return v1 stale| CB_Client
end
subgraph A["Choice A: Prioritize Consistency"]
CA_Client["Client"]
CA_Node2["Node 2"]
CA_Result["✓ Strong Consistency <br/> ❌ Lower Availability <br/> Ex: Banking systems"]
CA_Client -->|Read request| CA_Node2
CA_Node2 -->|Return error| CA_Client
end
Client -.->|Forces choice| CB_Client
Client -.->|Forces choice| CA_Client
```
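The CP/AP choice in the diagram can be modeled in a few lines: during a partition the reachable replica holds stale data, and the system either refuses the read (CP) or answers with what it has (AP). This is a toy model, not any real database's protocol:

```python
# Toy model of the CP vs. AP read decision during a network partition.
class Partitioned(Exception):
    pass

def read(mode: str, replica_value: str, replica_is_stale: bool) -> str:
    if mode == "CP" and replica_is_stale:
        # Consistency first: refuse to serve data we can't verify is latest.
        raise Partitioned("cannot confirm latest write; returning error")
    # Availability first: answer with what we have, possibly stale.
    return replica_value

print(read("AP", "v1", replica_is_stale=True))  # 'v1' — stale but available
try:
    read("CP", "v1", replica_is_stale=True)
except Partitioned as err:
    print("CP:", err)  # banking-style systems prefer this error
```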
## Key Tradeoffs

### Consistency vs. Availability

| | Strong Consistency | Eventual Consistency |
|---|---|---|
| Guarantee | All reads return latest write | Replicas converge eventually |
| Latency | Higher (coordination needed) | Lower (serve any replica) |
| Use case | Banking, payments | Social feeds, caching, analytics |
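The "coordination needed" row has a standard formulation in quorum systems: with N replicas, writing to W of them and reading from R of them guarantees the read sees the latest write iff R + W > N (the read and write sets must overlap). A minimal check:

```python
# Quorum arithmetic: strong consistency requires overlapping read/write sets.
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """True iff any read quorum intersects any write quorum (R + W > N)."""
    return r + w > n

print(is_strongly_consistent(n=3, w=2, r=2))  # True: sets must overlap
print(is_strongly_consistent(n=3, w=1, r=1))  # False: may read a stale replica
```

Raising R and W buys consistency at the cost of latency, which is exactly the tradeoff in the table.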
### Latency vs. Throughput

| | Low Latency | High Throughput |
|---|---|---|
| Goal | Fast single request | Max requests/sec |
| Technique | Caching, in-memory, fewer hops | Batching, queuing |
| Use case | Gaming, video calls | Data pipelines, batch jobs |
Sometimes you need both — separate concerns into different components (e.g., Netflix: low latency for playback, high throughput for encoding).
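The batching row makes the tradeoff concrete: each flush pays a fixed overhead, so bigger batches amortize it (throughput up) while items wait longer in the queue (latency up). Overhead and cost figures here are illustrative:

```python
# Batching: throughput rises with batch size, but so does queueing wait.
FLUSH_OVERHEAD_MS = 10.0  # fixed cost per flush (e.g., a network round trip)
PER_ITEM_MS = 0.1         # marginal cost per item in the batch

def stats(batch_size: int, arrival_rate_per_ms: float):
    """(items processed per ms, mean ms an item waits to fill the batch)."""
    flush_ms = FLUSH_OVERHEAD_MS + batch_size * PER_ITEM_MS
    throughput = batch_size / flush_ms
    avg_wait = batch_size / (2 * arrival_rate_per_ms)  # mean fill-time wait
    return round(throughput, 2), round(avg_wait, 2)

for size in (1, 10, 100):
    print(size, stats(size, arrival_rate_per_ms=1.0))
# batch of 1: ~0.1 items/ms, 0.5 ms wait; batch of 100: 5.0 items/ms, 50 ms wait
```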
## Scalability Evolution
```mermaid
flowchart TD
subgraph M["1K Users: Monolith"]
M1["Single Server <br/> App + DB"]
M1_Users["~10 req/sec"]
end
subgraph V["10K Users: Vertical Scale + Cache"]
V1["Web Server <br/> Bigger instance"]
V2["Read Replica <br/> PostgreSQL"]
V3["Primary DB <br/> PostgreSQL"]
V4["Redis Cache"]
V1 --> V2
V1 --> V3
V1 --> V4
end
subgraph H["100K Users: Horizontal Scale"]
H1["Load Balancer"]
H2["App Server 1"]
H3["App Server 2"]
H4["App Server 3"]
H5["Redis Cluster"]
H6["Primary DB"]
H7["Read Replicas"]
H8["CDN <br/> Static Assets"]
H1 --> H2 & H3 & H4
H2 & H3 & H4 --> H5 & H6 & H7
end
subgraph D["1M+ Users: Distributed Systems"]
D1["Global CDN"]
D2["API Gateway"]
D3["Microservice 1"]
D4["Microservice 2"]
D5["Microservice 3"]
D6["Message Queue <br/> Kafka"]
D7["DB Shard 1"]
D8["DB Shard 2"]
D9["DB Shard 3"]
D10["Distributed Cache <br/> Multi-region"]
D1 --> D2
D2 --> D3 & D4 & D5
D3 & D4 & D5 --> D6
D3 --> D7
D4 --> D8
D5 --> D9
D3 & D4 & D5 --> D10
end
M1 -. "Scale up" .-> V1
V1 -. "Scale out" .-> H1
H1 -. "Distribute" .-> D1
```
| Scale | Pattern | Key Addition |
|---|---|---|
| 1K | Monolith | Single server |
| 10K | Vertical scale | Bigger machine + Read replica + Cache |
| 100K | Horizontal scale | Load balancer + Multiple app servers + CDN |
| 1M+ | Distributed | Microservices + Kafka + DB sharding + Multi-region cache |
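The DB sharding added at the 1M+ stage is often hash-based: a stable hash of the key picks the shard, so any app server can locate a user's data without a lookup table. Shard names below are illustrative:

```python
# Hash-based shard routing: the same key always maps to the same shard.
import hashlib

SHARDS = ["shard-1", "shard-2", "shard-3"]

def shard_for(key: str) -> str:
    # md5 is stable across processes and runs (unlike Python's built-in
    # hash(), which is randomized per process) — fine for routing, not crypto.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

print(shard_for("user:42"))
assert shard_for("user:42") == shard_for("user:42")  # deterministic routing
```

Note the tradeoff: simple modulo routing reshuffles most keys when the shard count changes, which is why larger systems move to consistent hashing.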
## Common Pitfalls
| Pitfall | Cause | How to Avoid |
|---|---|---|
| Premature optimisation | Designing for billions when you have thousands | Start simple, scale at actual limits. Instagram stayed a monolith until it had millions of users |
| Ignoring operational complexity | Elegant whiteboard design → nightmare in prod | Consider: who's on-call? how to debug? what's the blast radius? Discuss monitoring in interviews |
| Designing without requirements | Jumping to solutions without constraints | Spend the first 10 minutes clarifying: users, read/write ratio, latency needs, consistency needs |