
System Design Basics

HLD Process Flow

```mermaid
flowchart TD
    Requirements["Requirements <br/> Functional & Non-Functional"]
    HLD["High-Level Design <br/> Architecture & Components"]
    DataModel["Data Model & Flow <br/> Entities & Interactions"]
    NFR["Non-Functional Requirements <br/> Quality Attributes"]

    subgraph QP["Quality Pillars"]
        Scalability["Scalability <br/> Handle Growth"]
        Reliability["Reliability <br/> Correct Under Failure"]
        Availability["Availability <br/> Minimize Downtime"]
        Maintainability["Maintainability <br/> Easy to Change"]
    end

    Requirements --> HLD
    HLD --> DataModel
    DataModel --> NFR
    NFR --> Scalability
    NFR --> Reliability
    NFR --> Availability
    NFR --> Maintainability

    Scalability -. drives .-> HLD
    Reliability -. drives .-> HLD
    Availability -. drives .-> HLD
    Maintainability -. drives .-> HLD
```

Evaluation Criteria

Any system design solution should be evaluated on 3 criteria:

  1. Simplicity — Is it as simple as it needs to be?
  2. Fidelity — Does it cover all functional and non-functional requirements?
  3. Cost Effectiveness — Is the infra spend justified?

A software architect's core job is to solve the business problem, not to build the most sophisticated system.


Interview Framework

  1. Clarify requirements — functional (what it does) + non-functional (scale, latency, availability)
  2. Estimate scale — QPS, storage, bandwidth
  3. High-level design — components, data flow, APIs
  4. Deep dive — bottlenecks, tradeoffs, failure handling
  5. Wrap up — summarise decisions, open questions
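The estimation in step 2 is back-of-envelope math; a quick sketch (the 1M DAU, 10 requests/user/day, and 2x peak factor are illustrative assumptions, not figures from these notes):

```python
# Back-of-envelope traffic estimate for the "estimate scale" step.
SECONDS_PER_DAY = 86_400

def estimate_qps(dau: int, requests_per_user: int, peak_factor: float = 2.0) -> tuple[float, float]:
    """Return (average QPS, peak QPS) for a given daily load."""
    avg_qps = dau * requests_per_user / SECONDS_PER_DAY
    return avg_qps, avg_qps * peak_factor

avg, peak = estimate_qps(dau=1_000_000, requests_per_user=10)
# ~116 QPS average, ~231 QPS at peak
```

Interviewers care about the order of magnitude, not the third decimal: ~100 QPS and ~10,000 QPS lead to very different designs.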

Key Vocabulary

| Term | Definition |
| --- | --- |
| API | Application Programming Interface — how systems talk to each other |
| API Contract | Defines request format and expected response shape |
| SLA | Service Level Agreement — uptime/latency guarantee |
| SLO | Service Level Objective — internal target (e.g., 99.9% uptime) |
| Throughput | Requests per second the system can handle |
| Latency | Time from request to response |
| Availability | % of time the system is operational (99.9% = 8.7 hrs downtime/yr) |
| Consistency | All nodes see the same data at the same time |
| Partition Tolerance | System continues despite network splits (CAP theorem) |
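The availability numbers fall out of a one-line conversion; a quick sketch:

```python
# Convert an availability level into allowed downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year at a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

three_nines = downtime_minutes_per_year(0.999)   # ~526 min (~8.8 hours)
five_nines = downtime_minutes_per_year(0.99999)  # ~5.26 min
```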

Quality Pillars

Scalability — Design for Growth

  • Ability to handle increasing load by adding resources.
  • Vertical scaling — bigger machine (more CPU/RAM). Simple but hits physical limits.
  • Horizontal scaling — more machines. How internet giants operate, but requires designing for distribution from the start.
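The key requirement horizontal scaling imposes is statelessness: any server must be able to handle any request. A toy sketch, where a plain dict stands in for a shared store like Redis (names are hypothetical):

```python
# Shared session store: state lives here, not in any one server's memory.
store: dict[str, int] = {}

def handle_request(server_id: str, session_id: str) -> int:
    """Increment a per-session counter. Works from any server because
    the session state lives in the shared store."""
    store[session_id] = store.get(session_id, 0) + 1
    return store[session_id]

handle_request("app-1", "s1")  # 1
handle_request("app-2", "s1")  # 2: different server, same session state
```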

Reliability — Embrace Failure

  • System continues to work correctly even when components fail — and they always will (drives, networks, power).
  • Achieved via replication, redundancy, and graceful degradation.
  • Goal: prevent failures from cascading into total collapse, not prevent all failures.
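One common building block for graceful degradation is retry-with-backoff plus a fallback value; a minimal sketch (parameter names and defaults are illustrative):

```python
import random
import time

def call_with_retries(fn, attempts=3, base_delay=0.1, fallback=None):
    """Exponential backoff with jitter; on final failure, degrade
    gracefully to a fallback instead of propagating the error."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                return fallback  # graceful degradation, not a crash
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

Returning a cached or default value keeps one failing dependency from cascading into a full outage.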

Availability — Minimize Downtime

  • % of time the system is operational. "Five nines" (99.999%) = 5.26 min downtime/year.
  • Requires: no single points of failure, load balancers to route around unhealthy instances, fast recovery.
  • Trades off with consistency — during a partition, you may serve stale data rather than return errors.
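"Route around unhealthy instances" can be sketched as a health-aware round-robin balancer (a toy model, not a production LB):

```python
import itertools

class LoadBalancer:
    """Round-robin over instances, skipping any marked unhealthy,
    so one bad instance doesn't take the whole service down."""

    def __init__(self, instances):
        self.healthy = dict.fromkeys(instances, True)
        self._ring = itertools.cycle(instances)

    def mark(self, instance, is_healthy):
        self.healthy[instance] = is_healthy

    def pick(self):
        for _ in range(len(self.healthy)):
            candidate = next(self._ring)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy instances")
```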

Maintainability — Design for Change

  • How easily engineers can understand, modify, and debug the system over time.
  • Requires: clear interfaces, good observability (logging, metrics, tracing), modular architecture.
  • Poor maintainability is why companies do expensive rewrites.
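Good observability often starts with structured logs; a minimal JSON-lines sketch (field names are illustrative):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event: str, **context) -> str:
    """Emit one machine-parseable JSON line per event; structured
    context is what makes the system debuggable later."""
    line = json.dumps({"event": event, **context})
    logging.info(line)
    return line

log_event("order_placed", order_id="o-123", user_id="u-7", latency_ms=42)
```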

CAP Theorem

A distributed system can guarantee at most 2 of 3:

  • Consistency — every read gets the latest write
  • Availability — every request gets a response
  • Partition Tolerance — works despite network failures

In practice: P is non-negotiable → choose CP (banks) or AP (social media)
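The CP/AP split can be made concrete with a toy read path during a partition (a deliberately simplified model, not a real replication protocol):

```python
class Replica:
    def __init__(self, value, version):
        self.value, self.version = value, version

def read(replica: Replica, latest_version: int, mode: str):
    """During a partition: CP refuses a possibly-stale read (loses
    availability); AP answers anyway (loses consistency)."""
    if mode == "CP" and replica.version < latest_version:
        raise RuntimeError("unavailable: replica may be stale")
    return replica.value

node2 = Replica("v1", version=1)          # missed the v2 write
read(node2, latest_version=2, mode="AP")  # returns stale "v1"
# mode="CP" would raise instead of returning stale data
```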

```mermaid
flowchart TD
    subgraph P["Network Partition Occurs"]
        Node1["Node 1 - Data v2"]
        Split["❌ Network Split <br/> Nodes can't sync"]
        Node2["Node 2 - Data v1"]
        Client["Client Request"]
        Node1 -.-> Split
        Split -.-> Node2
    end

    subgraph A["Choice A: Prioritize Consistency"]
        CA_Client["Client"]
        CA_Node2["Node 2"]
        CA_Result["✓ Strong Consistency <br/> ❌ Lower Availability <br/> Ex: Banking systems"]
        CA_Client -->|Read request| CA_Node2
        CA_Node2 -->|Return error| CA_Client
    end

    subgraph B["Choice B: Prioritize Availability"]
        CB_Client["Client"]
        CB_Node2["Node 2"]
        CB_Result["✓ High Availability <br/> ❌ Eventual Consistency <br/> Ex: Social media feeds"]
        CB_Client -->|Read request| CB_Node2
        CB_Node2 -->|Return v1 stale| CB_Client
    end

    Client -.->|Forces choice| CA_Client
    Client -.->|Forces choice| CB_Client
```

Key Tradeoffs

Consistency vs. Availability

| | Strong Consistency | Eventual Consistency |
| --- | --- | --- |
| Guarantee | All reads return latest write | Replicas converge eventually |
| Latency | Higher (coordination needed) | Lower (serve any replica) |
| Use case | Banking, payments | Social feeds, caching, analytics |

Latency vs. Throughput

| | Low Latency | High Throughput |
| --- | --- | --- |
| Goal | Fast single request | Max requests/sec |
| Technique | Caching, in-memory, fewer hops | Batching, queuing |
| Use case | Gaming, video calls | Data pipelines, batch jobs |

Sometimes you need both — separate concerns into different components (e.g., Netflix: low latency for playback, high throughput for encoding).
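The tradeoff is easy to see with a toy cost model: fixed per-request overhead plus a small per-item cost (the 5 ms and 0.1 ms figures are assumptions for illustration):

```python
# Fixed overhead per request (network round trip, auth, etc.)
# plus a small marginal cost per item processed.
OVERHEAD_MS, PER_ITEM_MS = 5.0, 0.1

def total_time_ms(items: int, batch_size: int) -> float:
    """Total time to process `items`, sent in batches of `batch_size`."""
    batches = -(-items // batch_size)  # ceiling division
    return batches * OVERHEAD_MS + items * PER_ITEM_MS

one_at_a_time = total_time_ms(1_000, batch_size=1)  # 5100.0 ms
batched = total_time_ms(1_000, batch_size=100)      # 150.0 ms
```

Batching raises throughput 34x here, but any single item may now wait for its batch to fill: throughput up, latency up.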


Scalability Evolution

```mermaid
flowchart TD
    subgraph M["1K Users: Monolith"]
        M1["Single Server <br/> App + DB"]
        M1_Users["~10 req/sec"]
    end

    subgraph V["10K Users: Vertical Scale + Cache"]
        V1["Web Server <br/> Bigger instance"]
        V2["Read Replica <br/> PostgreSQL"]
        V3["Primary DB <br/> PostgreSQL"]
        V4["Redis Cache"]
        V1 --> V2
        V1 --> V3
        V1 --> V4
    end

    subgraph H["100K Users: Horizontal Scale"]
        H1["Load Balancer"]
        H2["App Server 1"]
        H3["App Server 2"]
        H4["App Server 3"]
        H5["Redis Cluster"]
        H6["Primary DB"]
        H7["Read Replicas"]
        H8["CDN <br/> Static Assets"]
        H1 --> H2 & H3 & H4
        H2 & H3 & H4 --> H5 & H6 & H7
    end

    subgraph D["1M+ Users: Distributed Systems"]
        D1["Global CDN"]
        D2["API Gateway"]
        D3["Microservice 1"]
        D4["Microservice 2"]
        D5["Microservice 3"]
        D6["Message Queue <br/> Kafka"]
        D7["DB Shard 1"]
        D8["DB Shard 2"]
        D9["DB Shard 3"]
        D10["Distributed Cache <br/> Multi-region"]
        D1 --> D2
        D2 --> D3 & D4 & D5
        D3 & D4 & D5 --> D6
        D3 --> D7
        D4 --> D8
        D5 --> D9
        D3 & D4 & D5 --> D10
    end

    M1 -. "Scale up" .-> V1
    V1 -. "Scale out" .-> H1
    H1 -. "Distribute" .-> D1
```

| Scale | Pattern | Key Addition |
| --- | --- | --- |
| 1K | Monolith | Single server |
| 10K | Vertical scale | Bigger machine + Read replica + Cache |
| 100K | Horizontal scale | Load balancer + Multiple app servers + CDN |
| 1M+ | Distributed | Microservices + Kafka + DB sharding + Multi-region cache |
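The DB sharding step at the 1M+ stage can be sketched as deterministic hash routing (shard names are hypothetical):

```python
import hashlib

SHARDS = ["shard-1", "shard-2", "shard-3"]  # hypothetical shard names

def shard_for(user_id: str) -> str:
    """Hash-based shard routing: a given user's rows always live on
    the same shard, so single-user queries hit exactly one database."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]
```

The catch: changing `len(SHARDS)` remaps most keys, which is why resharding is painful and why schemes like consistent hashing exist.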

Common Pitfalls

| Pitfall | Cause | How to Avoid |
| --- | --- | --- |
| Premature optimisation | Designing for billions when you have thousands | Start simple, scale at actual limits. Instagram stayed a monolith until millions of users |
| Ignoring operational complexity | Elegant whiteboard → nightmare in prod | Consider: who's on-call? how to debug? blast radius? Discuss monitoring in interviews |
| Designing without requirements | Jumping to solutions without constraints | Spend the first 10 min clarifying: users, read/write ratio, latency needs, consistency needs |