
System Design Basics

HLD Process Flow

```mermaid
flowchart TD
    Requirements["Requirements <br/> Functional & Non-Functional"]
    HLD["High-Level Design <br/> Architecture & Components"]
    DataModel["Data Model & Flow <br/> Entities & Interactions"]
    NFR["Non-Functional Requirements <br/> Quality Attributes"]

    subgraph QP["Quality Pillars"]
        Scalability["Scalability <br/> Handle Growth"]
        Reliability["Reliability <br/> Correct Under Failure"]
        Availability["Availability <br/> Minimize Downtime"]
        Maintainability["Maintainability <br/> Easy to Change"]
    end

    Requirements --> HLD
    HLD --> DataModel
    DataModel --> NFR
    NFR --> Scalability
    NFR --> Reliability
    NFR --> Availability
    NFR --> Maintainability

    Scalability -. drives .-> HLD
    Reliability -. drives .-> HLD
    Availability -. drives .-> HLD
    Maintainability -. drives .-> HLD
```

Evaluation Criteria

Any system design solution should be evaluated on 3 criteria:

  1. Simplicity — Is it as simple as it needs to be?
  2. Fidelity — Does it cover all functional and non-functional requirements?
  3. Cost Effectiveness — Is the infra spend justified?

A software architect's core job is to solve the business problem, not to build the most sophisticated system.


Interview Framework

  1. Clarify requirements — functional (what it does) + non-functional (scale, latency, availability)
  2. Estimate scale — QPS, storage, bandwidth
  3. High-level design — components, data flow, APIs
  4. Deep dive — bottlenecks, tradeoffs, failure handling
  5. Wrap up — summarise decisions, open questions
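The estimation in step 2 is back-of-envelope math; a quick sketch (the 1M DAU, 10 requests/user/day, and 2x peak factor are illustrative assumptions, not figures from these notes):

```python
# Back-of-envelope traffic estimate for the "estimate scale" step.
SECONDS_PER_DAY = 86_400

def estimate_qps(dau: int, requests_per_user: int, peak_factor: float = 2.0) -> tuple[float, float]:
    """Return (average QPS, peak QPS) for a given daily load."""
    avg_qps = dau * requests_per_user / SECONDS_PER_DAY
    return avg_qps, avg_qps * peak_factor

avg, peak = estimate_qps(dau=1_000_000, requests_per_user=10)
# ~116 QPS average, ~231 QPS at peak
```

Interviewers care about the order of magnitude, not the third decimal: ~100 QPS and ~10,000 QPS lead to very different designs.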

Key Vocabulary

| Term | Definition |
| --- | --- |
| API | Application Programming Interface — how systems talk to each other |
| API Contract | Defines request format and expected response shape |
| SLA | Service Level Agreement — uptime/latency guarantee |
| SLO | Service Level Objective — internal target (e.g., 99.9% uptime) |
| Throughput | Requests per second the system can handle |
| Latency | Time from request to response |
| Availability | % of time the system is operational (99.9% = 8.7 hrs downtime/yr) |
| Consistency | All nodes see the same data at the same time |
| Partition Tolerance | System continues despite network splits (CAP theorem) |
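The availability numbers fall out of a one-line conversion; a quick sketch:

```python
# Convert an availability level into allowed downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960

def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year at a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

three_nines = downtime_minutes_per_year(0.999)   # ~526 min (~8.8 hours)
five_nines = downtime_minutes_per_year(0.99999)  # ~5.26 min
```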

Quality Pillars

Scalability — Design for Growth

  • Ability to handle increasing load by adding resources.
  • Vertical scaling — bigger machine (more CPU/RAM). Simple but hits physical limits.
  • Horizontal scaling — more machines. How internet giants operate, but requires designing for distribution from the start.
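The key requirement horizontal scaling imposes is statelessness: any server must be able to handle any request. A toy sketch, where a plain dict stands in for a shared store like Redis (names are hypothetical):

```python
# Shared session store: state lives here, not in any one server's memory.
store: dict[str, int] = {}

def handle_request(server_id: str, session_id: str) -> int:
    """Increment a per-session counter. Works from any server because
    the session state lives in the shared store."""
    store[session_id] = store.get(session_id, 0) + 1
    return store[session_id]

handle_request("app-1", "s1")  # 1
handle_request("app-2", "s1")  # 2: different server, same session state
```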

Reliability — Embrace Failure

  • System continues to work correctly even when components fail — and they always will (drives, networks, power).
  • Achieved via replication, redundancy, and graceful degradation.
  • Goal: prevent failures from cascading into total collapse, not prevent all failures.
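One common building block for graceful degradation is retry-with-backoff plus a fallback value; a minimal sketch (parameter names and defaults are illustrative):

```python
import random
import time

def call_with_retries(fn, attempts=3, base_delay=0.1, fallback=None):
    """Exponential backoff with jitter; on final failure, degrade
    gracefully to a fallback instead of propagating the error."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                return fallback  # graceful degradation, not a crash
            time.sleep(base_delay * (2 ** attempt) * random.random())
```

Returning a cached or default value keeps one failing dependency from cascading into a full outage.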

Availability — Minimize Downtime

  • % of time the system is operational. "Five nines" (99.999%) = 5.26 min downtime/year.
  • Requires: no single points of failure, load balancers to route around unhealthy instances, fast recovery.
  • Trades off with consistency — during a partition, you may serve stale data rather than return errors.
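"Route around unhealthy instances" can be sketched as a health-aware round-robin balancer (a toy model, not a production LB):

```python
import itertools

class LoadBalancer:
    """Round-robin over instances, skipping any marked unhealthy,
    so one bad instance doesn't take the whole service down."""

    def __init__(self, instances):
        self.healthy = dict.fromkeys(instances, True)
        self._ring = itertools.cycle(instances)

    def mark(self, instance, is_healthy):
        self.healthy[instance] = is_healthy

    def pick(self):
        for _ in range(len(self.healthy)):
            candidate = next(self._ring)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy instances")
```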

Maintainability — Design for Change

  • How easily engineers can understand, modify, and debug the system over time.
  • Requires: clear interfaces, good observability (logging, metrics, tracing), modular architecture.
  • Poor maintainability is why companies do expensive rewrites.
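Good observability often starts with structured logs; a minimal JSON-lines sketch (field names are illustrative):

```python
import json
import logging

logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event: str, **context) -> str:
    """Emit one machine-parseable JSON line per event; structured
    context is what makes the system debuggable later."""
    line = json.dumps({"event": event, **context})
    logging.info(line)
    return line

log_event("order_placed", order_id="o-123", user_id="u-7", latency_ms=42)
```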

CAP Theorem

A distributed system can guarantee at most 2 of 3:

  • Consistency — every read gets the latest write
  • Availability — every request gets a response
  • Partition Tolerance — works despite network failures

In practice: P is non-negotiable → choose CP (banks) or AP (social media)
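The CP/AP split can be made concrete with a toy read path during a partition (a deliberately simplified model, not a real replication protocol):

```python
class Replica:
    def __init__(self, value, version):
        self.value, self.version = value, version

def read(replica: Replica, latest_version: int, mode: str):
    """During a partition: CP refuses a possibly-stale read (loses
    availability); AP answers anyway (loses consistency)."""
    if mode == "CP" and replica.version < latest_version:
        raise RuntimeError("unavailable: replica may be stale")
    return replica.value

node2 = Replica("v1", version=1)          # missed the v2 write
read(node2, latest_version=2, mode="AP")  # returns stale "v1"
# mode="CP" would raise instead of returning stale data
```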

```mermaid
flowchart TD
    subgraph P["Network Partition Occurs"]
        Node1["Node 1 - Data v2"]
        Split["❌ Network Split <br/> Nodes can't sync"]
        Node2["Node 2 - Data v1"]
        Client["Client Request"]
        Node1 -.-> Split
        Split -.-> Node2
    end

    subgraph A["Choice A: Prioritize Consistency"]
        CA_Client["Client"]
        CA_Node2["Node 2"]
        CA_Result["✓ Strong Consistency <br/> ❌ Lower Availability <br/> Ex: Banking systems"]
        CA_Client -->|Read request| CA_Node2
        CA_Node2 -->|Return error| CA_Client
    end

    subgraph B["Choice B: Prioritize Availability"]
        CB_Client["Client"]
        CB_Node2["Node 2"]
        CB_Result["✓ High Availability <br/> ❌ Eventual Consistency <br/> Ex: Social media feeds"]
        CB_Client -->|Read request| CB_Node2
        CB_Node2 -->|Return v1 stale| CB_Client
    end

    Client -.->|Forces choice| CA_Client
    Client -.->|Forces choice| CB_Client
```

Key Tradeoffs

Consistency vs. Availability

| | Strong Consistency | Eventual Consistency |
| --- | --- | --- |
| Guarantee | All reads return latest write | Replicas converge eventually |
| Latency | Higher (coordination needed) | Lower (serve any replica) |
| Use case | Banking, payments | Social feeds, caching, analytics |

Latency vs. Throughput

| | Low Latency | High Throughput |
| --- | --- | --- |
| Goal | Fast single request | Max requests/sec |
| Technique | Caching, in-memory, fewer hops | Batching, queuing |
| Use case | Gaming, video calls | Data pipelines, batch jobs |

Sometimes you need both — separate concerns into different components (e.g., Netflix: low latency for playback, high throughput for encoding).
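The tradeoff is easy to see with a toy cost model: fixed per-request overhead plus a small per-item cost (the 5 ms and 0.1 ms figures are assumptions for illustration):

```python
# Fixed overhead per request (network round trip, auth, etc.)
# plus a small marginal cost per item processed.
OVERHEAD_MS, PER_ITEM_MS = 5.0, 0.1

def total_time_ms(items: int, batch_size: int) -> float:
    """Total time to process `items`, sent in batches of `batch_size`."""
    batches = -(-items // batch_size)  # ceiling division
    return batches * OVERHEAD_MS + items * PER_ITEM_MS

one_at_a_time = total_time_ms(1_000, batch_size=1)  # 5100.0 ms
batched = total_time_ms(1_000, batch_size=100)      # 150.0 ms
```

Batching raises throughput 34x here, but any single item may now wait for its batch to fill: throughput up, latency up.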


Scalability Evolution

```mermaid
flowchart TD
    subgraph M["1K Users: Monolith"]
        M1["Single Server <br/> App + DB"]
        M1_Users["~10 req/sec"]
    end

    subgraph V["10K Users: Vertical Scale + Cache"]
        V1["Web Server <br/> Bigger instance"]
        V2["Read Replica <br/> PostgreSQL"]
        V3["Primary DB <br/> PostgreSQL"]
        V4["Redis Cache"]
        V1 --> V2
        V1 --> V3
        V1 --> V4
    end

    subgraph H["100K Users: Horizontal Scale"]
        H1["Load Balancer"]
        H2["App Server 1"]
        H3["App Server 2"]
        H4["App Server 3"]
        H5["Redis Cluster"]
        H6["Primary DB"]
        H7["Read Replicas"]
        H8["CDN <br/> Static Assets"]
        H1 --> H2 & H3 & H4
        H2 & H3 & H4 --> H5 & H6 & H7
    end

    subgraph D["1M+ Users: Distributed Systems"]
        D1["Global CDN"]
        D2["API Gateway"]
        D3["Microservice 1"]
        D4["Microservice 2"]
        D5["Microservice 3"]
        D6["Message Queue <br/> Kafka"]
        D7["DB Shard 1"]
        D8["DB Shard 2"]
        D9["DB Shard 3"]
        D10["Distributed Cache <br/> Multi-region"]
        D1 --> D2
        D2 --> D3 & D4 & D5
        D3 & D4 & D5 --> D6
        D3 --> D7
        D4 --> D8
        D5 --> D9
        D3 & D4 & D5 --> D10
    end

    M1 -. "Scale up" .-> V1
    V1 -. "Scale out" .-> H1
    H1 -. "Distribute" .-> D1
```

| Scale | Pattern | Key Addition |
| --- | --- | --- |
| 1K | Monolith | Single server |
| 10K | Vertical scale | Bigger machine + Read replica + Cache |
| 100K | Horizontal scale | Load balancer + Multiple app servers + CDN |
| 1M+ | Distributed | Microservices + Kafka + DB sharding + Multi-region cache |
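The DB sharding step at the 1M+ stage can be sketched as deterministic hash routing (shard names are hypothetical):

```python
import hashlib

SHARDS = ["shard-1", "shard-2", "shard-3"]  # hypothetical shard names

def shard_for(user_id: str) -> str:
    """Hash-based shard routing: a given user's rows always live on
    the same shard, so single-user queries hit exactly one database."""
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]
```

The catch: changing `len(SHARDS)` remaps most keys, which is why resharding is painful and why schemes like consistent hashing exist.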

Common Pitfalls

| Pitfall | Cause | How to Avoid |
| --- | --- | --- |
| Premature optimisation | Designing for billions when you have thousands | Start simple, scale at actual limits. Instagram stayed a monolith until millions of users |
| Ignoring operational complexity | Elegant whiteboard → nightmare in prod | Consider: who's on-call? how to debug? blast radius? Discuss monitoring in interviews |
| Designing without requirements | Jumping to solutions without constraints | Spend the first 10 min clarifying: users, read/write ratio, latency needs, consistency needs |