```mermaid
graph TD
    Data[Big Data] --> V1[Volume<br/>How much data]
    Data --> V2[Velocity<br/>How fast it arrives]
    Data --> V3[Variety<br/>How many types]
    style Data fill:#dbeafe,stroke:#3b82f6
    style V1 fill:#dcfce7,stroke:#16a34a
    style V2 fill:#fef9c3,stroke:#ca8a04
    style V3 fill:#f3e8ff,stroke:#9333ea
```
- **Volume** — the size of data an organization deals with at any given time
- **Velocity** — the speed at which new data is generated, collected, and processed
- **Variety** — the different types, structures, and sources of data
## Data Storage Architectures

### Data Warehouse
- Centralized repository optimized for analysis
- Data is cleaned, transformed, and loaded (ETL)
- Typically uses a star or snowflake schema
- Optimized for read-heavy, complex queries
- Examples: Amazon Redshift, Google BigQuery, Azure Synapse Analytics (formerly Azure SQL DW)
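The star-schema idea can be sketched with an in-memory SQLite database standing in for the warehouse — a hypothetical sales fact table joined to a product dimension (all table and column names here are illustrative, not from any real system):

```python
import sqlite3

# Toy star schema: one central fact table referencing a dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "books"), (2, "games")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 9.99), (1, 14.50), (2, 59.00)])

# The typical warehouse workload: a read-heavy aggregate over a join.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)
```

Dimension tables stay small and descriptive; the fact table grows with every event, which is why warehouses optimize so aggressively for this join-and-aggregate pattern.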
### Data Lake
- Stores vast amounts of raw data in its native format
- No predefined schema — schema is applied on read
- Supports batch, real-time, and stream processing
- Examples: AWS S3, Azure Data Lake, HDFS
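Schema-on-read is the key contrast with a warehouse: records land in the lake exactly as produced, and structure is imposed only at query time. A minimal sketch, using a temp directory as a stand-in lake (the field names are made up for illustration):

```python
import json
import tempfile
from pathlib import Path

# Toy "lake": raw records are dumped as-is, with no upfront schema.
lake = Path(tempfile.mkdtemp()) / "events"
lake.mkdir()
raw = [
    {"user": "ana", "clicks": 3},
    {"user": "bo"},                 # missing field: stored anyway
    {"user": "cy", "clicks": "7"},  # wrong type: stored anyway
]
for i, record in enumerate(raw):
    (lake / f"{i}.json").write_text(json.dumps(record))

# Schema on read: types and defaults are imposed only when querying.
def read_clicks(path: Path) -> int:
    record = json.loads(path.read_text())
    return int(record.get("clicks", 0))  # coerce type, default missing

total = sum(read_clicks(p) for p in sorted(lake.glob("*.json")))
print(total)  # 3 + 0 + 7
```

Nothing was rejected at write time; the messy records only had to be reconciled when someone actually asked a question of the data.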
### Data Lakehouse
- Hybrid — combines the best of the data lake and the data warehouse
- Supports both structured and unstructured data
- Schema on write AND schema on read
- ACID transactions via open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi
- Examples: AWS Lake Formation (S3 + Redshift Spectrum), Delta Lake
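How do lakehouse table formats get ACID guarantees on top of plain files? Roughly, via a transaction log: data files only become visible once a numbered commit file records them. A toy sketch of that idea (illustrative only — not the real Delta Lake on-disk protocol):

```python
import json
import tempfile
from pathlib import Path

# Toy transaction log: a data file counts only once a numbered commit
# file under _log/ lists it, so readers never see partial writes.
table = Path(tempfile.mkdtemp()) / "tbl"
(table / "_log").mkdir(parents=True)

def commit(version: int, rows: list) -> None:
    data = table / f"part-{version}.json"
    data.write_text(json.dumps(rows))               # 1. write data file
    log = table / "_log" / f"{version:020d}.json"
    log.write_text(json.dumps({"add": data.name}))  # 2. publish via log

def snapshot() -> list:
    rows = []
    for log in sorted((table / "_log").glob("*.json")):
        entry = json.loads(log.read_text())
        rows += json.loads((table / entry["add"]).read_text())
    return rows

commit(0, [{"id": 1}])
commit(1, [{"id": 2}])
# An uncommitted data file is invisible to readers:
(table / "part-99.json").write_text(json.dumps([{"id": 99}]))
print(snapshot())  # only the committed rows
```

Real formats add concurrency control, schema enforcement, and time travel on top of this log, but the visible-only-after-commit rule is the core of the ACID claim.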
### Data Mesh
- Coined by Zhamak Dehghani in 2019 — about governance and organization, not technology
- Individual teams own data products within their domain
- Domain-based data management
- Federated governance with central standards
- Self-service tooling and infrastructure
- Data lakes, warehouses, and S3 can all be components of a mesh
## Warehouse vs Lake vs Lakehouse
```mermaid
graph LR
    subgraph warehouse["Data Warehouse"]
        W1[Structured only]
        W2[Schema on Write]
        W3[ETL]
        W4[Fast complex queries]
        W5[High cost]
    end
    subgraph lake["Data Lake"]
        L1[All data types]
        L2[Schema on Read]
        L3[ELT or just store]
        L4[Flexible & scalable]
        L5[Low storage cost]
    end
    subgraph lakehouse["Data Lakehouse"]
        LH1[All data types]
        LH2[Both schemas]
        LH3[ACID transactions]
        LH4[Analytics + ML]
        LH5[Best of both]
    end
    style warehouse fill:#fff7ed,stroke:#f97316
    style lake fill:#f0fdf4,stroke:#22c55e
    style lakehouse fill:#faf5ff,stroke:#a855f7
```
| | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Schema | Schema on write (ETL) | Schema on read (ELT) | Both |
| Data type | Structured only | All types | All types |
| Agility | Less flexible | More flexible | Most flexible |
| Processing | ETL | ELT / just store | ETL + ELT |
| Cost | High (query optimization) | Low storage, high processing | Balanced |
| Use when | Structured data, fast analytics | Mixed data, massive scale | Need both analytics + ML |
## ETL Pipelines
```mermaid
graph LR
    Sources[Data Sources<br/>DBs, APIs, Files] -->|Extract| E[Extract<br/>Raw Data]
    E -->|Transform| T[Transform<br/>Clean, Enrich, Format]
    T -->|Load| L[Load<br/>Data Warehouse / Lake]
    style Sources fill:#dbeafe,stroke:#3b82f6
    style E fill:#dcfce7,stroke:#16a34a
    style T fill:#fef9c3,stroke:#ca8a04
    style L fill:#f3e8ff,stroke:#9333ea
```
### Extract
- Retrieve raw data from source systems (databases, CRMs, flat files, APIs)
- Can run in real time or in batches

### Transform
- Data cleansing — remove duplicates, fix errors
- Data enrichment — add data from other sources
- Format changes — date formatting, string manipulation
- Aggregations — calculate totals, averages
- Handle missing values, encoding/decoding

### Load
- Move the transformed data into the target warehouse or lake
- Batch (all at once) or streaming (as data arrives)
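The three steps above can be strung together as a minimal batch pipeline — extract from a CSV source, transform (dedupe, normalize dates, default missing values), and load into SQLite as a stand-in warehouse. The source data, column names, and date formats are invented for the sketch:

```python
import csv
import io
import sqlite3
from datetime import datetime

# --- Extract: read raw rows from a CSV "source system" ---
source = io.StringIO(
    "order_id,date,amount\n"
    "1,2024/01/05,10.0\n"
    "1,2024/01/05,10.0\n"   # duplicate to be removed
    "2,2024-01-06,\n"       # missing amount -> default to 0
)
rows = list(csv.DictReader(source))

# --- Transform: dedupe, normalize dates, handle missing values ---
def clean(row):
    date = row["date"].replace("/", "-")     # normalize date format
    datetime.strptime(date, "%Y-%m-%d")      # validate (raises if bad)
    amount = float(row["amount"] or 0)       # missing value -> 0.0
    return (int(row["order_id"]), date, amount)

cleaned = sorted({clean(r) for r in rows})   # the set removes duplicates

# --- Load: batch-insert into the target (SQLite here) ---
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INT, date TEXT, amount REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?, ?)", cleaned)
count, total = db.execute(
    "SELECT COUNT(*), SUM(amount) FROM orders").fetchone()
print(count, total)  # 2 rows loaded, 10.0 total
```

A streaming variant would run `clean` and the insert per record as it arrives instead of over the whole batch; the extract/transform/load boundaries stay the same.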