Data Engineering Fundamentals

Types of Data

| Type | Definition | Examples |
|---|---|---|
| Structured | Organized in a defined schema | DB tables, CSV, Excel spreadsheets |
| Semi-Structured | Some structure via tags/hierarchies | JSON, XML, log files, email headers |
| Unstructured | No predefined structure | Videos, images, audio, emails, text files |

Properties of Data — The 3 V's

graph TD
    Data[Big Data] --> V1[Volume<br/>How much data]
    Data --> V2[Velocity<br/>How fast it arrives]
    Data --> V3[Variety<br/>How many types]

    style Data fill:#dbeafe,stroke:#3b82f6
    style V1 fill:#dcfce7,stroke:#16a34a
    style V2 fill:#fef9c3,stroke:#ca8a04
    style V3 fill:#f3e8ff,stroke:#9333ea
  • Volume — the amount of data an organization stores and processes at any given time
  • Velocity — the speed at which new data is generated, collected, and processed
  • Variety — the different types, structures, and sources of data

Data Storage Architectures

Data Warehouse

  • Centralized repository optimized for analysis
  • Data is extracted, cleaned/transformed, and loaded (ETL) before it is stored
  • Typically uses star or snowflake schema
  • Optimized for read-heavy, complex queries
  • Examples: Amazon Redshift, Google BigQuery, Azure Synapse Analytics (formerly Azure SQL DW)

Data Lake

  • Stores vast amounts of raw data in native format
  • No predefined schema — schema applied on read
  • Supports batch, real-time, and stream processing
  • Examples: AWS S3, Azure Data Lake, HDFS

Data Lakehouse

  • Hybrid — combines best of Data Lake + Data Warehouse
  • Supports both structured and unstructured data
  • Schema on write AND schema on read
  • ACID transactions via open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi
  • Examples: AWS Lake Formation (S3 + Redshift Spectrum), Delta Lake

Data Mesh

  • Coined in 2019 by Zhamak Dehghani — about governance and organization, not technology
  • Individual teams own data products within their domain
  • Domain-based data management
  • Federated governance with central standards
  • Self-service tooling and infra
  • Data lakes/warehouses/S3 can all be part of it

Warehouse vs Lake vs Lakehouse

graph LR
    subgraph warehouse["Data Warehouse"]
        W1[Structured only]
        W2[Schema on Write]
        W3[ETL]
        W4[Fast complex queries]
        W5[High cost]
    end

    subgraph lake["Data Lake"]
        L1[All data types]
        L2[Schema on Read]
        L3[ELT or just store]
        L4[Flexible & scalable]
        L5[Low storage cost]
    end

    subgraph lakehouse["Data Lakehouse"]
        LH1[All data types]
        LH2[Both schemas]
        LH3[ACID transactions]
        LH4[Analytics + ML]
        LH5[Best of both]
    end

    style warehouse fill:#fff7ed,stroke:#f97316
    style lake fill:#f0fdf4,stroke:#22c55e
    style lakehouse fill:#faf5ff,stroke:#a855f7

|  | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Schema | Schema on Write (ETL) | Schema on Read (ELT) | Both |
| Data Type | Structured only | All types | All types |
| Agility | Less flexible | More flexible | Most flexible |
| Processing | ETL | ELT / just store | ETL + ELT |
| Cost | High (query optimization) | Low storage, high processing | Balanced |
| Use When | Structured data, fast analytics | Mixed data, massive scale | Need both analytics + ML |
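The schema-on-write vs schema-on-read split in the table above can be sketched in a few lines of Python. This is an illustrative toy only — the record fields and helper names are invented, not part of any warehouse or lake API:

```python
import json

# Schema on Write (warehouse style): shape and validate BEFORE storing.
SCHEMA = {"user_id": int, "amount": float}

def write_validated(store: list, record: dict) -> None:
    # Reject records that don't match the schema at load time.
    if set(record) != set(SCHEMA):
        raise ValueError("schema mismatch")
    store.append({k: SCHEMA[k](v) for k, v in record.items()})

# Schema on Read (lake style): store raw text, apply structure at query time.
def read_with_schema(raw_lines: list) -> list:
    return [json.loads(line) for line in raw_lines]

warehouse = []
write_validated(warehouse, {"user_id": "7", "amount": "19.99"})

lake_raw = ['{"user_id": 7, "amount": 19.99, "extra_field": "ok"}']
records = read_with_schema(lake_raw)  # extra fields survive; no upfront check
```

The trade-off is visible in the last line: the lake happily keeps a field the warehouse would have rejected, which is exactly why lakes are more flexible and warehouses more predictable.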

ETL Pipelines

graph LR
    Sources[Data Sources<br/>DBs, APIs, Files] -->|Extract| E[Extract<br/>Raw Data]
    E -->|Transform| T[Transform<br/>Clean, Enrich, Format]
    T -->|Load| L[Load<br/>Data Warehouse / Lake]

    style Sources fill:#dbeafe,stroke:#3b82f6
    style E fill:#dcfce7,stroke:#16a34a
    style T fill:#fef9c3,stroke:#ca8a04
    style L fill:#f3e8ff,stroke:#9333ea

Extract

  • Retrieve raw data from source systems (DBs, CRMs, flat files, APIs)
  • Can be real-time or batch
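The batch vs real-time distinction often shows up as full vs incremental extraction. A minimal sketch, assuming a high-watermark column (the row shape and function names are made up):

```python
rows = [
    {"id": 1, "updated": "2024-03-01"},
    {"id": 2, "updated": "2024-03-05"},
]

def full_extract(source):
    # Batch: pull everything on every run.
    return list(source)

def incremental_extract(source, watermark):
    # Pull only rows changed since the last successful run
    # (ISO date strings compare correctly as plain strings).
    return [r for r in source if r["updated"] > watermark]

delta = incremental_extract(rows, "2024-03-02")  # only the row with id 2
```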

Transform

  • Data cleansing — remove duplicates, fix errors
  • Data enrichment — add data from other sources
  • Format changes — date formatting, string manipulation
  • Aggregations — calculate totals, averages
  • Handle missing values, encoding/decoding

Load

  • Move transformed data to target warehouse/lake
  • Batch (all at once) or streaming (as data arrives)
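The Extract → Transform → Load steps above fit in one small end-to-end toy. All names are invented, an inline CSV stands in for a source system, and in-memory SQLite stands in for the target warehouse:

```python
import csv
import io
import sqlite3

# Extract: pull raw rows from a source (inline CSV standing in for a flat file).
raw = "id,name,signup\n1, Alice ,2024-01-05\n1, Alice ,2024-01-05\n2,bob,2024-02-10\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim whitespace, normalize case, drop duplicate ids.
seen, clean = set(), []
for r in rows:
    r = {k: v.strip() for k, v in r.items()}
    r["name"] = r["name"].title()
    if r["id"] not in seen:
        seen.add(r["id"])
        clean.append(r)

# Load: batch-insert the transformed rows into the target store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, signup TEXT)")
db.executemany("INSERT INTO users VALUES (:id, :name, :signup)", clean)
count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]  # 2 after dedup
```

A production pipeline would add error handling, retries, and incremental watermarks, but the three phases keep the same shape.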

AWS ETL Services

| Service | Use |
|---|---|
| AWS Glue | Managed ETL service, auto schema discovery |
| AWS Glue Workflows | Orchestrate multi-step ETL jobs |
| EventBridge | Event-driven pipeline triggers |
| AWS MWAA | Managed Apache Airflow for complex orchestration |
| AWS Lambda | Lightweight, event-driven transforms |
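Of the services above, Lambda is the easiest to sketch: a handler receives an event, applies a small transform, and returns the result for the next stage. This is a hedged sketch — the event shape and field names are assumptions for illustration, not a real AWS event contract:

```python
import json

def handler(event, context=None):
    # Lightweight, event-driven transform: parse each incoming record,
    # keep only the fields downstream consumers need, and count the batch.
    records = [json.loads(r) for r in event.get("records", [])]
    cleaned = [
        {"id": r["id"], "amount": round(float(r["amount"]), 2)}
        for r in records
        if "id" in r and "amount" in r
    ]
    return {"count": len(cleaned), "records": cleaned}

# Local invocation with a fake event (no AWS account required):
result = handler({"records": ['{"id": 1, "amount": "19.999"}', '{"id": 2}']})
```

Because the handler is a pure function of its event, it can be unit-tested locally before it is ever wired to EventBridge or a Glue workflow.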

Data Sources

| Source | Full Name | Notes |
|---|---|---|
| JDBC | Java Database Connectivity | Platform independent, language dependent |
| ODBC | Open Database Connectivity | Platform dependent, language independent |
| Raw Logs | | Server, app, access logs |
| APIs | | REST/GraphQL responses |
| Streams | | Kafka, Kinesis real-time feeds |

Data Formats

graph TD
    Formats[Data Formats] --> Text[Text-Based]
    Formats --> Binary[Binary]

    Text --> CSV[CSV<br/>Tabular, human-readable<br/>Small-medium datasets]
    Text --> JSON[JSON<br/>Key-value, semi-structured<br/>APIs, web, NoSQL]

    Binary --> Avro[Avro<br/>Row-based + schema<br/>Kafka, streaming, schema evolution]
    Binary --> Parquet[Parquet<br/>Columnar, compressed<br/>Analytics, Spark, Redshift]

    style Formats fill:#dbeafe,stroke:#3b82f6
    style Text fill:#dcfce7,stroke:#16a34a
    style Binary fill:#fef9c3,stroke:#ca8a04

| Format | Type | Best For | Systems |
|---|---|---|---|
| CSV | Text, row | Small datasets, human-readable, import/export | Excel, Pandas, SQL DBs, ETL tools |
| JSON | Text, key-value | APIs, web, flexible schema, nested data | REST APIs, MongoDB, JS/Python |
| Avro | Binary, row + schema | Streaming, schema evolution, Kafka transport | Kafka, Spark, Flink, Hadoop |
| Parquet | Binary, columnar | Analytics, reading specific columns, large datasets | Spark, Hive, Redshift Spectrum, Athena |
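The two text-based formats in the table are easy to compare with the standard library alone (Parquet and Avro are binary formats and need third-party libraries such as pyarrow or fastavro, so they are only mentioned here):

```python
import csv
import io
import json

records = [{"city": "Oslo", "temp": 3}, {"city": "Lima", "temp": 22}]

# CSV: flat and tabular; every value comes back as a string on read.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["city", "temp"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: preserves types and allows nesting, at the cost of more verbosity.
json_text = json.dumps(records)

round_trip = json.loads(json_text)                      # temp is still the int 22
csv_back = list(csv.DictReader(io.StringIO(csv_text)))  # temp is the string "22"
```

The last two lines show the practical difference: JSON round-trips types, while CSV flattens everything to strings and pushes type handling onto the consumer.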