Data Engineering Fundamentals

Types of Data

| Type | Definition | Examples |
|---|---|---|
| Structured | Organized in a defined schema | DB tables, CSV, Excel spreadsheets |
| Semi-Structured | Some structure via tags/hierarchies | JSON, XML, log files, email headers |
| Unstructured | No predefined structure | Videos, images, audio, emails, text files |

Properties of Data — The 3 V's

graph TD
    Data[Big Data] --> V1[Volume<br/>How much data]
    Data --> V2[Velocity<br/>How fast it arrives]
    Data --> V3[Variety<br/>How many types]

    style Data fill:#dbeafe,stroke:#3b82f6
    style V1 fill:#dcfce7,stroke:#16a34a
    style V2 fill:#fef9c3,stroke:#ca8a04
    style V3 fill:#f3e8ff,stroke:#9333ea
  • Volume — the amount of data an organization stores and processes at any given time
  • Velocity — the speed at which new data is generated, collected, and processed
  • Variety — the different types, structures, and sources of data

Data Storage Architectures

Data Warehouse

  • Centralized repository optimized for analysis
  • Data is extracted, cleaned/transformed, and loaded (ETL) before it is stored
  • Typically uses star or snowflake schema
  • Optimized for read-heavy, complex queries
  • Examples: Amazon Redshift, Google BigQuery, Azure Synapse Analytics (formerly Azure SQL DW)

Data Lake

  • Stores vast amounts of raw data in native format
  • No predefined schema — schema applied on read
  • Supports batch, real-time, and stream processing
  • Examples: AWS S3, Azure Data Lake, HDFS

Data Lakehouse

  • Hybrid — combines best of Data Lake + Data Warehouse
  • Supports both structured and unstructured data
  • Schema on write AND schema on read
  • ACID transactions via open table formats such as Delta Lake, Apache Iceberg, or Apache Hudi
  • Examples: AWS Lake Formation (S3 + Redshift Spectrum), Delta Lake

Data Mesh

  • Coined in 2019 by Zhamak Dehghani — about governance and organization, not technology
  • Individual teams own data products within their domain
  • Domain-based data management
  • Federated governance with central standards
  • Self-service tooling and infra
  • Data lakes/warehouses/S3 can all be part of it

Warehouse vs Lake vs Lakehouse

graph LR
    subgraph warehouse["Data Warehouse"]
        W1[Structured only]
        W2[Schema on Write]
        W3[ETL]
        W4[Fast complex queries]
        W5[High cost]
    end

    subgraph lake["Data Lake"]
        L1[All data types]
        L2[Schema on Read]
        L3[ELT or just store]
        L4[Flexible & scalable]
        L5[Low storage cost]
    end

    subgraph lakehouse["Data Lakehouse"]
        LH1[All data types]
        LH2[Both schemas]
        LH3[ACID transactions]
        LH4[Analytics + ML]
        LH5[Best of both]
    end

    style warehouse fill:#fff7ed,stroke:#f97316
    style lake fill:#f0fdf4,stroke:#22c55e
    style lakehouse fill:#faf5ff,stroke:#a855f7

|  | Data Warehouse | Data Lake | Data Lakehouse |
|---|---|---|---|
| Schema | Schema on Write (ETL) | Schema on Read (ELT) | Both |
| Data Type | Structured only | All types | All types |
| Agility | Less flexible | More flexible | Most flexible |
| Processing | ETL | ELT / just store | ETL + ELT |
| Cost | High (query optimization) | Low storage, high processing | Balanced |
| Use When | Structured data, fast analytics | Mixed data, massive scale | Need both analytics + ML |
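The schema-on-write vs schema-on-read split in the table above can be sketched in a few lines of Python. This is an illustrative toy only — the record fields and helper names are invented, not part of any warehouse or lake API:

```python
import json

# Schema on Write (warehouse style): shape and validate BEFORE storing.
SCHEMA = {"user_id": int, "amount": float}

def write_validated(store: list, record: dict) -> None:
    # Reject records that don't match the schema at load time.
    if set(record) != set(SCHEMA):
        raise ValueError("schema mismatch")
    store.append({k: SCHEMA[k](v) for k, v in record.items()})

# Schema on Read (lake style): store raw text, apply structure at query time.
def read_with_schema(raw_lines: list) -> list:
    return [json.loads(line) for line in raw_lines]

warehouse = []
write_validated(warehouse, {"user_id": "7", "amount": "19.99"})

lake_raw = ['{"user_id": 7, "amount": 19.99, "extra_field": "ok"}']
records = read_with_schema(lake_raw)  # extra fields survive; no upfront check
```

The trade-off is visible in the last line: the lake happily keeps a field the warehouse would have rejected, which is exactly why lakes are more flexible and warehouses more predictable.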

ETL Pipelines

graph LR
    Sources[Data Sources<br/>DBs, APIs, Files] -->|Extract| E[Extract<br/>Raw Data]
    E -->|Transform| T[Transform<br/>Clean, Enrich, Format]
    T -->|Load| L[Load<br/>Data Warehouse / Lake]

    style Sources fill:#dbeafe,stroke:#3b82f6
    style E fill:#dcfce7,stroke:#16a34a
    style T fill:#fef9c3,stroke:#ca8a04
    style L fill:#f3e8ff,stroke:#9333ea

Extract

  • Retrieve raw data from source systems (DBs, CRMs, flat files, APIs)
  • Can be real-time or batch
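The batch vs real-time distinction often shows up as full vs incremental extraction. A minimal sketch, assuming a high-watermark column (the row shape and function names are made up):

```python
rows = [
    {"id": 1, "updated": "2024-03-01"},
    {"id": 2, "updated": "2024-03-05"},
]

def full_extract(source):
    # Batch: pull everything on every run.
    return list(source)

def incremental_extract(source, watermark):
    # Pull only rows changed since the last successful run
    # (ISO date strings compare correctly as plain strings).
    return [r for r in source if r["updated"] > watermark]

delta = incremental_extract(rows, "2024-03-02")  # only the row with id 2
```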

Transform

  • Data cleansing — remove duplicates, fix errors
  • Data enrichment — add data from other sources
  • Format changes — date formatting, string manipulation
  • Aggregations — calculate totals, averages
  • Handle missing values, encoding/decoding

Load

  • Move transformed data to target warehouse/lake
  • Batch (all at once) or streaming (as data arrives)
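The Extract → Transform → Load steps above fit in one small end-to-end toy. All names are invented, an inline CSV stands in for a source system, and in-memory SQLite stands in for the target warehouse:

```python
import csv
import io
import sqlite3

# Extract: pull raw rows from a source (inline CSV standing in for a flat file).
raw = "id,name,signup\n1, Alice ,2024-01-05\n1, Alice ,2024-01-05\n2,bob,2024-02-10\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim whitespace, normalize case, drop duplicate ids.
seen, clean = set(), []
for r in rows:
    r = {k: v.strip() for k, v in r.items()}
    r["name"] = r["name"].title()
    if r["id"] not in seen:
        seen.add(r["id"])
        clean.append(r)

# Load: batch-insert the transformed rows into the target store.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, signup TEXT)")
db.executemany("INSERT INTO users VALUES (:id, :name, :signup)", clean)
count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]  # 2 after dedup
```

A production pipeline would add error handling, retries, and incremental watermarks, but the three phases keep the same shape.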

AWS ETL Services

| Service | Use |
|---|---|
| AWS Glue | Managed ETL service, auto schema discovery |
| AWS Glue Workflows | Orchestrate multi-step ETL jobs |
| EventBridge | Event-driven pipeline triggers |
| AWS MWAA | Managed Apache Airflow for complex orchestration |
| AWS Lambda | Lightweight, event-driven transforms |
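Of the services above, Lambda is the easiest to sketch: a handler receives an event, applies a small transform, and returns the result for the next stage. This is a hedged sketch — the event shape and field names are assumptions for illustration, not a real AWS event contract:

```python
import json

def handler(event, context=None):
    # Lightweight, event-driven transform: parse each incoming record,
    # keep only the fields downstream consumers need, and count the batch.
    records = [json.loads(r) for r in event.get("records", [])]
    cleaned = [
        {"id": r["id"], "amount": round(float(r["amount"]), 2)}
        for r in records
        if "id" in r and "amount" in r
    ]
    return {"count": len(cleaned), "records": cleaned}

# Local invocation with a fake event (no AWS account required):
result = handler({"records": ['{"id": 1, "amount": "19.999"}', '{"id": 2}']})
```

Because the handler is a pure function of its event, it can be unit-tested locally before it is ever wired to EventBridge or a Glue workflow.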

Data Sources

| Source | Full Name | Notes |
|---|---|---|
| JDBC | Java Database Connectivity | Platform independent, language dependent |
| ODBC | Open Database Connectivity | Platform dependent, language independent |
| Raw Logs | | Server, app, access logs |
| APIs | | REST/GraphQL responses |
| Streams | | Kafka, Kinesis real-time feeds |

Data Formats

graph TD
    Formats[Data Formats] --> Text[Text-Based]
    Formats --> Binary[Binary]

    Text --> CSV[CSV<br/>Tabular, human-readable<br/>Small-medium datasets]
    Text --> JSON[JSON<br/>Key-value, semi-structured<br/>APIs, web, NoSQL]

    Binary --> Avro[Avro<br/>Row-based + schema<br/>Kafka, streaming, schema evolution]
    Binary --> Parquet[Parquet<br/>Columnar, compressed<br/>Analytics, Spark, Redshift]

    style Formats fill:#dbeafe,stroke:#3b82f6
    style Text fill:#dcfce7,stroke:#16a34a
    style Binary fill:#fef9c3,stroke:#ca8a04

| Format | Type | Best For | Systems |
|---|---|---|---|
| CSV | Text, row | Small datasets, human-readable, import/export | Excel, Pandas, SQL DBs, ETL tools |
| JSON | Text, key-value | APIs, web, flexible schema, nested data | REST APIs, MongoDB, JS/Python |
| Avro | Binary, row + schema | Streaming, schema evolution, Kafka transport | Kafka, Spark, Flink, Hadoop |
| Parquet | Binary, columnar | Analytics, reading specific columns, large datasets | Spark, Hive, Redshift Spectrum, Athena |
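The two text-based formats in the table are easy to compare with the standard library alone (Parquet and Avro are binary formats and need third-party libraries such as pyarrow or fastavro, so they are only mentioned here):

```python
import csv
import io
import json

records = [{"city": "Oslo", "temp": 3}, {"city": "Lima", "temp": 22}]

# CSV: flat and tabular; every value comes back as a string on read.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["city", "temp"])
writer.writeheader()
writer.writerows(records)
csv_text = buf.getvalue()

# JSON: preserves types and allows nesting, at the cost of more verbosity.
json_text = json.dumps(records)

round_trip = json.loads(json_text)                      # temp is still the int 22
csv_back = list(csv.DictReader(io.StringIO(csv_text)))  # temp is the string "22"
```

The last two lines show the practical difference: JSON round-trips types, while CSV flattens everything to strings and pushes type handling onto the consumer.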