
Domain 1 — Data Preparation for ML (28%)


Largest domain. Focus on: S3, Glue, Athena, Kinesis, SageMaker Ground Truth, Feature Store, and feature engineering techniques.


Data Storage Options

graph LR
    Raw[Raw Data] --> S3[S3<br/>Data Lake]
    S3 --> Glue[Glue ETL<br/>Transform]
    Glue --> S3P[S3 Processed]
    S3P --> RS[Redshift<br/>Structured / DW]
    S3P --> Athena[Athena<br/>SQL Query]
    S3P --> SM[SageMaker<br/>Training]

    Stream[Streaming] --> KDS[Kinesis<br/>Data Streams]
    KDS --> KDF[Kinesis<br/>Firehose]
    KDF --> S3

    style S3 fill:#fef9c3,stroke:#ca8a04
    style SM fill:#dbeafe,stroke:#3b82f6
    style Glue fill:#dcfce7,stroke:#16a34a

S3 for ML

  • Primary data lake for raw, processed, and model artefacts
  • S3 prefixes act as folders — organise: s3://bucket/raw/, /processed/, /features/, /models/
  • Use S3 Intelligent-Tiering for infrequent access training data
  • S3 Select — retrieve specific columns/rows from CSV/JSON/Parquet without loading full file (saves cost + time)
  • Requester Pays — useful when sharing datasets across accounts
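
S3 Select can be driven from boto3. A minimal sketch, assuming a CSV object with a header row — the bucket, key, and column names are placeholders:

```python
# Sketch: pull only the needed columns from a CSV in S3 with S3 Select,
# instead of downloading the whole object. Bucket/key names are made up.

def build_select_expression(columns):
    """Build the S3 Select SQL expression for a list of CSV columns."""
    return f"SELECT {', '.join('s.' + c for c in columns)} FROM S3Object s"

def select_columns(bucket, key, columns):
    import boto3  # lazy import so the sketch parses without boto3 installed
    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=build_select_expression(columns),
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    # The response is an event stream; collect the Records payloads
    return b"".join(
        e["Records"]["Payload"] for e in resp["Payload"] if "Records" in e
    )

if __name__ == "__main__":
    print(select_columns("my-ml-bucket", "raw/users.csv", ["age", "income"]))
```

Only the selected bytes cross the wire, which is where the cost and time savings come from.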

Amazon Redshift

  • Columnar data warehouse — fast analytical queries
  • Redshift ML — create ML models using SQL (CREATE MODEL statement) — runs SageMaker under the hood
  • Redshift Spectrum — query data directly in S3 without loading into Redshift
  • Use for: structured data, BI dashboards, large-scale SQL analytics
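
The CREATE MODEL statement can be assembled in Python and submitted through the Redshift Data API. A hedged sketch — the table, columns, IAM role ARN, and cluster identifier are all invented:

```python
# Sketch of Redshift ML's CREATE MODEL, submitted via the Redshift Data API.
# Redshift trains the model through SageMaker behind the scenes.

def create_model_sql(model, query, target, fn, iam_role, s3_bucket):
    """Assemble a Redshift ML CREATE MODEL statement."""
    return (
        f"CREATE MODEL {model}\n"
        f"FROM ({query})\n"
        f"TARGET {target}\n"
        f"FUNCTION {fn}\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}');"
    )

if __name__ == "__main__":
    import boto3  # lazy import: only needed to actually run the statement
    sql = create_model_sql(
        model="churn_model",
        query="SELECT age, tenure, churned FROM customers",  # placeholder table
        target="churned",
        fn="predict_churn",
        iam_role="arn:aws:iam::123456789012:role/RedshiftMLRole",
        s3_bucket="my-redshift-ml-bucket",
    )
    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )
```

Once trained, `predict_churn(...)` becomes callable directly in SQL queries.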

Athena

  • Serverless SQL on S3 — no infrastructure, pay per query
  • Supports: CSV, JSON, Parquet, ORC, Avro
  • Integrates with Glue Data Catalog for schema discovery
  • Parquet/ORC recommended — columnar, compressed, much faster + cheaper than CSV
  • Use for: ad-hoc exploration, quick EDA on S3 data
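
An ad-hoc query from Python is a thin wrapper around Athena's StartQueryExecution API. A sketch — the database name and results bucket are placeholders:

```python
# Sketch: run an ad-hoc Athena query over data in S3.
# Athena writes results to the OutputLocation bucket you specify.

def athena_query_params(sql, database, output_s3):
    """Build the parameters for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

if __name__ == "__main__":
    import boto3  # lazy import so the sketch parses without boto3 installed
    params = athena_query_params(
        "SELECT label, COUNT(*) FROM events GROUP BY label",  # placeholder table
        database="ml_db",
        output_s3="s3://my-athena-results/",
    )
    qid = boto3.client("athena").start_query_execution(**params)["QueryExecutionId"]
    print(qid)  # poll get_query_execution with this ID until the query finishes
```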

AWS Glue

The primary ETL service for ML data preparation.

Components

| Component | Purpose |
| --- | --- |
| Glue Data Catalog | Centralised metadata store — tables, schemas, partitions |
| Glue Crawlers | Auto-discover schema from S3/RDS/Redshift, populate Data Catalog |
| Glue ETL Jobs | Spark-based (PySpark/Scala) or Python shell scripts |
| Glue DataBrew | No-code visual data preparation and cleaning |
| Glue Studio | Visual ETL pipeline builder |
| Glue Streaming | Real-time ETL on Kinesis / Kafka streams |

Glue ETL — Common Transformations for ML

# Example Glue PySpark ETL
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read raw CSV from S3 into a Spark DataFrame
df = spark.read.csv("s3://bucket/raw/", header=True, inferSchema=True)

# Drop rows containing nulls
df = df.dropna()

# Filter
df = df.filter(df.age > 18)

# Rename column
df = df.withColumnRenamed("old_name", "new_name")

# write_dynamic_frame expects a DynamicFrame, so convert the DataFrame first
dyf = DynamicFrame.fromDF(df, glueContext, "cleaned")

# Write to S3 in Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://bucket/processed/"}
)

Glue DataBrew

  • No-code visual tool for data cleaning
  • 250+ built-in transformations (normalise, deduplicate, fix data types, fill missing)
  • Profiles data — auto-detects issues (missing values, outliers, type mismatches)
  • Publishes cleaned data to S3
  • Use when: data analysts/domain experts (non-engineers) need to clean data

Amazon Kinesis — Streaming Data for ML

| Service | Purpose | Retention |
| --- | --- | --- |
| Kinesis Data Streams | Real-time data ingestion, custom consumers | 1–365 days |
| Kinesis Data Firehose | Managed delivery to S3/Redshift/OpenSearch | No retention (near-real-time) |
| Kinesis Data Analytics | SQL/Apache Flink on streams | In-flight processing |
| MSK (Kafka) | Managed Kafka — alternative to KDS | Configurable |

Kinesis Data Streams — Key Facts

  • Data split into shards
  • 1 shard = 1 MB/s (or 1,000 records/s) write, 2 MB/s read — read throughput is shared across all consumers
  • Enhanced Fan-Out — 2 MB/s per consumer per shard (parallel consumers)
  • Partition key determines which shard a record goes to — use high-cardinality keys to avoid hot shards
  • KCL (Kinesis Client Library) — manages shard reading, checkpointing
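
A sketch of writing a record, assuming a hypothetical stream name; the point is that using a high-cardinality value (here a device ID) as the partition key spreads records evenly across shards:

```python
# Sketch: put one record onto a Kinesis Data Stream.
# Stream and field names are placeholders.
import json

def partition_key_for(record):
    """Derive a high-cardinality partition key from the record itself."""
    return str(record["device_id"])

if __name__ == "__main__":
    import boto3  # lazy import so the sketch parses without boto3 installed
    record = {"device_id": "sensor-042", "temp_c": 21.7}
    boto3.client("kinesis").put_record(
        StreamName="iot-ingest",                 # placeholder stream name
        Data=json.dumps(record).encode(),
        PartitionKey=partition_key_for(record),  # avoids hot shards
    )
```

A low-cardinality key (e.g. a constant, or a handful of region codes) would funnel all traffic onto a few shards and cap throughput there.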

ML with Streaming

IoT / Clickstream / Logs
        ↓
Kinesis Data Streams
        ↓
Lambda / Kinesis Analytics  ← real-time feature computation
        ↓
SageMaker Endpoint          ← real-time inference
        ↓
Kinesis Firehose → S3       ← store for future retraining
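
The Lambda step above might look like this sketch. The endpoint name is a placeholder, but the base64 decoding reflects how Kinesis payloads arrive inside Lambda events:

```python
# Sketch of a Lambda consumer: decode Kinesis records and send each one to a
# SageMaker endpoint for real-time inference. Endpoint name is made up.
import base64
import json

def decode_kinesis_record(record):
    """Kinesis delivers payloads base64-encoded inside the Lambda event."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))

def handler(event, context):
    import boto3  # lazy import so the module loads without boto3 installed
    runtime = boto3.client("sagemaker-runtime")
    for rec in event["Records"]:
        payload = decode_kinesis_record(rec)
        runtime.invoke_endpoint(
            EndpointName="fraud-detector",       # placeholder endpoint
            ContentType="application/json",
            Body=json.dumps(payload),
        )
```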

Data Labelling — SageMaker Ground Truth

  • Managed data labelling service — human labellers + ML-assisted labelling
  • Integrates with: Amazon Mechanical Turk, private workforce, AWS Marketplace vendors

Workflow

Raw Data (S3) → Labelling Job → Human Workers → Labelled Dataset (S3)

Label Types Supported

  • Image classification, bounding boxes, semantic segmentation, keypoints
  • Text classification, named entity recognition
  • Video classification, object tracking

Automated Data Labelling

  • Active Learning loop — the model auto-labels examples it is confident about and routes uncertain ones to humans
  • Reduces labelling cost by up to 70% (humans only label ambiguous samples)
  • Consolidation algorithm (Expectation Maximisation) — combines multiple worker annotations

Ground Truth Plus

  • Fully managed labelling with AWS-managed workforce
  • Higher quality, less setup — use when you don't want to manage workforce yourself

SageMaker Feature Store

Central repository to store, discover, and share ML features.

graph LR
    Data[Raw Data] --> FE[Feature Engineering<br/>Code]
    FE --> FS[Feature Store]
    FS --> Online[Online Store<br/>Low-latency serving<br/>Real-time inference]
    FS --> Offline[Offline Store<br/>S3-backed<br/>Training]

    style FS fill:#dbeafe,stroke:#3b82f6
    style Online fill:#dcfce7,stroke:#16a34a
    style Offline fill:#fef9c3,stroke:#ca8a04

| | Online Store | Offline Store |
| --- | --- | --- |
| Backend | DynamoDB | S3 (Parquet) |
| Latency | Single-digit ms | Minutes |
| Use for | Real-time inference | Model training |
| Consistency | Eventually consistent | Point-in-time correct |

Key Concepts

  • Feature Group — a collection of related features (like a table)
  • Record Identifier — unique key per entity (customer_id, product_id)
  • Event Time — timestamp of when feature value was valid → enables point-in-time lookups (prevents training-serving skew)
  • Point-in-time correctness — when creating training dataset, only features available at prediction time are used → prevents leakage
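
The concepts above map directly onto the PutRecord call. A sketch, assuming a hypothetical feature group where `customer_id` is the record identifier and `event_time` is the event-time feature:

```python
# Sketch: write one record to a SageMaker Feature Store feature group.
# Feature group and feature names are placeholders.
from datetime import datetime, timezone

def to_feature_record(features):
    """Convert a dict into the Record shape that PutRecord expects."""
    return [
        {"FeatureName": k, "ValueAsString": str(v)} for k, v in features.items()
    ]

if __name__ == "__main__":
    import boto3  # lazy import so the sketch parses without boto3 installed
    record = to_feature_record({
        "customer_id": "c-123",            # record identifier
        "avg_basket_value": 42.5,
        # event time enables point-in-time lookups for training sets
        "event_time": datetime.now(timezone.utc).isoformat(),
    })
    boto3.client("sagemaker-featurestore-runtime").put_record(
        FeatureGroupName="customer-features", Record=record
    )
```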

Why Feature Store?

  • Eliminates duplicate feature computation across teams
  • Eliminates training-serving skew (same features used in training and inference)
  • Time-travel queries — get historical feature values for a given timestamp

Data Quality & Exploration

SageMaker Data Wrangler

  • Visual data preparation tool inside SageMaker Studio
  • Connect to: S3, Athena, Redshift, EMR, Feature Store
  • 300+ built-in transforms
  • Auto generates feature importance, data insights report
  • Exports to: SageMaker Pipelines, Jupyter Notebook, Feature Store, Training Job

SageMaker Clarify — Data Bias Detection

  • Detects pre-training bias in datasets (before model training)
  • Bias metrics:
    • Class Imbalance (CI) — imbalanced representation of groups
    • Difference in Proportions of Labels (DPL) — different positive label rates across groups
    • Kullback-Leibler (KL) divergence between distributions
  • Generates bias report → guides whether to resample, reweight data
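
The two headline metrics are simple enough to compute by hand. This sketch mirrors the definitions Clarify uses: CI compares group sizes, DPL compares positive-label rates between groups:

```python
# Pre-training bias metrics, computed directly.

def class_imbalance(groups, advantaged):
    """CI = (n_a - n_d) / (n_a + n_d); ranges -1..1, 0 means balanced."""
    n_a = sum(1 for g in groups if g == advantaged)
    n_d = len(groups) - n_a
    return (n_a - n_d) / (n_a + n_d)

def dpl(groups, labels, advantaged, positive=1):
    """DPL = q_a - q_d: gap in positive-label proportion between groups."""
    pos_a = sum(1 for g, y in zip(groups, labels) if g == advantaged and y == positive)
    pos_d = sum(1 for g, y in zip(groups, labels) if g != advantaged and y == positive)
    n_a = sum(1 for g in groups if g == advantaged)
    n_d = len(groups) - n_a
    return pos_a / n_a - pos_d / n_d
```

A CI of 0.5 (one group is 3× the other) or a large DPL is the kind of signal that would prompt resampling or reweighting before training.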

Data Ingestion Patterns

Batch Ingestion

Source (DB/Files) → S3 → Glue ETL → S3 Processed → SageMaker Training

  • Use: historical data, periodic retraining
  • Tools: Glue, EMR, AWS DMS (Database Migration Service)

Streaming Ingestion

Source → Kinesis Data Streams → Lambda/KDA → Feature Store → SageMaker Endpoint

  • Use: real-time features, online learning
  • Tools: Kinesis, MSK, Lambda, Flink

Hybrid (Lambda Architecture)

Batch Layer:    S3 → Glue → Offline Feature Store → Training
Speed Layer:    Kinesis → Lambda → Online Feature Store → Inference
Serving Layer:  Merge batch + streaming features → Prediction

EMR for ML Data Preparation

  • Managed Hadoop / Spark cluster
  • Use for: very large datasets (TBs), custom Spark transformations, Spark MLlib
  • Supports: Spark, Hive, Presto, HBase, Flink
  • EMR vs Glue:
    • Glue = fully managed, serverless, simpler
    • EMR = more control, custom libraries, very large scale

Data Formats for ML

| Format | Best For | Notes |
| --- | --- | --- |
| CSV | Simple, human-readable | Slow, large, no compression |
| JSON | Semi-structured | Verbose, slow |
| Parquet | Analytics, ML | Columnar, compressed, fast for queries |
| ORC | Hive workloads | Columnar, good for Hive/EMR |
| TFRecord | TensorFlow | Binary, efficient for TF training |
| RecordIO | MXNet | Binary, SageMaker built-in algorithms |
| LibSVM | Sparse data, XGBoost | Efficient for sparse features |

SageMaker built-in algorithms often prefer RecordIO or CSV. XGBoost accepts CSV and LibSVM. Always check the algorithm input format in the docs.
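
To make the LibSVM format concrete: each line is `<label> <index>:<value> ...`, and only non-zero entries are written, which is why it suits sparse features. A small sketch:

```python
# Illustration of the LibSVM text format for sparse rows.

def to_libsvm(label, features):
    """features: dict mapping (conventionally 1-based) feature index -> value."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

if __name__ == "__main__":
    # A row with only 3 non-zero features out of potentially thousands
    print(to_libsvm(1, {3: 0.5, 17: 1.0, 204: 2.25}))
```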


Data Pipeline Orchestration

AWS Step Functions

  • Visual workflow orchestration — chain Lambda, Glue, SageMaker steps
  • Good for: complex multi-step pipelines with branching logic

Amazon MWAA (Managed Airflow)

  • Managed Apache Airflow — Python DAG-based orchestration
  • Use for: complex data engineering pipelines, team already using Airflow

EventBridge + Lambda

  • Event-driven triggers — run pipeline when new data lands in S3
  • Lightweight, serverless
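
A sketch of that trigger: a Lambda fired by EventBridge's S3 "Object Created" event that kicks off a Glue job. The job name and argument key are placeholders, but the event shape follows S3's EventBridge notification format:

```python
# Sketch: event-driven pipeline start when a new object lands in S3.

def parse_s3_event(event):
    """Extract (bucket, key) from an EventBridge S3 object-created event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

def handler(event, context):
    import boto3  # lazy import so the module loads without boto3 installed
    bucket, key = parse_s3_event(event)
    boto3.client("glue").start_job_run(
        JobName="prepare-training-data",           # placeholder job name
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
```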

Domain 1 — Exam Scenarios

| Scenario | Answer |
| --- | --- |
| Discover schema of S3 data automatically | Glue Crawler |
| Run SQL on S3 without loading to DB | Athena |
| No-code data cleaning for domain expert | Glue DataBrew |
| Real-time feature serving for inference | Feature Store — Online Store |
| Training dataset from historical features | Feature Store — Offline Store |
| Prevent training-serving skew | Feature Store with event time |
| Label 10,000 images cheaply | Ground Truth with Active Learning |
| Stream IoT data to S3 for training | Kinesis Firehose → S3 |
| Real-time streaming ML pipeline | Kinesis Data Streams → Lambda → SageMaker Endpoint |
| Large-scale Spark ETL with custom libs | EMR |
| Managed ETL, serverless, simple | Glue |
| Detect bias in training data | SageMaker Clarify (pre-training) |
| Visual ML data prep inside Studio | SageMaker Data Wrangler |
| Fast columnar file format for training | Parquet |
| Point-in-time correct training dataset | Feature Store with event time |