
Domain 1 — Data Preparation for ML (28%)


Largest domain. Focus on: S3, Glue, Athena, Kinesis, SageMaker Ground Truth, Feature Store, and feature engineering techniques.


Data Storage Options

graph LR
    Raw[Raw Data] --> S3[S3<br/>Data Lake]
    S3 --> Glue[Glue ETL<br/>Transform]
    Glue --> S3P[S3 Processed]
    S3P --> RS[Redshift<br/>Structured / DW]
    S3P --> Athena[Athena<br/>SQL Query]
    S3P --> SM[SageMaker<br/>Training]

    Stream[Streaming] --> KDS[Kinesis<br/>Data Streams]
    KDS --> KDF[Kinesis<br/>Firehose]
    KDF --> S3

    style S3 fill:#fef9c3,stroke:#ca8a04
    style SM fill:#dbeafe,stroke:#3b82f6
    style Glue fill:#dcfce7,stroke:#16a34a

S3 for ML

  • Primary data lake for raw, processed, and model artefacts
  • S3 prefixes act as folders — organise: s3://bucket/raw/, /processed/, /features/, /models/
  • Use S3 Intelligent-Tiering for infrequent access training data
  • S3 Select — retrieve specific columns/rows from CSV/JSON/Parquet without loading full file (saves cost + time)
  • Requester Pays — useful when sharing datasets across accounts
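
S3 Select can be driven from boto3. A minimal sketch, assuming a CSV object with a header row — the bucket, key, and column names are placeholders:

```python
# Sketch: pull only the needed columns from a CSV in S3 with S3 Select,
# instead of downloading the whole object. Bucket/key names are made up.

def build_select_expression(columns):
    """Build the S3 Select SQL expression for a list of CSV columns."""
    return f"SELECT {', '.join('s.' + c for c in columns)} FROM S3Object s"

def select_columns(bucket, key, columns):
    import boto3  # lazy import so the sketch parses without boto3 installed
    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=build_select_expression(columns),
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"CSV": {}},
    )
    # The response is an event stream; collect the Records payloads
    return b"".join(
        e["Records"]["Payload"] for e in resp["Payload"] if "Records" in e
    )

if __name__ == "__main__":
    print(select_columns("my-ml-bucket", "raw/users.csv", ["age", "income"]))
```

Only the selected bytes cross the wire, which is where the cost and time savings come from.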

Amazon Redshift

  • Columnar data warehouse — fast analytical queries
  • Redshift ML — create ML models using SQL (CREATE MODEL statement) — runs SageMaker under the hood
  • Redshift Spectrum — query data directly in S3 without loading into Redshift
  • Use for: structured data, BI dashboards, large-scale SQL analytics
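
The CREATE MODEL statement can be assembled in Python and submitted through the Redshift Data API. A hedged sketch — the table, columns, IAM role ARN, and cluster identifier are all invented:

```python
# Sketch of Redshift ML's CREATE MODEL, submitted via the Redshift Data API.
# Redshift trains the model through SageMaker behind the scenes.

def create_model_sql(model, query, target, fn, iam_role, s3_bucket):
    """Assemble a Redshift ML CREATE MODEL statement."""
    return (
        f"CREATE MODEL {model}\n"
        f"FROM ({query})\n"
        f"TARGET {target}\n"
        f"FUNCTION {fn}\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"SETTINGS (S3_BUCKET '{s3_bucket}');"
    )

if __name__ == "__main__":
    import boto3  # lazy import: only needed to actually run the statement
    sql = create_model_sql(
        model="churn_model",
        query="SELECT age, tenure, churned FROM customers",  # placeholder table
        target="churned",
        fn="predict_churn",
        iam_role="arn:aws:iam::123456789012:role/RedshiftMLRole",
        s3_bucket="my-redshift-ml-bucket",
    )
    boto3.client("redshift-data").execute_statement(
        ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=sql
    )
```

Once trained, `predict_churn(...)` becomes callable directly in SQL queries.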

Athena

  • Serverless SQL on S3 — no infrastructure, pay per query
  • Supports: CSV, JSON, Parquet, ORC, Avro
  • Integrates with Glue Data Catalog for schema discovery
  • Parquet/ORC recommended — columnar, compressed, much faster + cheaper than CSV
  • Use for: ad-hoc exploration, quick EDA on S3 data
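
An ad-hoc query from Python is a thin wrapper around Athena's StartQueryExecution API. A sketch — the database name and results bucket are placeholders:

```python
# Sketch: run an ad-hoc Athena query over data in S3.
# Athena writes results to the OutputLocation bucket you specify.

def athena_query_params(sql, database, output_s3):
    """Build the parameters for Athena's StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

if __name__ == "__main__":
    import boto3  # lazy import so the sketch parses without boto3 installed
    params = athena_query_params(
        "SELECT label, COUNT(*) FROM events GROUP BY label",  # placeholder table
        database="ml_db",
        output_s3="s3://my-athena-results/",
    )
    qid = boto3.client("athena").start_query_execution(**params)["QueryExecutionId"]
    print(qid)  # poll get_query_execution with this ID until the query finishes
```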

AWS Glue

The primary ETL service for ML data preparation.

Components

| Component | Purpose |
| --- | --- |
| Glue Data Catalog | Centralised metadata store — tables, schemas, partitions |
| Glue Crawlers | Auto-discover schema from S3/RDS/Redshift, populate Data Catalog |
| Glue ETL Jobs | Spark-based (PySpark/Scala) or Python shell scripts |
| Glue DataBrew | No-code visual data preparation and cleaning |
| Glue Studio | Visual ETL pipeline builder |
| Glue Streaming | Real-time ETL on Kinesis / Kafka streams |

Glue ETL — Common Transformations for ML

# Example Glue PySpark ETL
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Read raw CSV from S3 into a Spark DataFrame
df = spark.read.csv("s3://bucket/raw/", header=True, inferSchema=True)

# Drop rows containing nulls
df = df.dropna()

# Filter
df = df.filter(df.age > 18)

# Rename column
df = df.withColumnRenamed("old_name", "new_name")

# write_dynamic_frame expects a DynamicFrame, so convert the DataFrame first
dyf = DynamicFrame.fromDF(df, glueContext, "cleaned")

# Write to S3 in Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://bucket/processed/"}
)

Glue DataBrew

  • No-code visual tool for data cleaning
  • 250+ built-in transformations (normalise, deduplicate, fix data types, fill missing)
  • Profiles data — auto-detects issues (missing values, outliers, type mismatches)
  • Publishes cleaned data to S3
  • Use when: data analysts/domain experts (non-engineers) need to clean data

Amazon Kinesis — Streaming Data for ML

| Service | Purpose | Retention |
| --- | --- | --- |
| Kinesis Data Streams | Real-time data ingestion, custom consumers | 1–365 days |
| Kinesis Data Firehose | Managed delivery to S3/Redshift/OpenSearch | No retention (near-real-time) |
| Kinesis Data Analytics | SQL/Apache Flink on streams | In-flight processing |
| MSK (Kafka) | Managed Kafka — alternative to KDS | Configurable |

Kinesis Data Streams — Key Facts

  • Data split into shards
  • 1 shard = 1 MB/s (or 1,000 records/s) write, 2 MB/s read — read throughput is shared across all consumers
  • Enhanced Fan-Out — 2 MB/s per consumer per shard (parallel consumers)
  • Partition key determines which shard a record goes to — use high-cardinality keys to avoid hot shards
  • KCL (Kinesis Client Library) — manages shard reading, checkpointing
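
A sketch of writing a record, assuming a hypothetical stream name; the point is that using a high-cardinality value (here a device ID) as the partition key spreads records evenly across shards:

```python
# Sketch: put one record onto a Kinesis Data Stream.
# Stream and field names are placeholders.
import json

def partition_key_for(record):
    """Derive a high-cardinality partition key from the record itself."""
    return str(record["device_id"])

if __name__ == "__main__":
    import boto3  # lazy import so the sketch parses without boto3 installed
    record = {"device_id": "sensor-042", "temp_c": 21.7}
    boto3.client("kinesis").put_record(
        StreamName="iot-ingest",                 # placeholder stream name
        Data=json.dumps(record).encode(),
        PartitionKey=partition_key_for(record),  # avoids hot shards
    )
```

A low-cardinality key (e.g. a constant, or a handful of region codes) would funnel all traffic onto a few shards and cap throughput there.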

ML with Streaming

IoT / Clickstream / Logs
        ↓
Kinesis Data Streams
        ↓
Lambda / Kinesis Analytics  ← real-time feature computation
        ↓
SageMaker Endpoint          ← real-time inference
        ↓
Kinesis Firehose → S3       ← store for future retraining
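
The Lambda step above might look like this sketch. The endpoint name is a placeholder, but the base64 decoding reflects how Kinesis payloads arrive inside Lambda events:

```python
# Sketch of a Lambda consumer: decode Kinesis records and send each one to a
# SageMaker endpoint for real-time inference. Endpoint name is made up.
import base64
import json

def decode_kinesis_record(record):
    """Kinesis delivers payloads base64-encoded inside the Lambda event."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))

def handler(event, context):
    import boto3  # lazy import so the module loads without boto3 installed
    runtime = boto3.client("sagemaker-runtime")
    for rec in event["Records"]:
        payload = decode_kinesis_record(rec)
        runtime.invoke_endpoint(
            EndpointName="fraud-detector",       # placeholder endpoint
            ContentType="application/json",
            Body=json.dumps(payload),
        )
```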

Data Labelling — SageMaker Ground Truth

  • Managed data labelling service — human labellers + ML-assisted labelling
  • Integrates with: Amazon Mechanical Turk, private workforce, AWS Marketplace vendors

Workflow

Raw Data (S3) → Labelling Job → Human Workers → Labelled Dataset (S3)

Label Types Supported

  • Image classification, bounding boxes, semantic segmentation, keypoints
  • Text classification, named entity recognition
  • Video classification, object tracking

Automated Data Labelling

  • Active Learning loop — the model auto-labels examples it is confident about and routes uncertain ones to humans
  • Reduces labelling cost by up to 70% (humans only label ambiguous samples)
  • Consolidation algorithm (Expectation Maximisation) — combines multiple worker annotations

Ground Truth Plus

  • Fully managed labelling with AWS-managed workforce
  • Higher quality, less setup — use when you don't want to manage workforce yourself

SageMaker Feature Store

Central repository to store, discover, and share ML features.

graph LR
    Data[Raw Data] --> FE[Feature Engineering<br/>Code]
    FE --> FS[Feature Store]
    FS --> Online[Online Store<br/>Low-latency serving<br/>Real-time inference]
    FS --> Offline[Offline Store<br/>S3-backed<br/>Training]

    style FS fill:#dbeafe,stroke:#3b82f6
    style Online fill:#dcfce7,stroke:#16a34a
    style Offline fill:#fef9c3,stroke:#ca8a04

| | Online Store | Offline Store |
| --- | --- | --- |
| Backend | DynamoDB | S3 (Parquet) |
| Latency | Single-digit ms | Minutes |
| Use for | Real-time inference | Model training |
| Consistency | Eventually consistent | Point-in-time correct |

Key Concepts

  • Feature Group — a collection of related features (like a table)
  • Record Identifier — unique key per entity (customer_id, product_id)
  • Event Time — timestamp of when feature value was valid → enables point-in-time lookups (prevents training-serving skew)
  • Point-in-time correctness — when creating training dataset, only features available at prediction time are used → prevents leakage
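
The concepts above map directly onto the PutRecord call. A sketch, assuming a hypothetical feature group where `customer_id` is the record identifier and `event_time` is the event-time feature:

```python
# Sketch: write one record to a SageMaker Feature Store feature group.
# Feature group and feature names are placeholders.
from datetime import datetime, timezone

def to_feature_record(features):
    """Convert a dict into the Record shape that PutRecord expects."""
    return [
        {"FeatureName": k, "ValueAsString": str(v)} for k, v in features.items()
    ]

if __name__ == "__main__":
    import boto3  # lazy import so the sketch parses without boto3 installed
    record = to_feature_record({
        "customer_id": "c-123",            # record identifier
        "avg_basket_value": 42.5,
        # event time enables point-in-time lookups for training sets
        "event_time": datetime.now(timezone.utc).isoformat(),
    })
    boto3.client("sagemaker-featurestore-runtime").put_record(
        FeatureGroupName="customer-features", Record=record
    )
```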

Why Feature Store?

  • Eliminates duplicate feature computation across teams
  • Eliminates training-serving skew (same features used in training and inference)
  • Time-travel queries — get historical feature values for a given timestamp

Data Quality & Exploration

SageMaker Data Wrangler

  • Visual data preparation tool inside SageMaker Studio
  • Connect to: S3, Athena, Redshift, EMR, Feature Store
  • 300+ built-in transforms
  • Auto generates feature importance, data insights report
  • Exports to: SageMaker Pipelines, Jupyter Notebook, Feature Store, Training Job

SageMaker Clarify — Data Bias Detection

  • Detects pre-training bias in datasets (before model training)
  • Bias metrics:
    • Class Imbalance (CI) — imbalanced representation of groups
    • Difference in Proportions of Labels (DPL) — different positive label rates across groups
    • Kullback-Leibler (KL) divergence between distributions
  • Generates bias report → guides whether to resample, reweight data
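
The two headline metrics are simple enough to compute by hand. This sketch mirrors the definitions Clarify uses: CI compares group sizes, DPL compares positive-label rates between groups:

```python
# Pre-training bias metrics, computed directly.

def class_imbalance(groups, advantaged):
    """CI = (n_a - n_d) / (n_a + n_d); ranges -1..1, 0 means balanced."""
    n_a = sum(1 for g in groups if g == advantaged)
    n_d = len(groups) - n_a
    return (n_a - n_d) / (n_a + n_d)

def dpl(groups, labels, advantaged, positive=1):
    """DPL = q_a - q_d: gap in positive-label proportion between groups."""
    pos_a = sum(1 for g, y in zip(groups, labels) if g == advantaged and y == positive)
    pos_d = sum(1 for g, y in zip(groups, labels) if g != advantaged and y == positive)
    n_a = sum(1 for g in groups if g == advantaged)
    n_d = len(groups) - n_a
    return pos_a / n_a - pos_d / n_d
```

A CI of 0.5 (one group is 3× the other) or a large DPL is the kind of signal that would prompt resampling or reweighting before training.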

Data Ingestion Patterns

Batch Ingestion

Source (DB/Files) → S3 → Glue ETL → S3 Processed → SageMaker Training

  • Use: historical data, periodic retraining
  • Tools: Glue, EMR, AWS DMS (Database Migration Service)

Streaming Ingestion

Source → Kinesis Data Streams → Lambda/KDA → Feature Store → SageMaker Endpoint

  • Use: real-time features, online learning
  • Tools: Kinesis, MSK, Lambda, Flink

Hybrid (Lambda Architecture)

Batch Layer:    S3 → Glue → Offline Feature Store → Training
Speed Layer:    Kinesis → Lambda → Online Feature Store → Inference
Serving Layer:  Merge batch + streaming features → Prediction

EMR for ML Data Preparation

  • Managed Hadoop / Spark cluster
  • Use for: very large datasets (TBs), custom Spark transformations, Spark MLlib
  • Supports: Spark, Hive, Presto, HBase, Flink
  • EMR vs Glue:
    • Glue = fully managed, serverless, simpler
    • EMR = more control, custom libraries, very large scale

Data Formats for ML

| Format | Best For | Notes |
| --- | --- | --- |
| CSV | Simple, human-readable | Slow, large, no compression |
| JSON | Semi-structured | Verbose, slow |
| Parquet | Analytics, ML | Columnar, compressed, fast for queries |
| ORC | Hive workloads | Columnar, good for Hive/EMR |
| TFRecord | TensorFlow | Binary, efficient for TF training |
| RecordIO | MXNet | Binary, SageMaker built-in algorithms |
| LibSVM | Sparse data, XGBoost | Efficient for sparse features |

SageMaker built-in algorithms often prefer RecordIO or CSV. XGBoost accepts CSV and LibSVM. Always check the algorithm input format in the docs.
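
To make the LibSVM format concrete: each line is `<label> <index>:<value> ...`, and only non-zero entries are written, which is why it suits sparse features. A small sketch:

```python
# Illustration of the LibSVM text format for sparse rows.

def to_libsvm(label, features):
    """features: dict mapping (conventionally 1-based) feature index -> value."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

if __name__ == "__main__":
    # A row with only 3 non-zero features out of potentially thousands
    print(to_libsvm(1, {3: 0.5, 17: 1.0, 204: 2.25}))
```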


Data Pipeline Orchestration

AWS Step Functions

  • Visual workflow orchestration — chain Lambda, Glue, SageMaker steps
  • Good for: complex multi-step pipelines with branching logic

Amazon MWAA (Managed Airflow)

  • Managed Apache Airflow — Python DAG-based orchestration
  • Use for: complex data engineering pipelines, team already using Airflow

EventBridge + Lambda

  • Event-driven triggers — run pipeline when new data lands in S3
  • Lightweight, serverless
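
A sketch of that trigger: a Lambda fired by EventBridge's S3 "Object Created" event that kicks off a Glue job. The job name and argument key are placeholders, but the event shape follows S3's EventBridge notification format:

```python
# Sketch: event-driven pipeline start when a new object lands in S3.

def parse_s3_event(event):
    """Extract (bucket, key) from an EventBridge S3 object-created event."""
    detail = event["detail"]
    return detail["bucket"]["name"], detail["object"]["key"]

def handler(event, context):
    import boto3  # lazy import so the module loads without boto3 installed
    bucket, key = parse_s3_event(event)
    boto3.client("glue").start_job_run(
        JobName="prepare-training-data",           # placeholder job name
        Arguments={"--input_path": f"s3://{bucket}/{key}"},
    )
```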

Domain 1 — Exam Scenarios

| Scenario | Answer |
| --- | --- |
| Discover schema of S3 data automatically | Glue Crawler |
| Run SQL on S3 without loading to DB | Athena |
| No-code data cleaning for domain expert | Glue DataBrew |
| Real-time feature serving for inference | Feature Store — Online Store |
| Training dataset from historical features | Feature Store — Offline Store |
| Prevent training-serving skew | Feature Store with event time |
| Label 10,000 images cheaply | Ground Truth with Active Learning |
| Stream IoT data to S3 for training | Kinesis Firehose → S3 |
| Real-time streaming ML pipeline | Kinesis Data Streams → Lambda → SageMaker Endpoint |
| Large-scale Spark ETL with custom libs | EMR |
| Managed ETL, serverless, simple | Glue |
| Detect bias in training data | SageMaker Clarify (pre-training) |
| Visual ML data prep inside Studio | SageMaker Data Wrangler |
| Fast columnar file format for training | Parquet |
| Point-in-time correct training dataset | Feature Store with event time |