Domain 1 — Data Preparation for ML (28%)
Largest domain. Focus on: S3, Glue, Athena, Kinesis, SageMaker Ground Truth, Feature Store, and feature engineering techniques.
Data Storage Options
```mermaid
graph LR
Raw[Raw Data] --> S3[S3<br/>Data Lake]
S3 --> Glue[Glue ETL<br/>Transform]
Glue --> S3P[S3 Processed]
S3P --> RS[Redshift<br/>Structured / DW]
S3P --> Athena[Athena<br/>SQL Query]
S3P --> SM[SageMaker<br/>Training]
Stream[Streaming] --> KDS[Kinesis<br/>Data Streams]
KDS --> KDF[Kinesis<br/>Firehose]
KDF --> S3
style S3 fill:#fef9c3,stroke:#ca8a04
style SM fill:#dbeafe,stroke:#3b82f6
style Glue fill:#dcfce7,stroke:#16a34a
```
S3 for ML
- Primary data lake for raw, processed, and model artefacts
- S3 prefixes act as folders — organise:
s3://bucket/raw/, /processed/, /features/, /models/
- Use S3 Intelligent-Tiering for infrequent access training data
- S3 Select — retrieve specific columns/rows from CSV/JSON/Parquet without loading full file (saves cost + time)
- Requester Pays — useful when sharing datasets across accounts
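The prefix layout above can be captured in a small helper. This is only a sketch of one naming convention — the bucket name, stage names, and path shape are illustrative choices, not AWS requirements.

```python
# Sketch of the s3://bucket/<stage>/ prefix convention described above.
# "my-ml-bucket" and the stage names are placeholder assumptions.
def s3_uri(bucket: str, stage: str, dataset: str, filename: str) -> str:
    """Build an S3 URI like s3://bucket/processed/churn/part-0000.parquet."""
    stages = {"raw", "processed", "features", "models"}
    if stage not in stages:
        raise ValueError(f"unknown stage {stage!r}, expected one of {sorted(stages)}")
    return f"s3://{bucket}/{stage}/{dataset}/{filename}"

print(s3_uri("my-ml-bucket", "processed", "churn", "part-0000.parquet"))
```

Keeping every pipeline stage under its own prefix also lets you set lifecycle rules (e.g. Intelligent-Tiering on `/raw/`) per prefix.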
Amazon Redshift
- Columnar data warehouse — fast analytical queries
- Redshift ML — create ML models using SQL (`CREATE MODEL` statement) — runs SageMaker under the hood
- Redshift Spectrum — query data directly in S3 without loading into Redshift
- Use for: structured data, BI dashboards, large-scale SQL analytics
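The `CREATE MODEL` statement looks like the sketch below (held in a Python string here for consistency with the other examples). The table, columns, role ARN, and bucket are illustrative placeholders, not real resources.

```python
# Hedged sketch of a Redshift ML CREATE MODEL statement. All identifiers
# (customer_churn, customers, the IAM role, the S3 bucket) are placeholders.
CREATE_MODEL_SQL = """
CREATE MODEL customer_churn
FROM (SELECT age, tenure, monthly_spend, churned FROM customers)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');
"""
print(CREATE_MODEL_SQL.strip().splitlines()[0])
```

Behind the scenes Redshift exports the query result to S3 and launches a SageMaker Autopilot job; the trained model is then callable in SQL via `predict_churn(...)`.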
Athena
- Serverless SQL on S3 — no infrastructure, pay per query
- Supports: CSV, JSON, Parquet, ORC, Avro
- Integrates with Glue Data Catalog for schema discovery
- Parquet/ORC recommended — columnar, compressed, much faster + cheaper than CSV
- Use for: ad-hoc exploration, quick EDA on S3 data
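A typical Athena setup over Parquet looks like the two statements below (shown as Python strings; the database, table, columns, and S3 paths are hypothetical). The point of the partition column is that the `WHERE dt = ...` filter prunes partitions, so Athena scans — and bills — only one day of data.

```python
# Illustrative Athena DDL + query; ml_db, events, and the S3 path are
# placeholders. Parquet + partitioning keeps the scanned bytes small.
DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS ml_db.events (
    user_id string,
    event_type string,
    value double
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://my-bucket/processed/events/';
"""

QUERY = """
SELECT event_type, COUNT(*) AS n
FROM ml_db.events
WHERE dt = '2024-01-01'   -- partition pruning: only one day is scanned
GROUP BY event_type;
"""
print("statements:", sum(s.count("FROM") for s in (DDL, QUERY)))
```

With a Glue Crawler pointed at the same location, the DDL step can be skipped — the crawler writes the table into the Glue Data Catalog that Athena reads.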
AWS Glue
The primary ETL service for ML data preparation.
Components
| Component | Purpose |
|---|---|
| Glue Data Catalog | Centralised metadata store — tables, schemas, partitions |
| Glue Crawlers | Auto-discover schema from S3/RDS/Redshift, populate Data Catalog |
| Glue ETL Jobs | Spark-based (PySpark/Scala) or Python shell scripts |
| Glue DataBrew | No-code visual data preparation and cleaning |
| Glue Studio | Visual ETL pipeline builder |
| Glue Streaming | Real-time ETL on Kinesis / Kafka streams |
Glue ETL — Common Transformations for ML
```python
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read from the Data Catalog (database/table names are illustrative),
# transform with Spark DataFrame ops, then convert back to a DynamicFrame
df = glueContext.create_dynamic_frame.from_catalog(
    database="ml_db", table_name="raw_events").toDF()
df = df.dropna()                                    # drop rows with nulls
df = df.filter(df.age > 18)                         # filter rows
df = df.withColumnRenamed("old_name", "new_name")   # rename a column

glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glueContext, "cleaned"),
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://bucket/processed/"},
)
```
Glue DataBrew
- No-code visual tool for data cleaning
- 250+ built-in transformations (normalise, deduplicate, fix data types, fill missing)
- Profiles data — auto-detects issues (missing values, outliers, type mismatches)
- Publishes cleaned data to S3
- Use when: data analysts/domain experts (non-engineers) need to clean data
Amazon Kinesis — Streaming Data for ML
| Service | Purpose | Retention |
|---|---|---|
| Kinesis Data Streams | Real-time data ingestion, custom consumers | 1–365 days |
| Kinesis Data Firehose | Managed delivery to S3/Redshift/OpenSearch | No storage — buffers, then delivers (near-real-time) |
| Kinesis Data Analytics | SQL/Apache Flink on streams | In-flight processing |
| MSK (Kafka) | Managed Kafka — alternative to KDS | Configurable |
Kinesis Data Streams — Key Facts
- Data split into shards
- 1 shard = 1 MB/s write (1,000 records/sec), 2 MB/s read
- Enhanced Fan-Out — 2 MB/s per consumer per shard (parallel consumers)
- Partition key determines which shard a record goes to — use high-cardinality keys to avoid hot shards
- KCL (Kinesis Client Library) — manages shard reading, checkpointing
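Why high-cardinality partition keys matter can be seen by mimicking the routing: Kinesis takes the MD5 hash of the partition key as a 128-bit integer and maps it onto shard hash-key ranges. The sketch below simplifies the real per-shard ranges to an even split, which is enough to show the hot-shard effect.

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Simplified Kinesis routing: MD5(partition key) as a 128-bit int,
    mapped onto num_shards evenly split hash-key ranges."""
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return h * num_shards // 2**128

# Low-cardinality key: every record lands on the same shard (hot shard).
hot = {shard_for_key("device-type-A", 4) for _ in range(1000)}
# High-cardinality keys (e.g. per-device IDs) spread load across shards.
spread = {shard_for_key(f"device-{i}", 4) for i in range(1000)}
print("shards used:", len(hot), "vs", len(spread))
```

A hot shard caps the whole stream at that one shard's 1 MB/s write limit, no matter how many shards you provision.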
ML with Streaming
IoT / Clickstream / Logs
↓
Kinesis Data Streams
↓
Lambda / Kinesis Analytics ← real-time feature computation
↓
SageMaker Endpoint ← real-time inference
↓
Kinesis Firehose → S3 ← store for future retraining
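The Lambda step in the pipeline above receives base64-encoded Kinesis records. A minimal handler sketch, using only the standard library (the payload fields are hypothetical, and the SageMaker call a real handler would make via the `sagemaker-runtime` client is left as a comment):

```python
import base64
import json

def handler(event, context):
    """Decode Kinesis records and extract features for inference.
    A real handler would then call a SageMaker endpoint, e.g. via
    boto3.client("sagemaker-runtime").invoke_endpoint(...) (not shown)."""
    features = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        features.append({"session_id": payload["session_id"],
                         "clicks": payload["clicks"]})
    return features

# Local smoke test with a hand-built Kinesis event.
evt = {"Records": [{"kinesis": {
    "data": base64.b64encode(
        json.dumps({"session_id": "s1", "clicks": 3}).encode()).decode()}}]}
print(handler(evt, None))
```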
Data Labelling — SageMaker Ground Truth
- Managed data labelling service — human labellers + ML-assisted labelling
- Integrates with: Amazon Mechanical Turk, private workforce, AWS Marketplace vendors
Workflow
Raw Data (S3) → Labelling Job → Human Workers → Labelled Dataset (S3)
Label Types Supported
- Image classification, bounding boxes, semantic segmentation, keypoints
- Text classification, named entity recognition
- Video classification, object tracking
Automated Data Labelling
- Active Learning loop — model learns from confident labels, sends uncertain ones to humans
- Reduces labelling cost by up to 70% (humans only label ambiguous samples)
- Consolidation algorithm (Expectation Maximisation) — combines multiple worker annotations
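The idea of consolidation can be sketched with a simple majority vote. Note this is a deliberate simplification: Ground Truth's actual scheme is probabilistic (EM-style) and also estimates per-worker quality rather than weighting all workers equally.

```python
from collections import Counter

def consolidate(annotations: list[str]) -> tuple[str, float]:
    """Majority-vote consolidation of multiple worker labels for one item.
    Returns (winning label, fraction of workers who agreed)."""
    counts = Counter(annotations)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(annotations)

label, agreement = consolidate(["cat", "cat", "dog"])
print(label, round(agreement, 2))
```

Low agreement scores flag exactly the ambiguous items that the active-learning loop routes back to human labellers.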
Ground Truth Plus
- Fully managed labelling with AWS-managed workforce
- Higher quality, less setup — use when you don't want to manage workforce yourself
SageMaker Feature Store
Central repository to store, discover, and share ML features.
```mermaid
graph LR
Data[Raw Data] --> FE[Feature Engineering<br/>Code]
FE --> FS[Feature Store]
FS --> Online[Online Store<br/>Low-latency serving<br/>Real-time inference]
FS --> Offline[Offline Store<br/>S3-backed<br/>Training]
style FS fill:#dbeafe,stroke:#3b82f6
style Online fill:#dcfce7,stroke:#16a34a
style Offline fill:#fef9c3,stroke:#ca8a04
```
| | Online Store | Offline Store |
|---|---|---|
| Backend | DynamoDB | S3 (Parquet) |
| Latency | Single-digit ms | Minutes |
| Use for | Real-time inference | Model training |
| Consistency | Eventually consistent | Point-in-time correct |
Key Concepts
- Feature Group — a collection of related features (like a table)
- Record Identifier — unique key per entity (customer_id, product_id)
- Event Time — timestamp of when feature value was valid → enables point-in-time lookups (prevents training-serving skew)
- Point-in-time correctness — when creating training dataset, only features available at prediction time are used → prevents leakage
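The point-in-time lookup described above reduces to: for each training row, take the latest feature value whose event time is at or before the row's timestamp. A stdlib-only sketch (feature values and timestamps are made up):

```python
from bisect import bisect_right

def point_in_time_value(history, event_time):
    """history: list of (event_time, value), sorted ascending by time.
    Return the latest value recorded at or before event_time — i.e. only
    information that was actually available at prediction time."""
    times = [t for t, _ in history]
    i = bisect_right(times, event_time)
    return history[i - 1][1] if i else None

# Hypothetical spend feature for one customer over time.
spend = [(1, 10.0), (5, 50.0), (9, 90.0)]
print(point_in_time_value(spend, 6))  # row at t=6 sees 50.0, not 90.0
print(point_in_time_value(spend, 0))  # before any record: no value
```

Using a value recorded *after* the label's timestamp would leak future information into training — exactly the leakage the event-time mechanism prevents.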
Why Feature Store?
- Eliminates duplicate feature computation across teams
- Eliminates training-serving skew (same features used in training and inference)
- Time-travel queries — get historical feature values for a given timestamp
Data Quality & Exploration
SageMaker Data Wrangler
- Visual data preparation tool inside SageMaker Studio
- Connect to: S3, Athena, Redshift, EMR, Feature Store
- 300+ built-in transforms
- Auto generates feature importance, data insights report
- Exports to: SageMaker Pipelines, Jupyter Notebook, Feature Store, Training Job
SageMaker Clarify — Data Bias Detection
- Detects pre-training bias in datasets (before model training)
- Bias metrics:
- Class Imbalance (CI) — imbalanced representation of groups
- Difference in Proportions of Labels (DPL) — different positive label rates across groups
- Kullback-Leibler (KL) divergence between distributions
- Generates bias report → guides whether to resample, reweight data
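The three metrics above are simple to compute by hand, which helps for exam intuition. The sketch below uses made-up group sizes and label rates; the formulas follow the standard definitions (CI as a normalised count difference, DPL as a difference in positive-label proportions, KL over discrete label distributions).

```python
import math

def class_imbalance(n_a: int, n_d: int) -> float:
    """CI = (n_a - n_d) / (n_a + n_d); in [-1, 1], 0 means balanced groups."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(pos_a: int, n_a: int, pos_d: int, n_d: int) -> float:
    """Difference in proportions of positive labels between two groups."""
    return pos_a / n_a - pos_d / n_d

def kl_divergence(p, q):
    """KL(P || Q) between two discrete label distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative numbers: 800 vs 200 samples, positive rates 0.5 vs 0.2.
print(class_imbalance(800, 200))                       # 0.6
print(round(dpl(400, 800, 40, 200), 3))                # 0.3
print(round(kl_divergence([0.5, 0.5], [0.2, 0.8]), 3))
```

Large CI suggests resampling the under-represented group; large DPL suggests the labels themselves differ across groups, which resampling alone won't fix.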
Data Ingestion Patterns
Batch Ingestion
Source (DB/Files) → S3 → Glue ETL → S3 Processed → SageMaker Training
- Use: historical data, periodic retraining
- Tools: Glue, EMR, AWS DMS (Database Migration Service)
Streaming Ingestion
Source → Kinesis Data Streams → Lambda/KDA → Feature Store → SageMaker Endpoint
- Use: real-time features, online learning
- Tools: Kinesis, MSK, Lambda, Flink
Hybrid (Lambda Architecture)
Batch Layer: S3 → Glue → Offline Feature Store → Training
Speed Layer: Kinesis → Lambda → Online Feature Store → Inference
Serving Layer: Merge batch + streaming features → Prediction
EMR for ML Data Preparation
- Managed Hadoop / Spark cluster
- Use for: very large datasets (TBs), custom Spark transformations, Spark MLlib
- Supports: Spark, Hive, Presto, HBase, Flink
- EMR vs Glue:
- Glue = fully managed, serverless, simpler
- EMR = more control, custom libraries, very large scale
Data Formats for ML
| Format | Best For | Notes |
|---|---|---|
| CSV | Simple, human-readable | Slow, large, no compression |
| JSON | Semi-structured | Verbose, slow |
| Parquet | Analytics, ML | Columnar, compressed, fast for queries |
| ORC | Hive workloads | Columnar, good for Hive/EMR |
| TFRecord | TensorFlow | Binary, efficient for TF training |
| RecordIO | MXNet | Binary, SageMaker built-in algorithms |
| LibSVM | Sparse data, XGBoost | Efficient for sparse features |
SageMaker built-in algorithms often prefer RecordIO or CSV. XGBoost accepts CSV and LibSVM. Always check the algorithm input format in the docs.
Data Pipeline Orchestration
AWS Step Functions
- Visual workflow orchestration — chain Lambda, Glue, SageMaker steps
- Good for: complex multi-step pipelines with branching logic
Amazon MWAA (Managed Airflow)
- Managed Apache Airflow — Python DAG-based orchestration
- Use for: complex data engineering pipelines, team already using Airflow
EventBridge + Lambda
- Event-driven triggers — run pipeline when new data lands in S3
- Lightweight, serverless
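A minimal sketch of the Lambda side of this trigger, assuming the EventBridge S3 "Object Created" event shape (bucket and key names are placeholders, and the actual pipeline start — a Glue job or Step Functions execution via boto3 — is left as a comment):

```python
def handler(event, context):
    """Triggered by an EventBridge 'Object Created' event from S3.
    Extracts the new object's location, then would kick off the pipeline,
    e.g. boto3.client("glue").start_job_run(...) (not shown)."""
    bucket = event["detail"]["bucket"]["name"]
    key = event["detail"]["object"]["key"]
    return bucket, key

# Local smoke test with a hand-built event in the EventBridge S3 shape.
evt = {"detail-type": "Object Created", "source": "aws.s3",
       "detail": {"bucket": {"name": "ml-bucket"},
                  "object": {"key": "raw/events/2024-01-01.parquet"}}}
print(handler(evt, None))
```

Filtering the EventBridge rule on the `raw/` key prefix keeps the pipeline from re-triggering on its own processed output.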
Domain 1 — Exam Scenarios
| Scenario | Answer |
|---|---|
| Discover schema of S3 data automatically | Glue Crawler |
| Run SQL on S3 without loading to DB | Athena |
| No-code data cleaning for domain expert | Glue DataBrew |
| Real-time feature serving for inference | Feature Store — Online Store |
| Training dataset from historical features | Feature Store — Offline Store |
| Prevent training-serving skew | Feature Store with event time |
| Label 10,000 images cheaply | Ground Truth with Active Learning |
| Stream IoT data to S3 for training | Kinesis Firehose → S3 |
| Real-time streaming ML pipeline | Kinesis Data Streams → Lambda → SageMaker Endpoint |
| Large-scale Spark ETL with custom libs | EMR |
| Managed ETL, serverless, simple | Glue |
| Detect bias in training data | SageMaker Clarify (pre-training) |
| Visual ML data prep inside Studio | SageMaker Data Wrangler |
| Fast columnar file format for training | Parquet |
| Point-in-time correct training dataset | Feature Store with event time |