
Domain 2 — ML Model Development (26%)

Focus: SageMaker training jobs, built-in algorithms, custom containers, hyperparameter tuning, and SageMaker JumpStart.


SageMaker Overview

```mermaid
graph TD
    Studio[SageMaker Studio<br/>Web IDE] --> NB[Notebooks]
    Studio --> DW[Data Wrangler]
    Studio --> Exp[Experiments]
    Studio --> MR[Model Registry]
    Studio --> Pipe[Pipelines]

    Train[Training] --> BI[Built-in Algorithms]
    Train --> Script[Script Mode<br/>Your code + SM framework]
    Train --> BYOC[BYOC<br/>Custom Docker]
    Train --> JS[JumpStart<br/>Pre-trained models]

    style Studio fill:#dbeafe,stroke:#3b82f6
    style Train fill:#dcfce7,stroke:#16a34a
```

SageMaker Training Job

  • Managed training on EC2 instances — you don't manage servers
  • Input: S3 data → Train → Output: model artefacts to S3
```mermaid
graph LR
    S3D[S3<br/>Training Data] --> TJ[Training Job<br/>EC2 Instance]
    TJ --> S3M[S3<br/>model.tar.gz]
    CW[CloudWatch<br/>Logs + Metrics] -.-> TJ
```
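The artefact output above is literally a gzipped tar archive. A minimal sketch (hypothetical helper, stdlib only) of producing the model.tar.gz layout a training job uploads to S3:

```python
import tarfile
import tempfile
from pathlib import Path

def package_model(artefact_dir: Path, output_path: Path) -> Path:
    """Bundle trained artefacts into the flat model.tar.gz layout SageMaker uploads to S3."""
    with tarfile.open(output_path, "w:gz") as tar:
        for f in artefact_dir.iterdir():
            tar.add(f, arcname=f.name)  # files sit at the archive root
    return output_path

# Toy usage with a temporary file standing in for a real artefact
tmp = Path(tempfile.mkdtemp())
art = tmp / "artefacts"
art.mkdir()
(art / "model.pth").write_bytes(b"weights")
out = package_model(art, tmp / "model.tar.gz")
with tarfile.open(out) as tar:
    names = tar.getnames()
```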

Training Input Modes

| Mode | How | Use When |
|---|---|---|
| File Mode | Downloads the entire dataset to the instance before training | Small–medium datasets |
| Pipe Mode | Streams data directly from S3 to the training container | Large datasets; reduce startup time |
| Fast File Mode | Streams like Pipe mode but exposes a POSIX file interface like File mode | Default recommendation |
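The modes can be caricatured in plain Python (this is an analogy, not the SageMaker API): File mode materialises everything up front, while Pipe mode hands records over lazily:

```python
from typing import Iterator, List

def file_mode_load(records: List[str]) -> List[str]:
    """File mode analogue: the whole dataset is materialised before training starts."""
    return list(records)

def pipe_mode_stream(records: List[str]) -> Iterator[str]:
    """Pipe mode analogue: records are yielded on demand, nothing downloaded up front."""
    for r in records:
        yield r

dataset = [f"row-{i}" for i in range(5)]
eager = file_mode_load(dataset)     # all 5 rows held at once
lazy = pipe_mode_stream(dataset)    # no rows read yet
first = next(lazy)                  # rows arrive as training consumes them
```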

Training Instance Types

  • CPU: ml.m5, ml.c5 — tabular ML, light training
  • GPU: ml.p3, ml.p4 — deep learning
  • Multi-GPU: p3.16xlarge, p4d.24xlarge — large model training
  • Graviton: ml.m6g — cost-efficient CPU

Distributed Training

| Strategy | How | Use When |
|---|---|---|
| Data Parallelism | Split data across GPUs; each GPU holds a full model copy | Large datasets |
| Model Parallelism | Split the model across GPUs | Model too large to fit on one GPU (LLMs) |
| SageMaker Distributed Training Library | Optimised for AWS — SageMaker Data Parallel (SMDDP) | Multi-GPU / multi-node |
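A toy sketch of the data-parallel pattern (hypothetical names, no real GPUs): shard the batch, compute local gradients, then average them so every replica applies the same update:

```python
from typing import List

def split_batch(batch: List[float], workers: int) -> List[List[float]]:
    """Data parallelism: each worker gets a shard of the batch (the model is replicated)."""
    return [batch[i::workers] for i in range(workers)]

def local_gradient(shard: List[float], weight: float) -> float:
    """Toy gradient of mean squared error for y = weight * x with target 0."""
    return sum(2 * weight * x * x for x in shard) / len(shard)

def allreduce_mean(grads: List[float]) -> float:
    """All-reduce step: average gradients so every replica applies the same update."""
    return sum(grads) / len(grads)

batch = [1.0, 2.0, 3.0, 4.0]
shards = split_batch(batch, workers=2)          # [[1.0, 3.0], [2.0, 4.0]]
grads = [local_gradient(s, weight=0.5) for s in shards]
update = allreduce_mean(grads)                  # identical update on every worker
```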

Spot Training

  • Use Managed Spot Training to save up to 90% cost
  • Must use checkpointing to S3 — job resumes from checkpoint if interrupted
  • Set max_wait > max_run (max_wait = total time allowed, including interruptions)
  • Best for: non-urgent training jobs, experiments
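The checkpoint-and-resume pattern can be sketched with the stdlib (a local file standing in for the S3 checkpoint path; all names hypothetical):

```python
import json
import tempfile
from pathlib import Path

def train(total_epochs: int, ckpt_path: Path) -> int:
    """Run epochs, resuming from the last checkpoint if one exists (spot interruption safety)."""
    start = 0
    if ckpt_path.exists():
        start = json.loads(ckpt_path.read_text())["epoch"] + 1
    for epoch in range(start, total_epochs):
        # ... one epoch of real training would run here ...
        ckpt_path.write_text(json.dumps({"epoch": epoch}))  # persist progress (to S3 in real jobs)
    return start

ckpt = Path(tempfile.mkdtemp()) / "checkpoint.json"
first_start = train(5, ckpt)                     # fresh run: starts at epoch 0
ckpt.write_text(json.dumps({"epoch": 2}))        # simulate a spot interruption after epoch 2
resumed_start = train(5, ckpt)                   # picks up at epoch 3 instead of restarting
```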

SageMaker — Training Modes

1. Built-in Algorithms

  • Pre-packaged algorithms, no code required
  • Highly optimised for AWS infrastructure (distributed, GPU-enabled)
  • Input: specific formats (CSV, RecordIO, Parquet)

2. Script Mode

  • Bring your own training script (Python)
  • AWS manages the framework container (TensorFlow, PyTorch, SKLearn, MXNet, XGBoost)
  • Most common pattern in practice
```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    role=role,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    framework_version='2.0',
    py_version='py39',
    hyperparameters={'epochs': 10, 'lr': 0.001}
)
estimator.fit({'training': 's3://bucket/data/'})
```

3. BYOC (Bring Your Own Container)

  • Full control — custom Docker image pushed to ECR
  • Use when: unsupported framework, complex dependencies, custom runtime
  • Must implement: /opt/ml/input/, /opt/ml/model/, /opt/ml/output/ paths
  • Entry point: /opt/ml/code/train
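A minimal sketch of a BYOC train entry point honouring that directory contract (the base path is parameterised here so the sketch is runnable; a real container uses /opt/ml directly):

```python
import json
import tempfile
from pathlib import Path

def run_training(base: Path) -> Path:
    """Sketch of a BYOC train entry point following the /opt/ml directory contract."""
    hp_file = base / "input" / "config" / "hyperparameters.json"
    hyperparams = json.loads(hp_file.read_text()) if hp_file.exists() else {}
    data_dir = base / "input" / "data" / "training"   # one subdirectory per input channel
    # ... real training would read files under data_dir here ...
    model_dir = base / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    artefact = model_dir / "model.json"               # contents of model/ become model.tar.gz
    artefact.write_text(json.dumps({"trained_with": hyperparams}))
    return artefact

base = Path(tempfile.mkdtemp())                       # stands in for /opt/ml
(base / "input" / "config").mkdir(parents=True)
(base / "input" / "config" / "hyperparameters.json").write_text('{"epochs": "10"}')
artifact = run_training(base)
```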

4. SageMaker JumpStart

  • Pre-trained models from Hugging Face, TensorFlow Hub, PyTorch Hub
  • One-click fine-tune or deploy
  • Models: BERT, GPT-2, Stable Diffusion, Llama, etc.
  • Also includes Solution Templates (end-to-end ML solutions)

SageMaker Built-in Algorithms

Tabular / Classification / Regression

| Algorithm | Type | Input Format | Notes |
|---|---|---|---|
| Linear Learner | Classification + Regression | RecordIO, CSV | Fast, interpretable baseline; normalise input |
| XGBoost | Classification + Regression | CSV, LibSVM, Parquet | Best for tabular; most-used in exams |
| K-Nearest Neighbours (KNN) | Classification + Regression | RecordIO, CSV | Lazy learner; good for recommendation |
| Factorisation Machines | Classification + Regression | RecordIO (sparse) | Sparse data, click prediction, recommendations |

Clustering / Dimensionality Reduction

| Algorithm | Type | Notes |
|---|---|---|
| K-Means | Clustering | Specify K; uses a modified Lloyd's algorithm |
| PCA | Dimensionality Reduction | Two modes: regular (sparse data, moderate size), randomised (large datasets, approximate) |
| LDA (Latent Dirichlet Allocation) | Topic Modelling | Unsupervised; finds topics in text documents |
| Neural Topic Model (NTM) | Topic Modelling | Neural network-based; alternative to LDA |
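One iteration of Lloyd's algorithm in 1-D, as a toy illustration of what K-Means does internally (not SageMaker's optimised implementation):

```python
from typing import List

def lloyd_step(points: List[float], centroids: List[float]) -> List[float]:
    """One Lloyd iteration: assign each point to its nearest centroid, then recompute means."""
    clusters: List[List[float]] = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Empty clusters keep their old centroid
    return [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]

points = [1.0, 2.0, 10.0, 11.0]
centroids = [0.0, 12.0]
updated = lloyd_step(points, centroids)   # centroids move toward the cluster means
```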

Anomaly Detection

| Algorithm | Notes |
|---|---|
| Random Cut Forest (RCF) | Unsupervised anomaly detection; streaming data; assigns an anomaly score |
| IP Insights | Detects unusual IP address usage patterns; fraud/security |

NLP / Text

| Algorithm | Type | Notes |
|---|---|---|
| BlazingText | Text Classification + Word2Vec | Two modes: supervised (classification), unsupervised (embeddings) |
| Seq2Seq | Sequence to Sequence | Translation, summarisation; needs tokenised data |
| Object2Vec | Embeddings | Generalised embeddings for pairs (user–item, sentence pairs) |

Computer Vision

| Algorithm | Type | Notes |
|---|---|---|
| Image Classification | Multi-class image classification | ResNet-based; full training or transfer learning |
| Object Detection | Bounding box detection | SSD with VGG/ResNet backbone |
| Semantic Segmentation | Pixel-level classification | FCN, PSPNet, DeepLab V3 |

Time Series

| Algorithm | Notes |
|---|---|
| DeepAR | Probabilistic time series forecasting; trains across multiple related time series; outputs confidence intervals |

Quick Algorithm Selection Guide

Structured/tabular data?
  → XGBoost (classification/regression)
  → Linear Learner (fast baseline)
  → KNN (similarity-based)

Text data?
  → BlazingText (classification or word vectors)
  → LDA / NTM (topic discovery)

Image data?
  → Image Classification (label whole image)
  → Object Detection (find objects, bounding boxes)
  → Semantic Segmentation (pixel-level labels)

Time series?
  → DeepAR (multiple related series, probabilistic)

Anomaly detection?
  → RCF (tabular/time series anomalies)
  → IP Insights (network anomalies)

Sparse/recommendation?
  → Factorisation Machines
  → KNN

Clustering?
  → K-Means

Reduce dimensions?
  → PCA
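The guide above can be condensed into a lookup table (illustrative only, not exhaustive):

```python
def pick_algorithm(data_type: str, task: str) -> str:
    """Encode the decision guide as a simple lookup (hypothetical helper for revision)."""
    guide = {
        ("tabular", "classification"): "XGBoost",
        ("tabular", "baseline"): "Linear Learner",
        ("text", "classification"): "BlazingText",
        ("text", "topics"): "LDA / NTM",
        ("image", "classification"): "Image Classification",
        ("image", "detection"): "Object Detection",
        ("time-series", "forecasting"): "DeepAR",
        ("any", "anomaly"): "Random Cut Forest",
        ("sparse", "recommendation"): "Factorisation Machines",
        ("tabular", "clustering"): "K-Means",
        ("tabular", "dimensionality"): "PCA",
    }
    return guide.get((data_type, task), "no built-in match: consider Script Mode or BYOC")
```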

SageMaker Experiments

  • Track and compare training runs — hyperparameters, metrics, artefacts
  • Experiment → Trial → Trial Components (training job, processing job)
  • Auto-captured when using SageMaker Training Jobs + Studio
  • Compare runs in a table/chart inside Studio

SageMaker Hyperparameter Tuning (HPO)

Automatically finds the best hyperparameter combination.

Search Strategies

| Strategy | How | Use When |
|---|---|---|
| Bayesian Optimisation | Uses past results to pick the next candidate smartly | Default; efficient for expensive training |
| Random Search | Random sampling across parameter ranges | Simple, good baseline |
| Grid Search | Tries all combinations | Only when the search space is tiny |
| Hyperband | Runs many jobs briefly, promotes promising ones | Fast, resource-efficient |

Configuration

```python
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='validation:rmse',
    objective_type='Minimize',
    hyperparameter_ranges={
        'num_round': IntegerParameter(50, 500),
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10)
    },
    max_jobs=20,
    max_parallel_jobs=5
)
```

Best Practices

  • Use Bayesian for continuous parameters, small budget
  • Set max_parallel_jobs < max_jobs (Bayesian needs sequential feedback)
  • Use warm starting — re-use results from previous tuning jobs
  • Tune on validation metric, not training metric
  • Start with wide ranges → narrow after first run
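For intuition, random search is only a few lines; each objective call stands in for a full training job (toy objective, hypothetical names):

```python
import random

def random_search(objective, ranges, max_jobs: int, seed: int = 0):
    """Toy random-search HPO: sample each range uniformly, keep the best candidate."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(max_jobs):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        score = objective(params)      # in SageMaker this would be a full training job
        if score < best_score:         # 'Minimize' objective
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective with its minimum at eta=0.1, max_depth=6
objective = lambda p: (p["eta"] - 0.1) ** 2 + (p["max_depth"] - 6) ** 2
best, score = random_search(objective, {"eta": (0.01, 0.3), "max_depth": (3, 10)}, max_jobs=50)
```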

SageMaker Processing Jobs

  • Run data preprocessing, feature engineering, and model evaluation as managed jobs
  • Uses SKLearnProcessor, PySparkProcessor, FrameworkProcessor
  • Separates processing from training — cleaner pipeline
```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=1
)

processor.run(
    code='preprocessing.py',
    inputs=[ProcessingInput(source='s3://bucket/raw/', destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination='s3://bucket/processed/')]
)
```

SageMaker Model Registry

  • Version and manage models — track lineage from training to deployment
  • Model Group → container for model versions
  • Model Package → a specific version with metadata (metrics, training data, algorithm)
  • Approval workflow: Pending → Approved → Rejected
  • Trigger auto-deployment pipeline when status → Approved
  • Tracks: training job, evaluation metrics, inference image, artefact location
```mermaid
graph LR
    TJ[Training Job] --> MP[Model Package<br/>Pending]
    MP -->|Review| A[Approved]
    MP -->|Review| R[Rejected]
    A -->|CI/CD trigger| Deploy[SageMaker Endpoint]
```

SageMaker Clarify — Explainability

Post-training explainability using SHAP (SHapley Additive exPlanations):

  • Global explanations — which features matter most overall
  • Local explanations — why the model made a specific prediction
  • Works with tabular, NLP, and CV models
  • Outputs feature importance report + partial dependence plots
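SHAP's underlying idea, the Shapley value, averages each feature's marginal contribution over all feature orderings. An exact brute-force version for a tiny model (toy code, exponential in feature count, not what Clarify runs):

```python
from itertools import permutations
from typing import Callable, Dict

def shapley_values(predict: Callable[[Dict[str, float]], float],
                   instance: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values: average each feature's marginal contribution over all orderings."""
    features = list(instance)
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)           # start from the baseline input
        prev = predict(current)
        for f in order:
            current[f] = instance[f]       # reveal this feature's actual value
            now = predict(current)
            contrib[f] += now - prev       # marginal contribution in this ordering
            prev = now
    return {f: c / len(orderings) for f, c in contrib.items()}

# Toy linear model: attributions are exactly weight * (value - baseline)
predict = lambda x: 2.0 * x["income"] + 1.0 * x["age"]
phi = shapley_values(predict, {"income": 3.0, "age": 4.0}, {"income": 0.0, "age": 0.0})
```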

Model Evaluation in SageMaker

SageMaker Model Evaluation Step (Pipelines)

```python
# In SageMaker Pipelines
from sagemaker.workflow.steps import ProcessingStep

evaluation_step = ProcessingStep(
    name='EvaluateModel',
    processor=script_processor,
    inputs=[model_output, test_data],   # ProcessingInput objects
    outputs=[evaluation_output],        # ProcessingOutput objects
    code='evaluate.py'
)
```

Offline Evaluation

  • Use a Processing Job to run evaluation script on test set
  • Output: evaluation.json with metrics
  • Feed into Model Registry as model metadata
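A sketch of what an evaluate.py might write (the JSON schema here is illustrative, not a required format):

```python
import json
import tempfile
from pathlib import Path

def write_evaluation_report(y_true, y_pred, out_dir: Path) -> Path:
    """Emit an evaluation.json the way an evaluate.py processing script might."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    report = {
        "binary_classification_metrics": {
            "accuracy": {"value": correct / len(y_true)}
        }
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "evaluation.json"
    path.write_text(json.dumps(report))
    return path

path = write_evaluation_report([1, 0, 1, 1], [1, 0, 0, 1], Path(tempfile.mkdtemp()))
```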

A/B Testing / Shadow Testing

  • Deploy two model versions to the same endpoint
  • Route % traffic to each
  • Compare metrics before fully switching → see [[03 - Deployment & MLOps]]
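Weighted traffic splitting between variants boils down to weighted sampling; a toy router (hypothetical names, not the SageMaker endpoint API):

```python
import random

def route(variant_weights, rng):
    """Pick a variant with probability proportional to its traffic weight."""
    variants, weights = zip(*variant_weights.items())
    return rng.choices(variants, weights=weights, k=1)[0]

# Send 90% of requests to model-a, 10% to model-b
rng = random.Random(42)
counts = {"model-a": 0, "model-b": 0}
for _ in range(1000):
    counts[route({"model-a": 0.9, "model-b": 0.1}, rng)] += 1
```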

Domain 2 — Exam Scenarios

| Scenario | Answer |
|---|---|
| Best algorithm for tabular classification | XGBoost |
| Fast linear baseline model | Linear Learner |
| Classify images into categories | Image Classification |
| Detect objects with bounding boxes | Object Detection |
| NLP text classification | BlazingText (supervised mode) |
| Word embeddings / word2vec | BlazingText (unsupervised mode) |
| Time series forecasting, multiple series | DeepAR |
| Anomaly detection in streaming data | Random Cut Forest |
| Recommendation system, sparse data | Factorisation Machines |
| Topic modelling on documents | LDA or NTM |
| Find optimal hyperparameters efficiently | SageMaker HPO with Bayesian |
| Save 90% on training cost | Managed Spot Training + checkpointing |
| Use Hugging Face BERT with one click | SageMaker JumpStart |
| Custom ML framework not supported | BYOC (custom Docker in ECR) |
| Version and approve models before deploy | SageMaker Model Registry |
| Explain which features drove a prediction | SageMaker Clarify (SHAP) |
| Track and compare training runs | SageMaker Experiments |
| Large dataset, avoid downloading to disk | Pipe Mode or Fast File Mode |
| Distributed training for large model | SageMaker Model Parallel |
| Distributed training for large dataset | SageMaker Data Parallel |