Domain 2 — ML Model Development (26%)
Focus: SageMaker training jobs, built-in algorithms, custom containers, hyperparameter tuning, and SageMaker JumpStart.
SageMaker Overview
```mermaid
graph TD
    Studio[SageMaker Studio<br/>Web IDE] --> NB[Notebooks]
    Studio --> DW[Data Wrangler]
    Studio --> Exp[Experiments]
    Studio --> MR[Model Registry]
    Studio --> Pipe[Pipelines]
    Train[Training] --> BI[Built-in Algorithms]
    Train --> Script[Script Mode<br/>Your code + SM framework]
    Train --> BYOC[BYOC<br/>Custom Docker]
    Train --> JS[JumpStart<br/>Pre-trained models]
    style Studio fill:#dbeafe,stroke:#3b82f6
    style Train fill:#dcfce7,stroke:#16a34a
```
SageMaker Training Job
- Managed training on EC2 instances — you don't manage servers
- Input: S3 data → Train → Output: model artefacts to S3
```mermaid
graph LR
    S3D[S3<br/>Training Data] --> TJ[Training Job<br/>EC2 Instance]
    TJ --> S3M[S3<br/>model.tar.gz]
    CW[CloudWatch<br/>Logs + Metrics] -.-> TJ
```
Training Input Modes
| Mode | How | Use When |
|---|---|---|
| File Mode | Downloads the entire dataset to the instance before training starts | Small-medium datasets |
| Pipe Mode | Streams data directly from S3 into the training container | Large datasets; reduces startup time |
| Fast File Mode | POSIX file interface of File mode with the on-demand S3 streaming of Pipe mode | Default recommendation, especially for large datasets |
Training Instance Types
- CPU: ml.m5, ml.c5 — tabular ML, light training
- GPU: ml.p3, ml.p4 — deep learning
- Multi-GPU: p3.16xlarge, p4d.24xlarge — large model training
- Graviton: ml.m6g — cost-efficient CPU
Distributed Training
| Strategy | How | Use When |
|---|---|---|
| Data Parallelism | Split data across GPUs, each has full model copy | Large datasets |
| Model Parallelism | Split model across GPUs | Model too large to fit one GPU (LLMs) |
| SageMaker Distributed Training Library | Optimised for AWS — SageMaker Data Parallel (SMDDP) | Multi-GPU / multi-node |
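The data-parallel strategy can be sketched without any framework: each worker holds a full copy of the model, computes gradients on its own shard of the batch, and the gradients are averaged (an all-reduce) before every worker applies the same update. A minimal pure-Python illustration with a hypothetical 1-D linear model:

```python
# Data parallelism sketch: each "worker" has a full model copy and
# computes gradients only on its shard of the batch.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_step(w, batch_x, batch_y, n_workers, lr=0.1):
    """One SGD step with the batch split across n_workers."""
    shard = len(batch_x) // n_workers
    grads = []
    for k in range(n_workers):                 # each iteration stands in for one GPU
        xs = batch_x[k * shard:(k + 1) * shard]
        ys = batch_y[k * shard:(k + 1) * shard]
        grads.append(grad_mse(w, xs, ys))
    avg_grad = sum(grads) / n_workers          # the "all-reduce"
    return w - lr * avg_grad                   # every worker applies the same update

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true relationship: y = 2x
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, xs, ys, n_workers=2)
print(round(w, 3))  # converges towards 2.0
```

SMDDP implements the same pattern, but with an all-reduce optimised for AWS network topology instead of this naive averaging loop.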
Spot Training
- Use Managed Spot Training to save up to 90% cost
- Must use checkpointing to S3 — job resumes from checkpoint if interrupted
- Set `max_wait` ≥ `max_run` — `max_wait` is the total allowed time including interruptions
- Best for: non-urgent training jobs, experiments
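The checkpoint-and-resume pattern Spot training depends on can be sketched in plain Python (the file name and state layout here are hypothetical; in SageMaker you would write checkpoints to the local directory mapped to `checkpoint_s3_uri`):

```python
import json
import os
import tempfile

def train(total_epochs, ckpt_path):
    """Resumable training loop: load the last checkpoint if one exists,
    then save progress after every epoch so a Spot interruption loses
    at most one epoch of work."""
    state = {"epoch": 0, "loss": None}
    if os.path.exists(ckpt_path):                  # resuming after an interruption
        with open(ckpt_path) as f:
            state = json.load(f)
    for epoch in range(state["epoch"], total_epochs):
        state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}  # dummy "training"
        with open(ckpt_path, "w") as f:            # checkpoint after each epoch
            json.dump(state, f)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
train(3, ckpt)            # pretend the job was interrupted after 3 epochs
final = train(10, ckpt)   # resumes at epoch 3, not from scratch
print(final["epoch"])     # 10
```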
SageMaker — Training Modes
1. Built-in Algorithms
- Pre-packaged algorithms, no code required
- Highly optimised for AWS infrastructure (distributed, GPU-enabled)
- Input: specific formats (CSV, RecordIO, Parquet)
2. Script Mode
- Bring your own training script (Python)
- AWS manages the framework container (TensorFlow, PyTorch, SKLearn, MXNet, XGBoost)
- Most common pattern in practice
```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',            # your training script (script mode)
    role=role,                         # IAM execution role ARN
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    framework_version='2.0',
    py_version='py39',
    hyperparameters={'epochs': 10, 'lr': 0.001}
)
estimator.fit({'training': 's3://bucket/data/'})   # channel name → S3 prefix
```
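Inside the training container, SageMaker passes the hyperparameters as command-line arguments and exposes the data and model paths through environment variables such as `SM_MODEL_DIR` and `SM_CHANNEL_TRAINING`. A minimal `train.py` skeleton in that contract might look like this (the fallback defaults are assumptions that make it runnable outside SageMaker for debugging):

```python
import argparse
import os

def parse_args(argv=None):
    """Script-mode contract: hyperparameters arrive as CLI flags,
    paths arrive via SageMaker environment variables."""
    p = argparse.ArgumentParser()
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--lr", type=float, default=0.001)
    # SageMaker sets these env vars inside the training container;
    # the fallbacks let the script run locally.
    p.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR", "/tmp/model"))
    p.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAINING", "/tmp/data"))
    return p.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"training for {args.epochs} epochs at lr={args.lr}")
    # ... train, then save artefacts to args.model_dir;
    # SageMaker tars that directory into model.tar.gz on S3.
```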
3. BYOC (Bring Your Own Container)
- Full control — custom Docker image pushed to ECR
- Use when: unsupported framework, complex dependencies, custom runtime
- Must implement the SageMaker container contract: read input from `/opt/ml/input/`, write the trained model to `/opt/ml/model/`, write failure details to `/opt/ml/output/`
- Entry point: the container is invoked with the `train` argument (conventionally an executable script, e.g. `/opt/ml/code/train`)
4. SageMaker JumpStart
- Pre-trained models from Hugging Face, TensorFlow Hub, PyTorch Hub
- One-click fine-tune or deploy
- Models: BERT, GPT-2, Stable Diffusion, Llama, etc.
- Also includes Solution Templates (end-to-end ML solutions)
SageMaker Built-in Algorithms
Tabular / Classification / Regression
| Algorithm | Type | Input Format | Notes |
|---|---|---|---|
| Linear Learner | Classification + Regression | RecordIO, CSV | Fast, interpretable baseline; normalise input |
| XGBoost | Classification + Regression | CSV, LibSVM, Parquet | Best for tabular; most-used in exams |
| K-Nearest Neighbours (KNN) | Classification + Regression | RecordIO, CSV | Lazy learner; good for recommendation |
| Factorisation Machines | Classification + Regression | RecordIO (sparse) | Sparse data, click prediction, recommendations |
Clustering / Dimensionality Reduction
| Algorithm | Type | Notes |
|---|---|---|
| K-Means | Clustering | Specify K; uses modified Lloyd's algorithm |
| PCA | Dimensionality Reduction | Two modes: regular (sparse data, moderate scale), randomised (large datasets, approximate) |
| LDA (Latent Dirichlet Allocation) | Topic Modelling | Unsupervised; finds topics in text documents |
| Neural Topic Model (NTM) | Topic Modelling | Neural network-based; alternative to LDA |
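K-Means' Lloyd's algorithm alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its assigned points. A bare-bones 1-D sketch of that loop (SageMaker's "modified" version adds smarter seeding and scaling optimisations on top):

```python
def kmeans_1d(points, centroids, iters=10):
    """Plain Lloyd's algorithm on 1-D data."""
    for _ in range(iters):
        # Assignment step: nearest centroid per point.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], centroids=[0.0, 5.0]))
# two clear clusters, around 1.0 and 9.0
```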
Anomaly Detection
| Algorithm | Notes |
|---|---|
| Random Cut Forest (RCF) | Unsupervised anomaly detection; streaming data; assigns anomaly score |
| IP Insights | Detects unusual IP address usage patterns; fraud/security |
NLP / Text
| Algorithm | Type | Notes |
|---|---|---|
| BlazingText | Text Classification + Word2Vec | Two modes: supervised (classification), unsupervised (embeddings) |
| Seq2Seq | Sequence to Sequence | Translation, summarisation; needs tokenised data |
| Object2Vec | Embeddings | Generalised embedding for pairs (user-item, sentence pairs) |
Computer Vision
| Algorithm | Type | Notes |
|---|---|---|
| Image Classification | Multi-class image classification | ResNet-based; full training or transfer learning |
| Object Detection | Bounding box detection | SSD with VGG/ResNet backbone |
| Semantic Segmentation | Pixel-level classification | FCN, PSPNet, DeepLab V3 |
Time Series
| Algorithm | Notes |
|---|---|
| DeepAR | Probabilistic time series forecasting; trains across multiple related time series; outputs confidence intervals |
Quick Algorithm Selection Guide
Structured/tabular data?
→ XGBoost (classification/regression)
→ Linear Learner (fast baseline)
→ KNN (similarity-based)
Text data?
→ BlazingText (classification or word vectors)
→ LDA / NTM (topic discovery)
Image data?
→ Image Classification (label whole image)
→ Object Detection (find objects, bounding boxes)
→ Semantic Segmentation (pixel-level labels)
Time series?
→ DeepAR (multiple related series, probabilistic)
Anomaly detection?
→ RCF (tabular/time series anomalies)
→ IP Insights (network anomalies)
Sparse/recommendation?
→ Factorisation Machines
→ KNN
Clustering?
→ K-Means
Reduce dimensions?
→ PCA
SageMaker Experiments
- Track and compare training runs — hyperparameters, metrics, artefacts
- Experiment → Trial → Trial Components (training job, processing job)
- Auto-captured when using SageMaker Training Jobs + Studio
- Compare runs in a table/chart inside Studio
SageMaker Hyperparameter Tuning (HPO)
Automatically finds the best hyperparameter combination.
Search Strategies
| Strategy | How | Use When |
|---|---|---|
| Bayesian Optimisation | Uses past results to pick next candidate smartly | Default; efficient for expensive training |
| Random Search | Random sampling across parameter ranges | Simple, good baseline |
| Grid Search | Try all combinations | Only when search space is tiny |
| Hyperband | Runs many jobs briefly, promotes promising ones | Fast, resource-efficient |
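Random search, the simplest strategy above, just samples each hyperparameter independently from its range and keeps the best result. A self-contained sketch (the objective function here is a hypothetical stand-in for a validation metric):

```python
import random

def random_search(objective, ranges, n_trials, seed=0):
    """Sample each hyperparameter uniformly from its (low, high) range,
    evaluate, and keep the configuration with the lowest objective."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        score = objective(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Hypothetical objective: pretend validation RMSE is minimised at eta=0.1, max_depth=6.
objective = lambda c: (c["eta"] - 0.1) ** 2 + (c["max_depth"] - 6) ** 2
cfg, score = random_search(objective,
                           {"eta": (0.01, 0.3), "max_depth": (3, 10)},
                           n_trials=200)
print(round(score, 4))
```

Bayesian optimisation differs only in the sampling line: instead of drawing blindly, it fits a model to the `(cfg, score)` pairs seen so far and proposes configurations in promising regions, which is why it needs sequential feedback.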
Configuration
```python
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='validation:rmse',
    objective_type='Minimize',
    hyperparameter_ranges={
        'num_round': IntegerParameter(50, 500),
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10)
    },
    max_jobs=20,          # total training jobs in the search
    max_parallel_jobs=5   # concurrent jobs; keep < max_jobs for Bayesian
)
```
Best Practices
- Use Bayesian for continuous parameters, small budget
- Set max_parallel_jobs < max_jobs (Bayesian needs sequential feedback)
- Use warm starting — re-use results from previous tuning jobs
- Tune on validation metric, not training metric
- Start with wide ranges → narrow after first run
SageMaker Processing Jobs
- Run data preprocessing, feature engineering, and model evaluation as managed jobs
- Uses SKLearnProcessor, PySparkProcessor, FrameworkProcessor
- Separates processing from training — cleaner pipeline
```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=1
)
processor.run(
    code='preprocessing.py',
    inputs=[ProcessingInput(source='s3://bucket/raw/', destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination='s3://bucket/processed/')]
)
```
SageMaker Model Registry
- Version and manage models — track lineage from training to deployment
- Model Group → container for model versions
- Model Package → a specific version with metadata (metrics, training data, algorithm)
- Approval workflow: `PendingManualApproval` → `Approved` or `Rejected`
- Trigger auto-deployment pipeline when status → Approved
- Tracks: training job, evaluation metrics, inference image, artefact location
```mermaid
graph LR
    TJ[Training Job] --> MP[Model Package<br/>Pending]
    MP -->|Review| A[Approved]
    MP -->|Review| R[Rejected]
    A -->|CI/CD trigger| Deploy[SageMaker Endpoint]
```
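The approval flow is a small state machine; a sketch of the guard logic a CI/CD trigger might apply (the status names match the Registry's approval statuses, but the allowed-transition table and the deploy hook are assumptions for illustration):

```python
# Assumed transition rules: which statuses a package may move to.
VALID = {
    "PendingManualApproval": {"Approved", "Rejected"},
    "Approved": {"Rejected"},   # approval can be revoked
    "Rejected": {"Approved"},
}

def update_status(package, new_status, on_approved):
    """Apply an approval-status transition; fire the deploy hook
    (e.g. an EventBridge-triggered pipeline) only on Approved."""
    if new_status not in VALID.get(package["status"], set()):
        raise ValueError(f"cannot go {package['status']} -> {new_status}")
    package["status"] = new_status
    if new_status == "Approved":
        on_approved(package)
    return package

deployed = []
pkg = {"name": "churn-model", "version": 3, "status": "PendingManualApproval"}
update_status(pkg, "Approved", on_approved=lambda p: deployed.append(p["version"]))
print(pkg["status"], deployed)  # Approved [3]
```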
SageMaker Clarify — Explainability
Post-training explainability using SHAP (SHapley Additive exPlanations):
- Global explanations — which features matter most overall
- Local explanations — why the model made a specific prediction
- Works with tabular, NLP, and CV models
- Outputs feature importance report + partial dependence plots
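SHAP's core idea can be shown exactly on a tiny model: a feature's Shapley value is its average marginal contribution to the prediction over all orderings in which features are "switched on". With two features there are only two orderings, so the exact values fit in a few lines (the model and baseline below are hypothetical; Clarify approximates this with sampling-based Kernel SHAP):

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which it can be added to the baseline."""
    n = len(instance)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        x = list(baseline)                      # start from the baseline input
        for feat in order:
            before = predict(x)
            x[feat] = instance[feat]            # "switch on" this feature
            phi[feat] += predict(x) - before    # its marginal contribution
    return [p / len(orderings) for p in phi]

# Hypothetical model: f(x) = 2*x0 + 3*x1 (linear, so the values are intuitive).
predict = lambda x: 2 * x[0] + 3 * x[1]
phi = shapley_values(predict, instance=[1.0, 1.0], baseline=[0.0, 0.0])
print(phi)  # [2.0, 3.0] — feature 1 contributed more to this prediction
```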
Model Evaluation in SageMaker
SageMaker Model Evaluation Step (Pipelines)
```python
from sagemaker.workflow.steps import ProcessingStep

# model_output, test_data, evaluation_output: ProcessingInput/ProcessingOutput
# objects defined elsewhere in the pipeline
evaluation_step = ProcessingStep(
    name='EvaluateModel',
    processor=script_processor,
    inputs=[model_output, test_data],
    outputs=[evaluation_output],
    code='evaluate.py'
)
```
Offline Evaluation
- Use a Processing Job to run an evaluation script on the test set
- Output: `evaluation.json` with metrics
- Feed the metrics into the Model Registry as model package metadata
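A minimal `evaluate.py` in that pattern just computes metrics on held-out predictions and writes them as JSON (the nested report structure here echoes SageMaker's model-quality reports but is a simplified assumption, not the exact schema):

```python
import json
import os
import tempfile

def evaluate(y_true, y_pred, out_dir):
    """Compute a simple classification metric and write evaluation.json,
    which a pipeline step can attach to the model package."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    report = {"classification_metrics": {"accuracy": {"value": accuracy}}}
    path = os.path.join(out_dir, "evaluation.json")
    with open(path, "w") as f:
        json.dump(report, f)
    return path

path = evaluate([1, 0, 1, 1], [1, 0, 0, 1], tempfile.mkdtemp())
print(json.load(open(path)))  # accuracy 0.75
```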
A/B Testing / Shadow Testing
- Deploy two model versions to the same endpoint
- Route % traffic to each
- Compare metrics before fully switching → see [[03 - Deployment & MLOps]]
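Weighted traffic splitting works by sampling against the variant weights per request; a sketch of the routing logic (in SageMaker this is what the production variants' weights control at the endpoint, the code below is only an illustration of the mechanism):

```python
import random

def route(variants, rng):
    """Pick a variant with probability proportional to its weight,
    mirroring how weighted endpoint variants split live traffic."""
    total = sum(variants.values())
    r = rng.uniform(0, total)
    cumulative = 0.0
    for name, weight in variants.items():
        cumulative += weight
        if r <= cumulative:
            return name
    return name  # guard against floating-point edge cases

rng = random.Random(42)
variants = {"model-a": 0.9, "model-b": 0.1}   # 90/10 canary split
counts = {"model-a": 0, "model-b": 0}
for _ in range(10_000):
    counts[route(variants, rng)] += 1
print(counts)  # roughly 9000 vs 1000
```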
Domain 2 — Exam Scenarios
| Scenario | Answer |
|---|---|
| Best algorithm for tabular classification | XGBoost |
| Fast linear baseline model | Linear Learner |
| Classify images into categories | Image Classification |
| Detect objects with bounding boxes | Object Detection |
| NLP text classification | BlazingText (supervised mode) |
| Word embeddings / word2vec | BlazingText (unsupervised mode) |
| Time series forecasting, multiple series | DeepAR |
| Anomaly detection in streaming data | Random Cut Forest |
| Recommendation system, sparse data | Factorisation Machines |
| Topic modelling on documents | LDA or NTM |
| Find optimal hyperparameters efficiently | SageMaker HPO with Bayesian |
| Save 90% on training cost | Managed Spot Training + checkpointing |
| Use Hugging Face BERT with one click | SageMaker JumpStart |
| Custom ML framework not supported | BYOC (custom Docker in ECR) |
| Version and approve models before deploy | SageMaker Model Registry |
| Explain which features drove a prediction | SageMaker Clarify (SHAP) |
| Track and compare training runs | SageMaker Experiments |
| Large dataset, avoid downloading to disk | Pipe Mode or Fast File Mode |
| Distributed training for large model | SageMaker Model Parallel |
| Distributed training for large dataset | SageMaker Data Parallel |