
Domain 2 — ML Model Development (26%)

Focus: SageMaker training jobs, built-in algorithms, custom containers, hyperparameter tuning, and SageMaker JumpStart.


SageMaker Overview

```mermaid
graph TD
    Studio[SageMaker Studio<br/>Web IDE] --> NB[Notebooks]
    Studio --> DW[Data Wrangler]
    Studio --> Exp[Experiments]
    Studio --> MR[Model Registry]
    Studio --> Pipe[Pipelines]

    Train[Training] --> BI[Built-in Algorithms]
    Train --> Script[Script Mode<br/>Your code + SM framework]
    Train --> BYOC[BYOC<br/>Custom Docker]
    Train --> JS[JumpStart<br/>Pre-trained models]

    style Studio fill:#dbeafe,stroke:#3b82f6
    style Train fill:#dcfce7,stroke:#16a34a
```

SageMaker Training Job

  • Managed training on EC2 instances — you don't manage servers
  • Input: S3 data → Train → Output: model artefacts to S3
```mermaid
graph LR
    S3D[S3<br/>Training Data] --> TJ[Training Job<br/>EC2 Instance]
    TJ --> S3M[S3<br/>model.tar.gz]
    CW[CloudWatch<br/>Logs + Metrics] -.-> TJ
```
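The artefact output above is literally a gzipped tar archive. A minimal sketch (hypothetical helper, stdlib only) of producing the model.tar.gz layout a training job uploads to S3:

```python
import tarfile
import tempfile
from pathlib import Path

def package_model(artefact_dir: Path, output_path: Path) -> Path:
    """Bundle trained artefacts into the flat model.tar.gz layout SageMaker uploads to S3."""
    with tarfile.open(output_path, "w:gz") as tar:
        for f in artefact_dir.iterdir():
            tar.add(f, arcname=f.name)  # files sit at the archive root
    return output_path

# Toy usage with a temporary file standing in for a real artefact
tmp = Path(tempfile.mkdtemp())
art = tmp / "artefacts"
art.mkdir()
(art / "model.pth").write_bytes(b"weights")
out = package_model(art, tmp / "model.tar.gz")
with tarfile.open(out) as tar:
    names = tar.getnames()
```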

Training Input Modes

| Mode | How | Use When |
|---|---|---|
| File Mode | Downloads the entire dataset to the instance before training | Small–medium datasets |
| Pipe Mode | Streams data directly from S3 to the training container | Large datasets; reduce startup time |
| Fast File Mode | Streams like Pipe mode but exposes a POSIX file interface like File mode | Default recommendation |
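The modes can be caricatured in plain Python (this is an analogy, not the SageMaker API): File mode materialises everything up front, while Pipe mode hands records over lazily:

```python
from typing import Iterator, List

def file_mode_load(records: List[str]) -> List[str]:
    """File mode analogue: the whole dataset is materialised before training starts."""
    return list(records)

def pipe_mode_stream(records: List[str]) -> Iterator[str]:
    """Pipe mode analogue: records are yielded on demand, nothing downloaded up front."""
    for r in records:
        yield r

dataset = [f"row-{i}" for i in range(5)]
eager = file_mode_load(dataset)     # all 5 rows held at once
lazy = pipe_mode_stream(dataset)    # no rows read yet
first = next(lazy)                  # rows arrive as training consumes them
```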

Training Instance Types

  • CPU: ml.m5, ml.c5 — tabular ML, light training
  • GPU: ml.p3, ml.p4 — deep learning
  • Multi-GPU: p3.16xlarge, p4d.24xlarge — large model training
  • Graviton: ml.m6g — cost-efficient CPU

Distributed Training

| Strategy | How | Use When |
|---|---|---|
| Data Parallelism | Split data across GPUs; each GPU holds a full model copy | Large datasets |
| Model Parallelism | Split the model across GPUs | Model too large to fit on one GPU (LLMs) |
| SageMaker Distributed Training Library | Optimised for AWS — SageMaker Data Parallel (SMDDP) | Multi-GPU / multi-node |
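A toy sketch of the data-parallel pattern (hypothetical names, no real GPUs): shard the batch, compute local gradients, then average them so every replica applies the same update:

```python
from typing import List

def split_batch(batch: List[float], workers: int) -> List[List[float]]:
    """Data parallelism: each worker gets a shard of the batch (the model is replicated)."""
    return [batch[i::workers] for i in range(workers)]

def local_gradient(shard: List[float], weight: float) -> float:
    """Toy gradient of mean squared error for y = weight * x with target 0."""
    return sum(2 * weight * x * x for x in shard) / len(shard)

def allreduce_mean(grads: List[float]) -> float:
    """All-reduce step: average gradients so every replica applies the same update."""
    return sum(grads) / len(grads)

batch = [1.0, 2.0, 3.0, 4.0]
shards = split_batch(batch, workers=2)          # [[1.0, 3.0], [2.0, 4.0]]
grads = [local_gradient(s, weight=0.5) for s in shards]
update = allreduce_mean(grads)                  # identical update on every worker
```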

Spot Training

  • Use Managed Spot Training to save up to 90% cost
  • Must use checkpointing to S3 — job resumes from checkpoint if interrupted
  • Set max_wait > max_run (max_wait = total time allowed, including interruptions)
  • Best for: non-urgent training jobs, experiments
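The checkpoint-and-resume pattern can be sketched with the stdlib (a local file standing in for the S3 checkpoint path; all names hypothetical):

```python
import json
import tempfile
from pathlib import Path

def train(total_epochs: int, ckpt_path: Path) -> int:
    """Run epochs, resuming from the last checkpoint if one exists (spot interruption safety)."""
    start = 0
    if ckpt_path.exists():
        start = json.loads(ckpt_path.read_text())["epoch"] + 1
    for epoch in range(start, total_epochs):
        # ... one epoch of real training would run here ...
        ckpt_path.write_text(json.dumps({"epoch": epoch}))  # persist progress (to S3 in real jobs)
    return start

ckpt = Path(tempfile.mkdtemp()) / "checkpoint.json"
first_start = train(5, ckpt)                     # fresh run: starts at epoch 0
ckpt.write_text(json.dumps({"epoch": 2}))        # simulate a spot interruption after epoch 2
resumed_start = train(5, ckpt)                   # picks up at epoch 3 instead of restarting
```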

SageMaker — Training Modes

1. Built-in Algorithms

  • Pre-packaged algorithms, no code required
  • Highly optimised for AWS infrastructure (distributed, GPU-enabled)
  • Input: specific formats (CSV, RecordIO, Parquet)

2. Script Mode

  • Bring your own training script (Python)
  • AWS manages the framework container (TensorFlow, PyTorch, SKLearn, MXNet, XGBoost)
  • Most common pattern in practice
```python
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point='train.py',
    role=role,
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    framework_version='2.0',
    py_version='py39',
    hyperparameters={'epochs': 10, 'lr': 0.001}
)
estimator.fit({'training': 's3://bucket/data/'})
```

3. BYOC (Bring Your Own Container)

  • Full control — custom Docker image pushed to ECR
  • Use when: unsupported framework, complex dependencies, custom runtime
  • Must implement: /opt/ml/input/, /opt/ml/model/, /opt/ml/output/ paths
  • Entry point: /opt/ml/code/train
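A minimal sketch of a BYOC train entry point honouring that directory contract (the base path is parameterised here so the sketch is runnable; a real container uses /opt/ml directly):

```python
import json
import tempfile
from pathlib import Path

def run_training(base: Path) -> Path:
    """Sketch of a BYOC train entry point following the /opt/ml directory contract."""
    hp_file = base / "input" / "config" / "hyperparameters.json"
    hyperparams = json.loads(hp_file.read_text()) if hp_file.exists() else {}
    data_dir = base / "input" / "data" / "training"   # one subdirectory per input channel
    # ... real training would read files under data_dir here ...
    model_dir = base / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    artefact = model_dir / "model.json"               # contents of model/ become model.tar.gz
    artefact.write_text(json.dumps({"trained_with": hyperparams}))
    return artefact

base = Path(tempfile.mkdtemp())                       # stands in for /opt/ml
(base / "input" / "config").mkdir(parents=True)
(base / "input" / "config" / "hyperparameters.json").write_text('{"epochs": "10"}')
artifact = run_training(base)
```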

4. SageMaker JumpStart

  • Pre-trained models from Hugging Face, TensorFlow Hub, PyTorch Hub
  • One-click fine-tune or deploy
  • Models: BERT, GPT-2, Stable Diffusion, Llama, etc.
  • Also includes Solution Templates (end-to-end ML solutions)

SageMaker Built-in Algorithms

Tabular / Classification / Regression

| Algorithm | Type | Input Format | Notes |
|---|---|---|---|
| Linear Learner | Classification + Regression | RecordIO, CSV | Fast, interpretable baseline; normalise input |
| XGBoost | Classification + Regression | CSV, LibSVM, Parquet | Best for tabular; most-used in exams |
| K-Nearest Neighbours (KNN) | Classification + Regression | RecordIO, CSV | Lazy learner; good for recommendation |
| Factorisation Machines | Classification + Regression | RecordIO (sparse) | Sparse data, click prediction, recommendations |

Clustering / Dimensionality Reduction

| Algorithm | Type | Notes |
|---|---|---|
| K-Means | Clustering | Specify K; uses a modified Lloyd's algorithm |
| PCA | Dimensionality Reduction | Two modes: regular (sparse data, moderate size), randomised (large datasets, approximate) |
| LDA (Latent Dirichlet Allocation) | Topic Modelling | Unsupervised; finds topics in text documents |
| Neural Topic Model (NTM) | Topic Modelling | Neural network-based; alternative to LDA |
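One iteration of Lloyd's algorithm in 1-D, as a toy illustration of what K-Means does internally (not SageMaker's optimised implementation):

```python
from typing import List

def lloyd_step(points: List[float], centroids: List[float]) -> List[float]:
    """One Lloyd iteration: assign each point to its nearest centroid, then recompute means."""
    clusters: List[List[float]] = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Empty clusters keep their old centroid
    return [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]

points = [1.0, 2.0, 10.0, 11.0]
centroids = [0.0, 12.0]
updated = lloyd_step(points, centroids)   # centroids move toward the cluster means
```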

Anomaly Detection

| Algorithm | Notes |
|---|---|
| Random Cut Forest (RCF) | Unsupervised anomaly detection; streaming data; assigns an anomaly score |
| IP Insights | Detects unusual IP address usage patterns; fraud/security |

NLP / Text

| Algorithm | Type | Notes |
|---|---|---|
| BlazingText | Text Classification + Word2Vec | Two modes: supervised (classification), unsupervised (embeddings) |
| Seq2Seq | Sequence to Sequence | Translation, summarisation; needs tokenised data |
| Object2Vec | Embeddings | Generalised embeddings for pairs (user–item, sentence pairs) |

Computer Vision

| Algorithm | Type | Notes |
|---|---|---|
| Image Classification | Multi-class image classification | ResNet-based; full training or transfer learning |
| Object Detection | Bounding box detection | SSD with VGG/ResNet backbone |
| Semantic Segmentation | Pixel-level classification | FCN, PSPNet, DeepLab V3 |

Time Series

| Algorithm | Notes |
|---|---|
| DeepAR | Probabilistic time series forecasting; trains across multiple related time series; outputs confidence intervals |

Quick Algorithm Selection Guide

Structured/tabular data?
  → XGBoost (classification/regression)
  → Linear Learner (fast baseline)
  → KNN (similarity-based)

Text data?
  → BlazingText (classification or word vectors)
  → LDA / NTM (topic discovery)

Image data?
  → Image Classification (label whole image)
  → Object Detection (find objects, bounding boxes)
  → Semantic Segmentation (pixel-level labels)

Time series?
  → DeepAR (multiple related series, probabilistic)

Anomaly detection?
  → RCF (tabular/time series anomalies)
  → IP Insights (network anomalies)

Sparse/recommendation?
  → Factorisation Machines
  → KNN

Clustering?
  → K-Means

Reduce dimensions?
  → PCA
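The guide above can be condensed into a lookup table (illustrative only, not exhaustive):

```python
def pick_algorithm(data_type: str, task: str) -> str:
    """Encode the decision guide as a simple lookup (hypothetical helper for revision)."""
    guide = {
        ("tabular", "classification"): "XGBoost",
        ("tabular", "baseline"): "Linear Learner",
        ("text", "classification"): "BlazingText",
        ("text", "topics"): "LDA / NTM",
        ("image", "classification"): "Image Classification",
        ("image", "detection"): "Object Detection",
        ("time-series", "forecasting"): "DeepAR",
        ("any", "anomaly"): "Random Cut Forest",
        ("sparse", "recommendation"): "Factorisation Machines",
        ("tabular", "clustering"): "K-Means",
        ("tabular", "dimensionality"): "PCA",
    }
    return guide.get((data_type, task), "no built-in match: consider Script Mode or BYOC")
```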

SageMaker Experiments

  • Track and compare training runs — hyperparameters, metrics, artefacts
  • Experiment → Trial → Trial Components (training job, processing job)
  • Auto-captured when using SageMaker Training Jobs + Studio
  • Compare runs in a table/chart inside Studio

SageMaker Hyperparameter Tuning (HPO)

Automatically finds the best hyperparameter combination.

Search Strategies

| Strategy | How | Use When |
|---|---|---|
| Bayesian Optimisation | Uses past results to pick the next candidate smartly | Default; efficient for expensive training |
| Random Search | Random sampling across parameter ranges | Simple, good baseline |
| Grid Search | Tries all combinations | Only when the search space is tiny |
| Hyperband | Runs many jobs briefly, promotes promising ones | Fast, resource-efficient |

Configuration

```python
from sagemaker.tuner import HyperparameterTuner, IntegerParameter, ContinuousParameter

tuner = HyperparameterTuner(
    estimator=estimator,
    objective_metric_name='validation:rmse',
    objective_type='Minimize',
    hyperparameter_ranges={
        'num_round': IntegerParameter(50, 500),
        'eta': ContinuousParameter(0.01, 0.3),
        'max_depth': IntegerParameter(3, 10)
    },
    max_jobs=20,
    max_parallel_jobs=5
)
```

Best Practices

  • Use Bayesian for continuous parameters, small budget
  • Set max_parallel_jobs < max_jobs (Bayesian needs sequential feedback)
  • Use warm starting — re-use results from previous tuning jobs
  • Tune on validation metric, not training metric
  • Start with wide ranges → narrow after first run
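For intuition, random search is only a few lines; each objective call stands in for a full training job (toy objective, hypothetical names):

```python
import random

def random_search(objective, ranges, max_jobs: int, seed: int = 0):
    """Toy random-search HPO: sample each range uniformly, keep the best candidate."""
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(max_jobs):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}
        score = objective(params)      # in SageMaker this would be a full training job
        if score < best_score:         # 'Minimize' objective
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective with its minimum at eta=0.1, max_depth=6
objective = lambda p: (p["eta"] - 0.1) ** 2 + (p["max_depth"] - 6) ** 2
best, score = random_search(objective, {"eta": (0.01, 0.3), "max_depth": (3, 10)}, max_jobs=50)
```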

SageMaker Processing Jobs

  • Run data preprocessing, feature engineering, and model evaluation as managed jobs
  • Uses SKLearnProcessor, PySparkProcessor, FrameworkProcessor
  • Separates processing from training — cleaner pipeline
```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor

processor = SKLearnProcessor(
    framework_version='1.0-1',
    role=role,
    instance_type='ml.m5.xlarge',
    instance_count=1
)

processor.run(
    code='preprocessing.py',
    inputs=[ProcessingInput(source='s3://bucket/raw/', destination='/opt/ml/processing/input')],
    outputs=[ProcessingOutput(source='/opt/ml/processing/output', destination='s3://bucket/processed/')]
)
```

SageMaker Model Registry

  • Version and manage models — track lineage from training to deployment
  • Model Group → container for model versions
  • Model Package → a specific version with metadata (metrics, training data, algorithm)
  • Approval workflow: Pending → Approved → Rejected
  • Trigger auto-deployment pipeline when status → Approved
  • Tracks: training job, evaluation metrics, inference image, artefact location
```mermaid
graph LR
    TJ[Training Job] --> MP[Model Package<br/>Pending]
    MP -->|Review| A[Approved]
    MP -->|Review| R[Rejected]
    A -->|CI/CD trigger| Deploy[SageMaker Endpoint]
```

SageMaker Clarify — Explainability

Post-training explainability using SHAP (SHapley Additive exPlanations):

  • Global explanations — which features matter most overall
  • Local explanations — why the model made a specific prediction
  • Works with tabular, NLP, and CV models
  • Outputs feature importance report + partial dependence plots
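SHAP's underlying idea, the Shapley value, averages each feature's marginal contribution over all feature orderings. An exact brute-force version for a tiny model (toy code, exponential in feature count, not what Clarify runs):

```python
from itertools import permutations
from typing import Callable, Dict

def shapley_values(predict: Callable[[Dict[str, float]], float],
                   instance: Dict[str, float],
                   baseline: Dict[str, float]) -> Dict[str, float]:
    """Exact Shapley values: average each feature's marginal contribution over all orderings."""
    features = list(instance)
    contrib = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        current = dict(baseline)           # start from the baseline input
        prev = predict(current)
        for f in order:
            current[f] = instance[f]       # reveal this feature's actual value
            now = predict(current)
            contrib[f] += now - prev       # marginal contribution in this ordering
            prev = now
    return {f: c / len(orderings) for f, c in contrib.items()}

# Toy linear model: attributions are exactly weight * (value - baseline)
predict = lambda x: 2.0 * x["income"] + 1.0 * x["age"]
phi = shapley_values(predict, {"income": 3.0, "age": 4.0}, {"income": 0.0, "age": 0.0})
```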

Model Evaluation in SageMaker

SageMaker Model Evaluation Step (Pipelines)

```python
# In SageMaker Pipelines
from sagemaker.workflow.steps import ProcessingStep

evaluation_step = ProcessingStep(
    name='EvaluateModel',
    processor=script_processor,
    inputs=[model_output, test_data],   # ProcessingInput objects
    outputs=[evaluation_output],        # ProcessingOutput objects
    code='evaluate.py'
)
```

Offline Evaluation

  • Use a Processing Job to run evaluation script on test set
  • Output: evaluation.json with metrics
  • Feed into Model Registry as model metadata
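A sketch of what an evaluate.py might write (the JSON schema here is illustrative, not a required format):

```python
import json
import tempfile
from pathlib import Path

def write_evaluation_report(y_true, y_pred, out_dir: Path) -> Path:
    """Emit an evaluation.json the way an evaluate.py processing script might."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    report = {
        "binary_classification_metrics": {
            "accuracy": {"value": correct / len(y_true)}
        }
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "evaluation.json"
    path.write_text(json.dumps(report))
    return path

path = write_evaluation_report([1, 0, 1, 1], [1, 0, 0, 1], Path(tempfile.mkdtemp()))
```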

A/B Testing / Shadow Testing

  • Deploy two model versions to the same endpoint
  • Route % traffic to each
  • Compare metrics before fully switching → see [[03 - Deployment & MLOps]]
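Weighted traffic splitting between variants boils down to weighted sampling; a toy router (hypothetical names, not the SageMaker endpoint API):

```python
import random

def route(variant_weights, rng):
    """Pick a variant with probability proportional to its traffic weight."""
    variants, weights = zip(*variant_weights.items())
    return rng.choices(variants, weights=weights, k=1)[0]

# Send 90% of requests to model-a, 10% to model-b
rng = random.Random(42)
counts = {"model-a": 0, "model-b": 0}
for _ in range(1000):
    counts[route({"model-a": 0.9, "model-b": 0.1}, rng)] += 1
```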

Domain 2 — Exam Scenarios

| Scenario | Answer |
|---|---|
| Best algorithm for tabular classification | XGBoost |
| Fast linear baseline model | Linear Learner |
| Classify images into categories | Image Classification |
| Detect objects with bounding boxes | Object Detection |
| NLP text classification | BlazingText (supervised mode) |
| Word embeddings / word2vec | BlazingText (unsupervised mode) |
| Time series forecasting, multiple series | DeepAR |
| Anomaly detection in streaming data | Random Cut Forest |
| Recommendation system, sparse data | Factorisation Machines |
| Topic modelling on documents | LDA or NTM |
| Find optimal hyperparameters efficiently | SageMaker HPO with Bayesian |
| Save 90% on training cost | Managed Spot Training + checkpointing |
| Use Hugging Face BERT with one click | SageMaker JumpStart |
| Custom ML framework not supported | BYOC (custom Docker in ECR) |
| Version and approve models before deploy | SageMaker Model Registry |
| Explain which features drove a prediction | SageMaker Clarify (SHAP) |
| Track and compare training runs | SageMaker Experiments |
| Large dataset, avoid downloading to disk | Pipe Mode or Fast File Mode |
| Distributed training for large model | SageMaker Model Parallel |
| Distributed training for large dataset | SageMaker Data Parallel |