Domain 3 — Deployment & Orchestration of ML Workflows (22%)


Focus: SageMaker endpoint types, inference patterns, SageMaker Pipelines, CI/CD for ML, and MLOps best practices.


SageMaker Inference — Endpoint Types

graph TD
    Inference[Inference Options] --> RT[Real-time Endpoint<br/>Low latency<br/>Always on]
    Inference --> BT[Batch Transform<br/>Bulk offline<br/>No endpoint]
    Inference --> SL[Serverless Inference<br/>Spiky / infrequent<br/>Scale to zero]
    Inference --> AI[Async Inference<br/>Large payloads<br/>Queue-based]
    Inference --> MMS[Multi-Model Endpoint<br/>Many models, one endpoint<br/>Cost efficient]

    style RT fill:#dcfce7,stroke:#16a34a
    style BT fill:#fef9c3,stroke:#ca8a04
    style SL fill:#fed7aa,stroke:#ea580c
    style AI fill:#f3e8ff,stroke:#9333ea
    style MMS fill:#fce7f3,stroke:#db2777

1. Real-time Endpoint

  • Persistent endpoint, always-on instances
  • Latency: ms-level
  • Payload: up to 6 MB
  • Timeout: 60 seconds
  • Use: live predictions, user-facing APIs, fraud detection
  • Auto-scales via Application Auto Scaling (target-track the SageMakerVariantInvocationsPerInstance metric)
predictor = estimator.deploy(
    initial_instance_count=2,   # min 2 for HA across AZs
    instance_type='ml.m5.xlarge'
)
response = predictor.predict(data)
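
The auto-scaling bullet above can be wired up with the Application Auto Scaling API. A minimal sketch, assuming a deployed endpoint with the default `AllTraffic` variant — the endpoint name, capacity bounds, and target value are illustrative:

```python
def scaling_resource_id(endpoint_name, variant_name='AllTraffic'):
    # Application Auto Scaling addresses a variant as endpoint/<name>/variant/<variant>
    return f'endpoint/{endpoint_name}/variant/{variant_name}'

def attach_invocation_scaling(endpoint_name, min_capacity=2, max_capacity=10,
                              target_invocations=70.0):
    """Target-track invocations per instance so the variant scales with load."""
    import boto3  # local import: the helper above stays usable without AWS access
    client = boto3.client('application-autoscaling')
    resource_id = scaling_resource_id(endpoint_name)
    client.register_scalable_target(
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    client.put_scaling_policy(
        PolicyName=f'{endpoint_name}-invocations-target',
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': target_invocations,  # invocations per instance per minute
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
            },
        },
    )
```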

2. Batch Transform

  • No persistent endpoint — spins up instances, runs inference, shuts down
  • Input: S3 files → Output: S3 files
  • Handles large files automatically via SplitType and AssembleWith
  • Use: offline scoring, pre-computing predictions, large datasets
  • Most cost-effective for non-real-time bulk inference
transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path='s3://bucket/predictions/'
)
transformer.transform(
    data='s3://bucket/test-data/',
    split_type='Line',
    content_type='text/csv'
)

3. Serverless Inference

  • No instances to manage — scales to zero when idle
  • Cold start latency (seconds) on first request
  • Memory: 1024 MB – 6144 MB
  • Payload: up to 4 MB
  • Timeout: 60 seconds
  • Billed per invocation + processing time (no idle cost)
  • Use: spiky/infrequent traffic, dev/test environments, occasional batch-style requests
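
A sketch of how this is configured with the SageMaker Python SDK — the memory and concurrency values are illustrative:

```python
VALID_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)

def validate_serverless_memory(memory_mb):
    # Serverless endpoints accept memory only in 1 GB steps from 1024 to 6144 MB
    if memory_mb not in VALID_MEMORY_MB:
        raise ValueError(f'memory must be one of {VALID_MEMORY_MB}, got {memory_mb}')
    return memory_mb

def deploy_serverless(model, memory_mb=2048, max_concurrency=5):
    """Deploy a sagemaker Model behind a serverless endpoint."""
    from sagemaker.serverless import ServerlessInferenceConfig
    config = ServerlessInferenceConfig(
        memory_size_in_mb=validate_serverless_memory(memory_mb),
        max_concurrency=max_concurrency,  # concurrent invocations before throttling
    )
    return model.deploy(serverless_inference_config=config)
```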

4. Async Inference

  • Requests are queued — response is written to S3 asynchronously
  • Payload: up to 1 GB (largest of all endpoint types)
  • Timeout: up to 15 minutes
  • Auto-scales to zero when queue is empty
  • SNS/S3 notification when complete
  • Use: large payloads (documents, video), long inference time, variable load
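
A minimal deployment sketch with the SageMaker Python SDK, assuming a bucket and SNS topic you own (names are placeholders):

```python
def sns_notification_config(success_topic, error_topic=None):
    # SNS topics SageMaker notifies when an async inference finishes
    return {'SuccessTopic': success_topic,
            'ErrorTopic': error_topic or success_topic}

def deploy_async(model, output_bucket, success_topic):
    """Deploy behind an async endpoint; responses land in S3, not the HTTP reply."""
    from sagemaker.async_inference import AsyncInferenceConfig
    config = AsyncInferenceConfig(
        output_path=f's3://{output_bucket}/async-results/',  # where results land
        notification_config=sns_notification_config(success_topic),
    )
    return model.deploy(initial_instance_count=1,
                        instance_type='ml.m5.xlarge',
                        async_inference_config=config)

# Invocation returns immediately; the result is fetched from S3 later:
# resp = predictor.predict_async(input_path='s3://bucket/payloads/doc.json')
```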

5. Multi-Model Endpoint (MME)

  • Host thousands of models behind one endpoint
  • Models loaded/evicted dynamically from S3 into memory (LRU cache)
  • Reduces cost — no dedicated endpoint per model
  • Use: per-customer models, A/B variants, many similar models
  • Supports: built-in algorithms, custom containers
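
The per-customer pattern can be sketched with the SDK's MultiDataModel — the S3 prefix and artifact naming scheme here are assumptions:

```python
def target_model_key(customer_id):
    # Artifact name resolved relative to the endpoint's model_data_prefix
    return f'customer-{customer_id}.tar.gz'

def deploy_multi_model(base_model, model_prefix):
    """One endpoint serving every model artifact under an S3 prefix."""
    from sagemaker.multidatamodel import MultiDataModel
    mme = MultiDataModel(
        name='customer-models',
        model_data_prefix=model_prefix,  # e.g. 's3://bucket/models/'
        model=base_model,                # shared container/framework for all models
    )
    return mme.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

# SageMaker lazily loads the named artifact from S3 into an in-memory LRU cache:
# predictor.predict(payload, target_model=target_model_key(42))
```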

6. Multi-Container Endpoint

  • Run different model types in separate containers on same endpoint
  • Each container handles different inference steps
  • Use: ensembles, pre/post-processing pipelines, different frameworks

Endpoint Comparison

| Type | Latency | Max Payload | Cost Model | Use Case |
|---|---|---|---|---|
| Real-time | ms | 6 MB | Per-hour | Live predictions |
| Batch Transform | Minutes | Unlimited | Per-job | Bulk offline scoring |
| Serverless | ms–seconds | 4 MB | Per-invocation | Sporadic traffic |
| Async | Seconds–minutes | 1 GB | Per-hour + queue | Large payloads |
| Multi-Model | ms + cold start | 6 MB | Per-hour (shared) | Many models, cost savings |

Inference Optimisation

SageMaker Inference Recommender

  • Benchmarks your model across instance types → recommends best cost/performance
  • Default job: 45-minute benchmark across SageMaker-recommended instances
  • Advanced job: custom load test against your specific traffic pattern
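
A job can be launched against a registered model package via boto3 — a sketch where the job name and ARNs are placeholders:

```python
def recommender_job_type(advanced):
    # 'Default' runs the ~45-minute benchmark; 'Advanced' runs a custom load test
    return 'Advanced' if advanced else 'Default'

def start_recommender_job(role_arn, model_package_arn, advanced=False):
    """Launch an Inference Recommender job for a registered model package."""
    import boto3
    return boto3.client('sagemaker').create_inference_recommendations_job(
        JobName='xgb-instance-recommendation',
        JobType=recommender_job_type(advanced),
        RoleArn=role_arn,
        InputConfig={'ModelPackageVersionArn': model_package_arn},
    )
```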

SageMaker Neo

  • Compile and optimise models for specific hardware targets
  • Reduces model size, increases inference speed (up to 25×)
  • Targets: cloud (ml.c5, ml.p3), edge (Jetson, Raspberry Pi, Greengrass)
  • Supports: TensorFlow, PyTorch, MXNet, ONNX, XGBoost

Elastic Inference (EI)

  • Attach fractional GPU (for inference only) to CPU instance
  • Lower cost than full GPU instance for inference workloads
  • Now deprecated by AWS — superseded by Inferentia-based instances (inf1/inf2)

AWS Inferentia / Trainium

  • Inferentia: AWS-custom chip optimised for inference — up to 70% cost reduction
  • Trainium: AWS-custom chip for training — most cost-efficient for large models
  • Use with inf1, inf2, trn1 instance types

SageMaker Pipelines — MLOps

Automated, repeatable ML workflows with DAG-based steps.

graph LR
    PP[Processing Step<br/>Data prep] --> TS[Training Step<br/>Train model]
    TS --> ES[Evaluation Step<br/>Run metrics]
    ES -->|metrics pass| RS[Register Step<br/>Model Registry]
    ES -->|metrics fail| Fail[Fail Step]
    RS --> DS[Deploy Step]

    style PP fill:#dbeafe,stroke:#3b82f6
    style TS fill:#dcfce7,stroke:#16a34a
    style ES fill:#fef9c3,stroke:#ca8a04
    style RS fill:#f3e8ff,stroke:#9333ea

Pipeline Steps

| Step | Purpose |
|---|---|
| ProcessingStep | Data preprocessing, feature engineering, evaluation |
| TrainingStep | Train a model |
| TuningStep | Hyperparameter tuning |
| TransformStep | Batch transform / inference |
| RegisterModel | Register to Model Registry |
| CreateModel | Create deployable model |
| ConditionStep | Branching logic (if/else) based on metrics |
| LambdaStep | Run a Lambda function |
| FailStep | Stop pipeline with an error message |
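
The ConditionStep branching can be sketched as below, assuming the evaluation step writes a property file with the metric — step names and the JSON path are illustrative:

```python
def metric_json_path(metric='accuracy'):
    # Path into the evaluation report the evaluation ProcessingStep writes
    return f'metrics.{metric}.value'

def accuracy_gate(eval_step, evaluation_report, register_step, fail_step,
                  threshold=0.85):
    """Register the model only when evaluated accuracy clears the threshold."""
    from sagemaker.workflow.condition_step import ConditionStep
    from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
    from sagemaker.workflow.functions import JsonGet
    condition = ConditionGreaterThanOrEqualTo(
        left=JsonGet(step_name=eval_step.name,
                     property_file=evaluation_report,  # PropertyFile of eval_step
                     json_path=metric_json_path()),
        right=threshold,
    )
    return ConditionStep(name='CheckAccuracy',
                         conditions=[condition],
                         if_steps=[register_step],
                         else_steps=[fail_step])
```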

Pipeline Parameters

from sagemaker.workflow.parameters import ParameterString, ParameterFloat

training_instance_type = ParameterString(
    name='TrainingInstanceType',
    default_value='ml.m5.xlarge'
)
accuracy_threshold = ParameterFloat(name='AccuracyThreshold', default_value=0.85)

Pipeline Caching

  • Steps can be cached — if input data + config unchanged, skip re-running
  • Reduces cost and time in iterative development
  • Set cache_config=CacheConfig(enable_caching=True, expire_after='PT1H')

CI/CD for ML

graph LR
    Code[Code Push<br/>GitHub] --> CB[CodeBuild<br/>Test + Build]
    CB --> CP[CodePipeline<br/>Orchestrate]
    CP --> SM[SageMaker Pipeline<br/>Train + Evaluate]
    SM -->|Approved| Deploy[Deploy to Staging]
    Deploy -->|Tests pass| Prod[Deploy to Production]

    style Code fill:#dbeafe,stroke:#3b82f6
    style SM fill:#dcfce7,stroke:#16a34a
    style Prod fill:#fed7aa,stroke:#ea580c

Key Services

  • CodePipeline — orchestrates the CI/CD pipeline
  • CodeBuild — runs tests, builds containers, pushes to ECR
  • CodeCommit / GitHub / Bitbucket — source control
  • EventBridge — trigger pipeline on: new data in S3, model approval, schedule
  • Lambda — lightweight automation steps

MLOps Maturity Levels

| Level | Description |
|---|---|
| Level 0 | Manual process — no automation |
| Level 1 | Automated training pipeline, manual deployment |
| Level 2 | Fully automated CI/CD for training + deployment |

Model Deployment Strategies

Blue/Green Deployment

  • Deploy new version alongside old version
  • Shift traffic from old (blue) → new (green) gradually
  • SageMaker supports traffic shifting on endpoints via production variants
  • Easy rollback — switch traffic back to blue
# Shift traffic between two production variants on one endpoint
import boto3
sm = boto3.client('sagemaker')
sm.update_endpoint_weights_and_capacities(
    EndpointName='my-endpoint',
    DesiredWeightsAndCapacities=[
        {'VariantName': 'blue', 'DesiredWeight': 0.2},
        {'VariantName': 'green', 'DesiredWeight': 0.8},
    ]
)

Canary Deployment

  • Send small % of traffic (e.g. 5%) to new model
  • Monitor metrics before scaling up
  • SageMaker endpoint variant weights: {"variant-A": 0.95, "variant-B": 0.05}
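
The weight split can be computed and applied with a small helper — endpoint and variant names are placeholders:

```python
def canary_weights(canary_fraction, prod='variant-A', canary='variant-B'):
    # Split traffic so the canary sees only a small slice
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError('canary_fraction must be between 0 and 1')
    return [
        {'VariantName': prod, 'DesiredWeight': round(1.0 - canary_fraction, 6)},
        {'VariantName': canary, 'DesiredWeight': canary_fraction},
    ]

def apply_weights(endpoint_name, weights):
    """Push new variant weights without redeploying either model."""
    import boto3
    boto3.client('sagemaker').update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=weights,
    )
```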

A/B Testing

  • Route traffic to two model variants simultaneously
  • Measure business metrics (click-through, conversion)
  • SageMaker Endpoint Production Variants — each variant = different model/instance

Shadow Testing (Challenger Model)

  • New model receives same traffic as production model
  • Shadow model's predictions are not served to users — only logged
  • Compare shadow vs production metrics without risk
  • SageMaker supports shadow variants natively
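
A shadow variant is declared in the endpoint config alongside the production variant — a boto3 sketch with placeholder model and config names:

```python
def variant(name, model_name, weight=1.0, instance_type='ml.m5.xlarge'):
    # One production (or shadow) variant definition
    return {'VariantName': name,
            'ModelName': model_name,
            'InitialInstanceCount': 1,
            'InstanceType': instance_type,
            'InitialVariantWeight': weight}

def shadow_endpoint_config(config_name, prod_model, shadow_model):
    """Endpoint config where the shadow model mirrors traffic but is never served."""
    import boto3
    return boto3.client('sagemaker').create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[variant('production', prod_model)],
        # For a shadow variant, the weight is the fraction of traffic mirrored to it
        ShadowProductionVariants=[variant('shadow', shadow_model)],
    )
```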

Lambda + SageMaker Patterns

Serverless Inference with Lambda

API Gateway → Lambda → SageMaker Endpoint → Lambda → Response
  • Lambda calls sagemaker-runtime.invoke_endpoint()
  • Good for: API-backed inference, custom auth/pre/post-processing
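
The flow above fits in a short Lambda handler — a sketch assuming API Gateway proxy integration and a CSV-accepting endpoint named `my-endpoint`:

```python
import json

def api_response(prediction, status=200):
    # Shape the proxy-integration response API Gateway expects back from Lambda
    return {'statusCode': status,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'prediction': prediction})}

def handler(event, context):
    """Lambda behind API Gateway forwarding the request body to SageMaker."""
    import boto3
    runtime = boto3.client('sagemaker-runtime')
    response = runtime.invoke_endpoint(
        EndpointName='my-endpoint',
        ContentType='text/csv',
        Body=event['body'],  # raw request body proxied through by API Gateway
    )
    return api_response(response['Body'].read().decode('utf-8'))
```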

Lambda for Lightweight Models

  • For very simple models (rule-based, tiny sklearn) — run model inside Lambda with no SageMaker
  • Limitations: 10 GB memory, 15 min timeout, cold start

Step Functions for ML

  • Visual workflow orchestration alternative to SageMaker Pipelines
  • Pre-built Data Science SDK for Step Functions with SageMaker integration
  • Use when: need more complex orchestration, mix SageMaker + non-SageMaker steps
  • States: Task, Choice, Parallel, Wait, Fail, Succeed

Amazon EventBridge for ML Automation

| Event Source | Trigger Action |
|---|---|
| S3 new object | Start Glue ETL → SageMaker Pipeline |
| SageMaker Training complete | Lambda → evaluate → register model |
| Model Registry status → Approved | CodePipeline → deploy to staging |
| CloudWatch alarm (drift detected) | Trigger retraining pipeline |
| Schedule (cron) | Weekly retraining pipeline |
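
The first row of the table can be sketched as an EventBridge rule — the bucket and rule names are placeholders, and S3 must have EventBridge notifications enabled:

```python
def s3_object_created_pattern(bucket):
    # Matches the events S3 emits to EventBridge when a new object lands
    return {'source': ['aws.s3'],
            'detail-type': ['Object Created'],
            'detail': {'bucket': {'name': [bucket]}}}

def create_retrain_rule(bucket, rule_name='retrain-on-new-data'):
    """Rule that fires on new S3 objects; a target then starts the pipeline."""
    import json
    import boto3
    events = boto3.client('events')
    events.put_rule(Name=rule_name,
                    EventPattern=json.dumps(s3_object_created_pattern(bucket)))
    # events.put_targets(...) would then point the rule at the SageMaker pipeline
```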

Domain 3 — Exam Scenarios

| Scenario | Answer |
|---|---|
| Live predictions, low latency | Real-time Endpoint |
| Score 10M records overnight | Batch Transform |
| Infrequent predictions, pay-per-use | Serverless Inference |
| Large document inference, 500 MB payload | Async Inference |
| 1000 customer-specific models, cost-efficient | Multi-Model Endpoint |
| Test new model without user impact | Shadow Testing (Challenger variant) |
| Gradually shift traffic to new model | Canary / Blue-Green deployment |
| Automate train → evaluate → deploy pipeline | SageMaker Pipelines |
| Trigger retraining when new data arrives in S3 | EventBridge → SageMaker Pipeline |
| Optimise model for edge device | SageMaker Neo |
| Cheapest GPU option for inference | Elastic Inference or Inferentia (inf1/inf2) |
| Find best instance type for inference | SageMaker Inference Recommender |
| Version control + approval for models | SageMaker Model Registry |
| Reuse unchanged pipeline steps | SageMaker Pipeline Caching |
| CI/CD for ML on AWS | CodePipeline + CodeBuild + SageMaker Pipelines |