Domain 3 — Deployment & Orchestration of ML Workflows (22%)
Focus: SageMaker endpoint types, inference patterns, SageMaker Pipelines, CI/CD for ML, and MLOps best practices.
SageMaker Inference — Endpoint Types
```mermaid
graph TD
    Inference[Inference Options] --> RT[Real-time Endpoint<br/>Low latency<br/>Always on]
    Inference --> BT[Batch Transform<br/>Bulk offline<br/>No endpoint]
    Inference --> SL[Serverless Inference<br/>Spiky / infrequent<br/>Scale to zero]
    Inference --> AI[Async Inference<br/>Large payloads<br/>Queue-based]
    Inference --> MMS[Multi-Model Endpoint<br/>Many models, one endpoint<br/>Cost efficient]
    style RT fill:#dcfce7,stroke:#16a34a
    style BT fill:#fef9c3,stroke:#ca8a04
    style SL fill:#fed7aa,stroke:#ea580c
    style AI fill:#f3e8ff,stroke:#9333ea
    style MMS fill:#fce7f3,stroke:#db2777
```
1. Real-time Endpoint
Persistent endpoint, always-on instances
Latency: ms-level
Payload: up to 6 MB
Timeout: 60 seconds
Use: live predictions, user-facing APIs, fraud detection
Auto-scales via Application Auto Scaling (target tracking on the `SageMakerVariantInvocationsPerInstance` metric)
```python
predictor = estimator.deploy(
    initial_instance_count=2,
    instance_type='ml.m5.xlarge'
)
response = predictor.predict(data)
```
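The target-tracking scaling policy attached to an endpoint variant via Application Auto Scaling might look like the following JSON (a sketch — the target value and cooldowns are illustrative assumptions, tuned per workload):

```json
{
  "TargetValue": 1000.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
  },
  "ScaleInCooldown": 300,
  "ScaleOutCooldown": 60
}
```

This configuration is passed to `aws application-autoscaling put-scaling-policy` against the resource `endpoint/<endpoint-name>/variant/<variant-name>`.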
2. Batch Transform
No persistent endpoint — spins up instances, runs inference, shuts down
Input: S3 files → Output: S3 files
Handles large files automatically via SplitType and AssembleWith
Use: offline scoring, pre-computing predictions, large datasets
Most cost-effective for non-real-time bulk inference
```python
transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path='s3://bucket/predictions/'
)
transformer.transform(
    data='s3://bucket/test-data/',
    split_type='Line',
    content_type='text/csv'
)
```
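Conceptually, `SplitType='Line'` chops each input file into records, which are scored in mini-batches and reassembled in the output (`AssembleWith='Line'`). A toy illustration of that record flow — `score` is a stand-in for the model, not SageMaker code:

```python
def batch_transform(lines, score, batch_size=2):
    """Mimic SplitType='Line': split input into records,
    score them in mini-batches, then reassemble the output."""
    outputs = []
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i + batch_size]          # one mini-batch of records
        outputs.extend(score(rec) for rec in batch)
    return "\n".join(outputs)                    # AssembleWith='Line'

result = batch_transform(["1,2", "3,4"], lambda rec: rec + ",scored")
```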
3. Serverless Inference
No instances to manage — scales to zero when idle
Cold start latency (seconds) on first request
Memory: 1024 MB – 6144 MB
Payload: up to 4 MB
Timeout: 60 seconds
Billed per invocation + processing time (no idle cost)
Use: spiky or infrequent traffic, dev/test environments, occasional ad-hoc requests
4. Async Inference
Requests are queued — response is written to S3 asynchronously
Payload: up to 1 GB (largest of all endpoint types)
Timeout: up to 15 minutes
Auto-scales to zero when queue is empty
SNS/S3 notification when complete
Use: large payloads (documents, video), long inference time, variable load
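The async pattern can be pictured as an internal queue drained by a worker that writes each result to an output location. A simplified, synchronous sketch — the dict stands in for the S3 output path:

```python
from collections import deque

def async_inference(requests, model):
    """Queue requests, process them in order, and write each result
    to an 'output store' keyed by request id (stand-in for S3)."""
    queue = deque(requests)   # stand-in for the service's internal queue
    output_store = {}         # stand-in for the S3 output path
    while queue:              # worker drains the queue
        req_id, payload = queue.popleft()
        output_store[req_id] = model(payload)
        # in the real service, an SNS/S3 notification would fire here
    return output_store

results = async_inference([("r1", 2), ("r2", 5)], lambda x: x * 10)
```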
5. Multi-Model Endpoint (MME)
Host thousands of models behind one endpoint
Models loaded/evicted dynamically from S3 into memory (LRU cache)
Reduces cost — no dedicated endpoint per model
Use: per-customer models, A/B variants, many similar models
Supports: built-in algorithms, custom containers
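The load/evict behaviour is essentially an LRU cache keyed by model name. A minimal sketch — `load_model` is a stand-in for downloading and deserialising a model artifact from S3:

```python
from collections import OrderedDict

class ModelCache:
    """LRU cache of loaded models, as a multi-model endpoint keeps in memory."""
    def __init__(self, load_model, capacity=2):
        self.load_model = load_model   # stand-in for "fetch from S3 and load"
        self.capacity = capacity
        self.cache = OrderedDict()
        self.loads = 0                 # count cold loads, for illustration

    def get(self, name):
        if name in self.cache:
            self.cache.move_to_end(name)        # mark as most recently used
        else:
            self.loads += 1                     # cold load from S3
            self.cache[name] = self.load_model(name)
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return self.cache[name]

cache = ModelCache(load_model=lambda name: f"model:{name}", capacity=2)
cache.get("a"); cache.get("b"); cache.get("a")
cache.get("c")   # evicts "b", the least recently used model
```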
6. Multi-Container Endpoint
Run different model types in separate containers on same endpoint
Each container handles different inference steps
Use: ensembles, pre/post-processing pipelines, different frameworks
Endpoint Comparison
| Type | Latency | Max Payload | Cost Model | Use Case |
|---|---|---|---|---|
| Real-time | ms | 6 MB | Per-hour | Live predictions |
| Batch Transform | Minutes | Unlimited | Per-job | Bulk offline scoring |
| Serverless | ms–seconds | 4 MB | Per-invocation | Sporadic traffic |
| Async | Seconds–minutes | 1 GB | Per-hour + queue | Large payloads |
| Multi-Model | ms + cold start | 6 MB | Per-hour (shared) | Many models, cost savings |
Inference Optimisation
SageMaker Inference Recommender
Benchmarks your model across instance types → recommends best cost/performance
Default job: 45-minute benchmark across SageMaker-recommended instances
Advanced job: custom load test against your specific traffic pattern
SageMaker Neo
Compile and optimise models for specific hardware targets
Reduces model size, increases inference speed (up to 25×)
Targets: cloud (ml.c5, ml.p3), edge (Jetson, Raspberry Pi, Greengrass)
Supports: TensorFlow, PyTorch, MXNet, ONNX, XGBoost
Elastic Inference (EI)
Attach fractional GPU (for inference only) to CPU instance
Lower cost than full GPU instance for inference workloads
Deprecated: AWS now recommends Inferentia (inf1/inf2) or right-sized instances instead
AWS Inferentia / Trainium
Inferentia: AWS-custom chip optimised for inference — up to 70% cost reduction
Trainium: AWS-custom chip for training — most cost-efficient for large models
Use with inf1, inf2, trn1 instance types
SageMaker Pipelines — MLOps
Automated, repeatable ML workflows defined as a DAG of steps.
```mermaid
graph LR
    PP[Processing Step<br/>Data prep] --> TS[Training Step<br/>Train model]
    TS --> ES[Evaluation Step<br/>Run metrics]
    ES -->|metrics pass| RS[Register Step<br/>Model Registry]
    ES -->|metrics fail| Fail[Fail Step]
    RS --> DS[Deploy Step]
    style PP fill:#dbeafe,stroke:#3b82f6
    style TS fill:#dcfce7,stroke:#16a34a
    style ES fill:#fef9c3,stroke:#ca8a04
    style RS fill:#f3e8ff,stroke:#9333ea
```
Pipeline Steps
| Step | Purpose |
|---|---|
| ProcessingStep | Data preprocessing, feature engineering, evaluation |
| TrainingStep | Train a model |
| TuningStep | Hyperparameter tuning |
| TransformStep | Batch transform / inference |
| RegisterModel | Register to Model Registry |
| CreateModel | Create deployable model |
| ConditionStep | Branching logic (if/else) based on metrics |
| LambdaStep | Run Lambda function |
| FailStep | Stop pipeline with error message |
Pipeline Parameters
```python
from sagemaker.workflow.parameters import ParameterString, ParameterFloat

training_instance_type = ParameterString(
    name='TrainingInstanceType',
    default_value='ml.m5.xlarge'
)
accuracy_threshold = ParameterFloat(name='AccuracyThreshold', default_value=0.85)
```
Pipeline Caching
Steps can be cached — if input data + config unchanged, skip re-running
Reduces cost and time in iterative development
Set `cache_config=CacheConfig(enable_caching=True, expire_after='PT1H')`
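The caching idea boils down to fingerprinting a step's inputs and configuration; an identical fingerprint means the earlier result is reused. A toy version of that mechanism (not the SDK's actual implementation):

```python
import hashlib
import json

_results = {}   # cache of previous step runs, keyed by fingerprint

def run_step(step_fn, inputs, config):
    """Re-run a pipeline step only if its inputs + config changed."""
    key = hashlib.sha256(
        json.dumps({"inputs": inputs, "config": config}, sort_keys=True).encode()
    ).hexdigest()
    if key not in _results:          # cache miss -> actually run the step
        _results[key] = step_fn(inputs)
    return _results[key]

calls = []
def train(inputs):
    calls.append(inputs)             # record each real execution
    return sum(inputs)

first = run_step(train, [1, 2], {"instance": "ml.m5.xlarge"})
second = run_step(train, [1, 2], {"instance": "ml.m5.xlarge"})  # cache hit
```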
CI/CD for ML
```mermaid
graph LR
    Code[Code Push<br/>GitHub] --> CB[CodeBuild<br/>Test + Build]
    CB --> CP[CodePipeline<br/>Orchestrate]
    CP --> SM[SageMaker Pipeline<br/>Train + Evaluate]
    SM -->|Approved| Deploy[Deploy to Staging]
    Deploy -->|Tests pass| Prod[Deploy to Production]
    style Code fill:#dbeafe,stroke:#3b82f6
    style SM fill:#dcfce7,stroke:#16a34a
    style Prod fill:#fed7aa,stroke:#ea580c
```
Key Services
CodePipeline — orchestrates the CI/CD pipeline
CodeBuild — runs tests, builds containers, pushes to ECR
CodeCommit / GitHub / Bitbucket — source control
EventBridge — trigger pipeline on: new data in S3, model approval, schedule
Lambda — lightweight automation steps
MLOps Maturity Levels
| Level | Description |
|---|---|
| Level 0 | Manual process — no automation |
| Level 1 | Automated training pipeline, manual deployment |
| Level 2 | Fully automated CI/CD for training + deployment |
Model Deployment Strategies
Blue/Green Deployment
Deploy new version alongside old version
Shift traffic from old (blue) → new (green) gradually
SageMaker supports traffic shifting on endpoints via production variants
Easy rollback — switch traffic back to blue
```python
predictor.update_endpoint(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge'
)
```
Canary Deployment
Send small % of traffic (e.g. 5%) to new model
Monitor metrics before scaling up
SageMaker endpoint variant weights: {"variant-A": 0.95, "variant-B": 0.05}
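Variant weights translate into probabilistic request routing. A minimal sketch of weight-based traffic splitting — the seed is fixed only to make the example deterministic:

```python
import random

def route(variants, weights, n, seed=0):
    """Distribute n requests across variants according to traffic weights."""
    rng = random.Random(seed)
    counts = {v: 0 for v in variants}
    for _ in range(n):
        chosen = rng.choices(variants, weights=weights, k=1)[0]
        counts[chosen] += 1
    return counts

counts = route(["variant-A", "variant-B"], [0.95, 0.05], n=10_000)
share_b = counts["variant-B"] / 10_000   # close to the 5% canary weight
```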
A/B Testing
Route traffic to two model variants simultaneously
Measure business metrics (click-through, conversion)
SageMaker Endpoint Production Variants — each variant = different model/instance
Shadow Testing (Challenger Model)
New model receives same traffic as production model
Shadow model's predictions are not served to users — only logged
Compare shadow vs production metrics without risk
SageMaker supports shadow variants natively
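The key property: both models see the same traffic, but only the production answer is served; the shadow answer is merely logged for offline comparison. A stub illustration with placeholder models:

```python
def serve(request, production_model, shadow_model, shadow_log):
    """Serve the production prediction; log the shadow prediction only."""
    prod_pred = production_model(request)
    shadow_log.append((request, shadow_model(request)))  # logged, never served
    return prod_pred

log = []
answer = serve(
    3,
    production_model=lambda x: x + 1,   # placeholder production model
    shadow_model=lambda x: x * 2,       # placeholder challenger model
    shadow_log=log,
)
```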
Lambda + SageMaker Patterns
Serverless Inference with Lambda
API Gateway → Lambda → SageMaker Endpoint → Lambda → Response
Lambda calls sagemaker-runtime.invoke_endpoint()
Good for: API-backed inference, custom auth/pre/post-processing
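A sketch of such a handler. In a real Lambda the `runtime` argument would be `boto3.client('sagemaker-runtime')`; here it is injected so the logic runs without AWS, and `ENDPOINT_NAME` is a placeholder:

```python
import json

ENDPOINT_NAME = "my-endpoint"   # placeholder endpoint name

def handler(event, context, runtime):
    """API Gateway -> Lambda -> SageMaker endpoint -> response."""
    payload = json.dumps(json.loads(event["body"]))   # pre-processing hook
    resp = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=payload,
    )
    prediction = json.loads(resp["Body"].read())      # post-processing hook
    return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}

# Fake runtime client so the handler can be exercised locally
class FakeBody:
    def read(self):
        return b"[0.9]"

class FakeRuntime:
    def invoke_endpoint(self, **kwargs):
        return {"Body": FakeBody()}

out = handler({"body": '{"features": [1, 2]}'}, None, FakeRuntime())
```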
Lambda for Lightweight Models
For very simple models (rule-based, tiny sklearn) — run model inside Lambda with no SageMaker
Limitations: 10 GB memory, 15 min timeout, cold start
Step Functions for ML
Visual workflow orchestration alternative to SageMaker Pipelines
Pre-built Data Science SDK for Step Functions with SageMaker integration
Use when: need more complex orchestration, mix SageMaker + non-SageMaker steps
States: Task, Choice, Parallel, Wait, Fail, Succeed
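A minimal Amazon States Language sketch of a train-then-branch workflow; the state names and threshold are illustrative, and the `.sync` resource is the Step Functions service integration that waits for the training job to finish:

```json
{
  "StartAt": "TrainModel",
  "States": {
    "TrainModel": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
      "Next": "CheckAccuracy"
    },
    "CheckAccuracy": {
      "Type": "Choice",
      "Choices": [
        {"Variable": "$.accuracy", "NumericGreaterThanEquals": 0.85, "Next": "Done"}
      ],
      "Default": "AccuracyTooLow"
    },
    "Done": {"Type": "Succeed"},
    "AccuracyTooLow": {"Type": "Fail", "Error": "AccuracyBelowThreshold"}
  }
}
```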
Amazon EventBridge for ML Automation
| Event Source | Trigger Action |
|---|---|
| S3 new object | Start Glue ETL → SageMaker Pipeline |
| SageMaker Training complete | Lambda → evaluate → register model |
| Model Registry status → Approved | CodePipeline → deploy to staging |
| CloudWatch alarm (drift detected) | Trigger retraining pipeline |
| Schedule (cron) | Weekly retraining pipeline |
Domain 3 — Exam Scenarios
| Scenario | Answer |
|---|---|
| Live predictions, low latency | Real-time Endpoint |
| Score 10M records overnight | Batch Transform |
| Infrequent predictions, pay-per-use | Serverless Inference |
| Large document inference, 500 MB payload | Async Inference |
| 1000 customer-specific models, cost-efficient | Multi-Model Endpoint |
| Test new model without user impact | Shadow Testing (Challenger variant) |
| Gradually shift traffic to new model | Canary / Blue-Green deployment |
| Automate train → evaluate → deploy pipeline | SageMaker Pipelines |
| Trigger retraining when new data arrives in S3 | EventBridge → SageMaker Pipeline |
| Optimise model for edge device | SageMaker Neo |
| Cheapest GPU option for inference | Elastic Inference or Inferentia (inf1/inf2) |
| Find best instance type for inference | SageMaker Inference Recommender |
| Version control + approval for models | SageMaker Model Registry |
| Reuse unchanged pipeline steps | SageMaker Pipeline Caching |
| CI/CD for ML on AWS | CodePipeline + CodeBuild + SageMaker Pipelines |