Domain 3 — Deployment & Orchestration of ML Workflows (22%)


Focus: SageMaker endpoint types, inference patterns, SageMaker Pipelines, CI/CD for ML, and MLOps best practices.


SageMaker Inference — Endpoint Types

graph TD
    Inference[Inference Options] --> RT[Real-time Endpoint<br/>Low latency<br/>Always on]
    Inference --> BT[Batch Transform<br/>Bulk offline<br/>No endpoint]
    Inference --> SL[Serverless Inference<br/>Spiky / infrequent<br/>Scale to zero]
    Inference --> AI[Async Inference<br/>Large payloads<br/>Queue-based]
    Inference --> MMS[Multi-Model Endpoint<br/>Many models, one endpoint<br/>Cost efficient]

    style RT fill:#dcfce7,stroke:#16a34a
    style BT fill:#fef9c3,stroke:#ca8a04
    style SL fill:#fed7aa,stroke:#ea580c
    style AI fill:#f3e8ff,stroke:#9333ea
    style MMS fill:#fce7f3,stroke:#db2777

1. Real-time Endpoint

  • Persistent endpoint, always-on instances
  • Latency: ms-level
  • Payload: up to 6 MB
  • Timeout: 60 seconds
  • Use: live predictions, user-facing APIs, fraud detection
  • Auto-scales via Application Auto Scaling (target-track the SageMakerVariantInvocationsPerInstance metric)
predictor = estimator.deploy(
    initial_instance_count=2,   # min 2 for HA across AZs
    instance_type='ml.m5.xlarge'
)
response = predictor.predict(data)
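
The auto-scaling bullet above can be wired up with the Application Auto Scaling API. A minimal sketch, assuming a deployed endpoint with the default `AllTraffic` variant — the endpoint name, capacity bounds, and target value are illustrative:

```python
def scaling_resource_id(endpoint_name, variant_name='AllTraffic'):
    # Application Auto Scaling addresses a variant as endpoint/<name>/variant/<variant>
    return f'endpoint/{endpoint_name}/variant/{variant_name}'

def attach_invocation_scaling(endpoint_name, min_capacity=2, max_capacity=10,
                              target_invocations=70.0):
    """Target-track invocations per instance so the variant scales with load."""
    import boto3  # local import: the helper above stays usable without AWS access
    client = boto3.client('application-autoscaling')
    resource_id = scaling_resource_id(endpoint_name)
    client.register_scalable_target(
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        MinCapacity=min_capacity,
        MaxCapacity=max_capacity,
    )
    client.put_scaling_policy(
        PolicyName=f'{endpoint_name}-invocations-target',
        ServiceNamespace='sagemaker',
        ResourceId=resource_id,
        ScalableDimension='sagemaker:variant:DesiredInstanceCount',
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': target_invocations,  # invocations per instance per minute
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
            },
        },
    )
```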

2. Batch Transform

  • No persistent endpoint — spins up instances, runs inference, shuts down
  • Input: S3 files → Output: S3 files
  • Handles large files automatically via SplitType and AssembleWith
  • Use: offline scoring, pre-computing predictions, large datasets
  • Most cost-effective for non-real-time bulk inference
transformer = model.transformer(
    instance_count=1,
    instance_type='ml.m5.xlarge',
    output_path='s3://bucket/predictions/'
)
transformer.transform(
    data='s3://bucket/test-data/',
    split_type='Line',
    content_type='text/csv'
)

3. Serverless Inference

  • No instances to manage — scales to zero when idle
  • Cold start latency (seconds) on first request
  • Memory: 1024 MB – 6144 MB
  • Payload: up to 4 MB
  • Timeout: 60 seconds
  • Billed per invocation + processing time (no idle cost)
  • Use: spiky/infrequent traffic, dev/test environments, occasional batch-style requests
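
A sketch of how this is configured with the SageMaker Python SDK — the memory and concurrency values are illustrative:

```python
VALID_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)

def validate_serverless_memory(memory_mb):
    # Serverless endpoints accept memory only in 1 GB steps from 1024 to 6144 MB
    if memory_mb not in VALID_MEMORY_MB:
        raise ValueError(f'memory must be one of {VALID_MEMORY_MB}, got {memory_mb}')
    return memory_mb

def deploy_serverless(model, memory_mb=2048, max_concurrency=5):
    """Deploy a sagemaker Model behind a serverless endpoint."""
    from sagemaker.serverless import ServerlessInferenceConfig
    config = ServerlessInferenceConfig(
        memory_size_in_mb=validate_serverless_memory(memory_mb),
        max_concurrency=max_concurrency,  # concurrent invocations before throttling
    )
    return model.deploy(serverless_inference_config=config)
```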

4. Async Inference

  • Requests are queued — response is written to S3 asynchronously
  • Payload: up to 1 GB (largest of all endpoint types)
  • Timeout: up to 15 minutes
  • Auto-scales to zero when queue is empty
  • SNS/S3 notification when complete
  • Use: large payloads (documents, video), long inference time, variable load
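
A minimal deployment sketch with the SageMaker Python SDK, assuming a bucket and SNS topic you own (names are placeholders):

```python
def sns_notification_config(success_topic, error_topic=None):
    # SNS topics SageMaker notifies when an async inference finishes
    return {'SuccessTopic': success_topic,
            'ErrorTopic': error_topic or success_topic}

def deploy_async(model, output_bucket, success_topic):
    """Deploy behind an async endpoint; responses land in S3, not the HTTP reply."""
    from sagemaker.async_inference import AsyncInferenceConfig
    config = AsyncInferenceConfig(
        output_path=f's3://{output_bucket}/async-results/',  # where results land
        notification_config=sns_notification_config(success_topic),
    )
    return model.deploy(initial_instance_count=1,
                        instance_type='ml.m5.xlarge',
                        async_inference_config=config)

# Invocation returns immediately; the result is fetched from S3 later:
# resp = predictor.predict_async(input_path='s3://bucket/payloads/doc.json')
```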

5. Multi-Model Endpoint (MME)

  • Host thousands of models behind one endpoint
  • Models loaded/evicted dynamically from S3 into memory (LRU cache)
  • Reduces cost — no dedicated endpoint per model
  • Use: per-customer models, A/B variants, many similar models
  • Supports: built-in algorithms, custom containers
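
The per-customer pattern can be sketched with the SDK's MultiDataModel — the S3 prefix and artifact naming scheme here are assumptions:

```python
def target_model_key(customer_id):
    # Artifact name resolved relative to the endpoint's model_data_prefix
    return f'customer-{customer_id}.tar.gz'

def deploy_multi_model(base_model, model_prefix):
    """One endpoint serving every model artifact under an S3 prefix."""
    from sagemaker.multidatamodel import MultiDataModel
    mme = MultiDataModel(
        name='customer-models',
        model_data_prefix=model_prefix,  # e.g. 's3://bucket/models/'
        model=base_model,                # shared container/framework for all models
    )
    return mme.deploy(initial_instance_count=1, instance_type='ml.m5.xlarge')

# SageMaker lazily loads the named artifact from S3 into an in-memory LRU cache:
# predictor.predict(payload, target_model=target_model_key(42))
```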

6. Multi-Container Endpoint

  • Run different model types in separate containers on same endpoint
  • Each container handles different inference steps
  • Use: ensembles, pre/post-processing pipelines, different frameworks

Endpoint Comparison

| Type | Latency | Max Payload | Cost Model | Use Case |
|---|---|---|---|---|
| Real-time | ms | 6 MB | Per-hour | Live predictions |
| Batch Transform | Minutes | Unlimited | Per-job | Bulk offline scoring |
| Serverless | ms–seconds | 4 MB | Per-invocation | Sporadic traffic |
| Async | Seconds–minutes | 1 GB | Per-hour + queue | Large payloads |
| Multi-Model | ms + cold start | 6 MB | Per-hour (shared) | Many models, cost savings |

Inference Optimisation

SageMaker Inference Recommender

  • Benchmarks your model across instance types → recommends best cost/performance
  • Default job: 45-minute benchmark across SageMaker-recommended instances
  • Advanced job: custom load test against your specific traffic pattern
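
A job can be launched against a registered model package via boto3 — a sketch where the job name and ARNs are placeholders:

```python
def recommender_job_type(advanced):
    # 'Default' runs the ~45-minute benchmark; 'Advanced' runs a custom load test
    return 'Advanced' if advanced else 'Default'

def start_recommender_job(role_arn, model_package_arn, advanced=False):
    """Launch an Inference Recommender job for a registered model package."""
    import boto3
    return boto3.client('sagemaker').create_inference_recommendations_job(
        JobName='xgb-instance-recommendation',
        JobType=recommender_job_type(advanced),
        RoleArn=role_arn,
        InputConfig={'ModelPackageVersionArn': model_package_arn},
    )
```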

SageMaker Neo

  • Compile and optimise models for specific hardware targets
  • Reduces model size, increases inference speed (up to 25×)
  • Targets: cloud (ml.c5, ml.p3), edge (Jetson, Raspberry Pi, Greengrass)
  • Supports: TensorFlow, PyTorch, MXNet, ONNX, XGBoost

Elastic Inference (EI)

  • Attach fractional GPU (for inference only) to CPU instance
  • Lower cost than full GPU instance for inference workloads
  • Now deprecated by AWS — superseded by Inferentia-based instances (inf1/inf2)

AWS Inferentia / Trainium

  • Inferentia: AWS-custom chip optimised for inference — up to 70% cost reduction
  • Trainium: AWS-custom chip for training — most cost-efficient for large models
  • Use with inf1, inf2, trn1 instance types

SageMaker Pipelines — MLOps

Automated, repeatable ML workflows with DAG-based steps.

graph LR
    PP[Processing Step<br/>Data prep] --> TS[Training Step<br/>Train model]
    TS --> ES[Evaluation Step<br/>Run metrics]
    ES -->|metrics pass| RS[Register Step<br/>Model Registry]
    ES -->|metrics fail| Fail[Fail Step]
    RS --> DS[Deploy Step]

    style PP fill:#dbeafe,stroke:#3b82f6
    style TS fill:#dcfce7,stroke:#16a34a
    style ES fill:#fef9c3,stroke:#ca8a04
    style RS fill:#f3e8ff,stroke:#9333ea

Pipeline Steps

| Step | Purpose |
|---|---|
| ProcessingStep | Data preprocessing, feature engineering, evaluation |
| TrainingStep | Train a model |
| TuningStep | Hyperparameter tuning |
| TransformStep | Batch transform / inference |
| RegisterModel | Register to Model Registry |
| CreateModel | Create deployable model |
| ConditionStep | Branching logic (if/else) based on metrics |
| LambdaStep | Run a Lambda function |
| FailStep | Stop pipeline with an error message |
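
The ConditionStep branching can be sketched as below, assuming the evaluation step writes a property file with the metric — step names and the JSON path are illustrative:

```python
def metric_json_path(metric='accuracy'):
    # Path into the evaluation report the evaluation ProcessingStep writes
    return f'metrics.{metric}.value'

def accuracy_gate(eval_step, evaluation_report, register_step, fail_step,
                  threshold=0.85):
    """Register the model only when evaluated accuracy clears the threshold."""
    from sagemaker.workflow.condition_step import ConditionStep
    from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
    from sagemaker.workflow.functions import JsonGet
    condition = ConditionGreaterThanOrEqualTo(
        left=JsonGet(step_name=eval_step.name,
                     property_file=evaluation_report,  # PropertyFile of eval_step
                     json_path=metric_json_path()),
        right=threshold,
    )
    return ConditionStep(name='CheckAccuracy',
                         conditions=[condition],
                         if_steps=[register_step],
                         else_steps=[fail_step])
```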

Pipeline Parameters

from sagemaker.workflow.parameters import ParameterString, ParameterFloat

training_instance_type = ParameterString(
    name='TrainingInstanceType',
    default_value='ml.m5.xlarge'
)
accuracy_threshold = ParameterFloat(name='AccuracyThreshold', default_value=0.85)

Pipeline Caching

  • Steps can be cached — if input data + config unchanged, skip re-running
  • Reduces cost and time in iterative development
  • Set cache_config=CacheConfig(enable_caching=True, expire_after='PT1H')

CI/CD for ML

graph LR
    Code[Code Push<br/>GitHub] --> CB[CodeBuild<br/>Test + Build]
    CB --> CP[CodePipeline<br/>Orchestrate]
    CP --> SM[SageMaker Pipeline<br/>Train + Evaluate]
    SM -->|Approved| Deploy[Deploy to Staging]
    Deploy -->|Tests pass| Prod[Deploy to Production]

    style Code fill:#dbeafe,stroke:#3b82f6
    style SM fill:#dcfce7,stroke:#16a34a
    style Prod fill:#fed7aa,stroke:#ea580c

Key Services

  • CodePipeline — orchestrates the CI/CD pipeline
  • CodeBuild — runs tests, builds containers, pushes to ECR
  • CodeCommit / GitHub / Bitbucket — source control
  • EventBridge — trigger pipeline on: new data in S3, model approval, schedule
  • Lambda — lightweight automation steps

MLOps Maturity Levels

| Level | Description |
|---|---|
| Level 0 | Manual process — no automation |
| Level 1 | Automated training pipeline, manual deployment |
| Level 2 | Fully automated CI/CD for training + deployment |

Model Deployment Strategies

Blue/Green Deployment

  • Deploy new version alongside old version
  • Shift traffic from old (blue) → new (green) gradually
  • SageMaker supports traffic shifting on endpoints via production variants
  • Easy rollback — switch traffic back to blue
# Shift traffic between two production variants on one endpoint
import boto3
sm = boto3.client('sagemaker')
sm.update_endpoint_weights_and_capacities(
    EndpointName='my-endpoint',
    DesiredWeightsAndCapacities=[
        {'VariantName': 'blue', 'DesiredWeight': 0.2},
        {'VariantName': 'green', 'DesiredWeight': 0.8},
    ]
)

Canary Deployment

  • Send small % of traffic (e.g. 5%) to new model
  • Monitor metrics before scaling up
  • SageMaker endpoint variant weights: {"variant-A": 0.95, "variant-B": 0.05}
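
The weight split can be computed and applied with a small helper — endpoint and variant names are placeholders:

```python
def canary_weights(canary_fraction, prod='variant-A', canary='variant-B'):
    # Split traffic so the canary sees only a small slice
    if not 0.0 <= canary_fraction <= 1.0:
        raise ValueError('canary_fraction must be between 0 and 1')
    return [
        {'VariantName': prod, 'DesiredWeight': round(1.0 - canary_fraction, 6)},
        {'VariantName': canary, 'DesiredWeight': canary_fraction},
    ]

def apply_weights(endpoint_name, weights):
    """Push new variant weights without redeploying either model."""
    import boto3
    boto3.client('sagemaker').update_endpoint_weights_and_capacities(
        EndpointName=endpoint_name,
        DesiredWeightsAndCapacities=weights,
    )
```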

A/B Testing

  • Route traffic to two model variants simultaneously
  • Measure business metrics (click-through, conversion)
  • SageMaker Endpoint Production Variants — each variant = different model/instance

Shadow Testing (Challenger Model)

  • New model receives same traffic as production model
  • Shadow model's predictions are not served to users — only logged
  • Compare shadow vs production metrics without risk
  • SageMaker supports shadow variants natively
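
A shadow variant is declared in the endpoint config alongside the production variant — a boto3 sketch with placeholder model and config names:

```python
def variant(name, model_name, weight=1.0, instance_type='ml.m5.xlarge'):
    # One production (or shadow) variant definition
    return {'VariantName': name,
            'ModelName': model_name,
            'InitialInstanceCount': 1,
            'InstanceType': instance_type,
            'InitialVariantWeight': weight}

def shadow_endpoint_config(config_name, prod_model, shadow_model):
    """Endpoint config where the shadow model mirrors traffic but is never served."""
    import boto3
    return boto3.client('sagemaker').create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[variant('production', prod_model)],
        # For a shadow variant, the weight is the fraction of traffic mirrored to it
        ShadowProductionVariants=[variant('shadow', shadow_model)],
    )
```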

Lambda + SageMaker Patterns

Serverless Inference with Lambda

API Gateway → Lambda → SageMaker Endpoint → Lambda → Response
  • Lambda calls sagemaker-runtime.invoke_endpoint()
  • Good for: API-backed inference, custom auth/pre/post-processing
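
The flow above fits in a short Lambda handler — a sketch assuming API Gateway proxy integration and a CSV-accepting endpoint named `my-endpoint`:

```python
import json

def api_response(prediction, status=200):
    # Shape the proxy-integration response API Gateway expects back from Lambda
    return {'statusCode': status,
            'headers': {'Content-Type': 'application/json'},
            'body': json.dumps({'prediction': prediction})}

def handler(event, context):
    """Lambda behind API Gateway forwarding the request body to SageMaker."""
    import boto3
    runtime = boto3.client('sagemaker-runtime')
    response = runtime.invoke_endpoint(
        EndpointName='my-endpoint',
        ContentType='text/csv',
        Body=event['body'],  # raw request body proxied through by API Gateway
    )
    return api_response(response['Body'].read().decode('utf-8'))
```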

Lambda for Lightweight Models

  • For very simple models (rule-based, tiny sklearn) — run model inside Lambda with no SageMaker
  • Limitations: 10 GB memory, 15 min timeout, cold start

Step Functions for ML

  • Visual workflow orchestration alternative to SageMaker Pipelines
  • Pre-built Data Science SDK for Step Functions with SageMaker integration
  • Use when: need more complex orchestration, mix SageMaker + non-SageMaker steps
  • States: Task, Choice, Parallel, Wait, Fail, Succeed

Amazon EventBridge for ML Automation

| Event Source | Trigger Action |
|---|---|
| S3 new object | Start Glue ETL → SageMaker Pipeline |
| SageMaker Training complete | Lambda → evaluate → register model |
| Model Registry status → Approved | CodePipeline → deploy to staging |
| CloudWatch alarm (drift detected) | Trigger retraining pipeline |
| Schedule (cron) | Weekly retraining pipeline |
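
The first row of the table can be sketched as an EventBridge rule — the bucket and rule names are placeholders, and S3 must have EventBridge notifications enabled:

```python
def s3_object_created_pattern(bucket):
    # Matches the events S3 emits to EventBridge when a new object lands
    return {'source': ['aws.s3'],
            'detail-type': ['Object Created'],
            'detail': {'bucket': {'name': [bucket]}}}

def create_retrain_rule(bucket, rule_name='retrain-on-new-data'):
    """Rule that fires on new S3 objects; a target then starts the pipeline."""
    import json
    import boto3
    events = boto3.client('events')
    events.put_rule(Name=rule_name,
                    EventPattern=json.dumps(s3_object_created_pattern(bucket)))
    # events.put_targets(...) would then point the rule at the SageMaker pipeline
```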

Domain 3 — Exam Scenarios

| Scenario | Answer |
|---|---|
| Live predictions, low latency | Real-time Endpoint |
| Score 10M records overnight | Batch Transform |
| Infrequent predictions, pay-per-use | Serverless Inference |
| Large document inference, 500 MB payload | Async Inference |
| 1000 customer-specific models, cost-efficient | Multi-Model Endpoint |
| Test new model without user impact | Shadow Testing (Challenger variant) |
| Gradually shift traffic to new model | Canary / Blue-Green deployment |
| Automate train → evaluate → deploy pipeline | SageMaker Pipelines |
| Trigger retraining when new data arrives in S3 | EventBridge → SageMaker Pipeline |
| Optimise model for edge device | SageMaker Neo |
| Cheapest GPU option for inference | Elastic Inference or Inferentia (inf1/inf2) |
| Find best instance type for inference | SageMaker Inference Recommender |
| Version control + approval for models | SageMaker Model Registry |
| Reuse unchanged pipeline steps | SageMaker Pipeline Caching |
| CI/CD for ML on AWS | CodePipeline + CodeBuild + SageMaker Pipelines |