Domain 4 — Monitoring, Maintenance & Security (24%)
Focus: SageMaker Model Monitor, Clarify, drift detection, retraining triggers, IAM, VPC, encryption, and cost optimisation.
Why Models Degrade
graph LR
Deploy[Model Deployed] --> DriftD[Data Drift<br/>Input distribution changes]
Deploy --> ConceptD[Concept Drift<br/>Relationship changes]
Deploy --> QualD[Data Quality<br/>Missing / corrupt inputs]
Deploy --> BiasD[Bias Drift<br/>Fairness changes over time]
DriftD --> Retrain[Trigger Retraining]
ConceptD --> Retrain
QualD --> Alert[Alert + Fix Pipeline]
BiasD --> Retrain
style DriftD fill:#fef9c3,stroke:#ca8a04
style ConceptD fill:#fed7aa,stroke:#ea580c
style QualD fill:#fce7f3,stroke:#db2777
style BiasD fill:#f3e8ff,stroke:#9333ea
| Type | What Changes | Example |
|---|---|---|
| Data Drift (Covariate Shift) | Input feature distribution | User age distribution shifts with new market |
| Concept Drift (Label Drift) | Relationship between input and output | Fraud patterns change |
| Data Quality | Data format, missing values, type errors | Upstream pipeline bug sends nulls |
| Model Quality | Prediction accuracy degrades | Ground truth labels show model is wrong |
| Bias Drift | Fairness metrics change | Model becomes biased toward certain group |
SageMaker Model Monitor
Continuously monitors endpoints for quality issues.
Monitor Types
| Monitor | Detects | Baseline From |
|---|---|---|
| Data Quality Monitor | Feature distribution drift, missing values, type changes | Training data statistics + schema |
| Model Quality Monitor | Prediction accuracy degradation | Ground truth labels (if available) |
| Bias Drift Monitor | Bias metrics changing post-deployment | Clarify bias config |
| Feature Attribution Drift | SHAP values changing | Clarify explainability baseline |
How It Works
graph LR
EP[Endpoint<br/>Live Traffic] --> Cap[Data Capture<br/>Inputs + Outputs → S3]
Cap --> S3[S3<br/>Captured Data]
S3 --> MM[Model Monitor<br/>Processing Job]
MM --> Baseline[Compare vs<br/>Baseline]
Baseline -->|Violation| CW[CloudWatch<br/>Alarm]
CW -->|Trigger| EB[EventBridge]
EB --> RT[Retraining<br/>Pipeline]
Setup Steps
- Enable Data Capture on endpoint → logs requests/responses to S3
- Create Baseline — run baseline job on training data → generates statistics + constraints
- Create Monitoring Schedule — runs monitoring jobs periodically (e.g. hourly)
- Set CloudWatch Alarms on violations (see the alarm sketch after the code below)
- EventBridge Rule → trigger retraining pipeline on alarm (sketched under Retraining Architecture)
# Enable data capture — pass this config to model.deploy(..., data_capture_config=...)
from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri='s3://bucket/captures/'
)

# Create baseline — statistics + constraints generated from the training data
my_monitor = DefaultModelMonitor(role=role)
my_monitor.suggest_baseline(
    baseline_dataset='s3://bucket/training-data/',
    dataset_format=DatasetFormat.csv()
)

# Schedule monitoring — compare captured traffic against the baseline every hour
my_monitor.create_monitoring_schedule(
    monitor_schedule_name='my-monitor',
    endpoint_input='my-endpoint',
    statistics=my_monitor.baseline_statistics(),
    constraints=my_monitor.suggested_constraints(),
    schedule_cron_expression='cron(0 * ? * * *)'  # hourly
)
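A hedged boto3 sketch of the CloudWatch alarm step — the metric namespace/name, threshold, and SNS topic are assumptions to verify against what your monitoring job actually publishes (the EventBridge step is sketched under Retraining Architecture below):
import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm on a per-feature drift metric emitted by the data quality monitor
# (namespace and metric name are assumptions — check what your schedule publishes)
cloudwatch.put_metric_alarm(
    AlarmName='feature-drift-alarm',
    Namespace='aws/sagemaker/Endpoints/data-metrics',
    MetricName='feature_baseline_drift_age',          # hypothetical per-feature metric
    Dimensions=[
        {'Name': 'Endpoint', 'Value': 'my-endpoint'},
        {'Name': 'MonitoringSchedule', 'Value': 'my-monitor'}
    ],
    Statistic='Maximum',
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.4,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:drift-alerts']  # placeholder SNS topic
)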
SageMaker Clarify
Detects bias and explains model predictions.
Pre-training Bias (Data Bias)
Run before training to assess if training data is fair.
Metrics:
| Metric | Meaning |
|---|---|
| Class Imbalance (CI) | One group is underrepresented |
| DPL (Difference in Proportions of Labels) | Positive outcome rate differs across groups |
| KL Divergence | Distribution difference between groups |
| Total Variation Distance (TVD) | Statistical distance between distributions |
Post-training Bias (Model Bias)
Run after training to assess if model treats groups differently.
Metrics:
| Metric | Meaning |
|---|---|
| DPPL | Difference in positive prediction rates |
| Equalised Odds | FPR and TPR difference across groups |
| Accuracy Difference | Model is more accurate for one group |
| Recall Difference | Model catches fewer positives for one group |
Explainability — SHAP Values
- SHAP (SHapley Additive exPlanations) — assigns importance scores to each feature for each prediction
- Global SHAP — average importance across all predictions
- Local SHAP — importance for a single prediction
- Outputs: feature importance chart, partial dependence plots
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

# SHAP config — baseline is a list of reference rows, here one row of [age, income]
shap_config = clarify.SHAPConfig(
    baseline=[[30, 50000]],
    num_samples=100,
    agg_method='mean_abs'
)
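A hedged sketch of running the processor for pre-training bias and explainability — DataConfig, BiasConfig, ModelConfig, run_pre_training_bias and run_explainability are the standard Clarify entry points; the S3 paths, column names, facet threshold, and model name are placeholders:
# Dataset layout, output location, and the label column (paths/columns are placeholders)
data_config = clarify.DataConfig(
    s3_data_input_path='s3://bucket/training-data/train.csv',
    s3_output_path='s3://bucket/clarify-output/',
    label='target',
    headers=['target', 'age', 'income'],
    dataset_type='text/csv'
)

# Facet = the sensitive attribute to check; here age split at 40 (illustrative)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name='age',
    facet_values_or_threshold=[40]
)

# Model used for post-training analyses and SHAP
model_config = clarify.ModelConfig(
    model_name='my-model',
    instance_type='ml.m5.xlarge',
    instance_count=1,
    accept_type='text/csv'
)

# Pre-training (data) bias — e.g. Class Imbalance, DPL
clarify_processor.run_pre_training_bias(
    data_config=data_config,
    data_bias_config=bias_config,
    methods=['CI', 'DPL']
)

# Explainability — per-feature SHAP attributions using the config above
clarify_processor.run_explainability(
    data_config=data_config,
    model_config=model_config,
    explainability_config=shap_config
)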
Retraining Strategies
| Trigger | Method |
|---|---|
| Scheduled | Cron job via EventBridge — retrain weekly/monthly |
| Drift detected | Model Monitor alarm → EventBridge → SageMaker Pipeline |
| Accuracy threshold | Ground truth labels accumulate → evaluate → trigger if below threshold |
| New data volume | Lambda checks S3 for N new records → trigger |
| Manual | On-demand retraining |
Retraining Architecture
Model Monitor Alarm
↓
EventBridge Rule
↓
SageMaker Pipeline trigger
↓
ProcessingStep (new data prep) → TrainingStep → EvaluationStep
↓
ConditionStep (metrics > threshold?)
↓
RegisterModel → Approve → Deploy (Blue/Green)
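A hedged sketch of the alarm-to-pipeline wiring, assuming the drift alarm from the earlier sketch and a pipeline named retraining-pipeline (ARNs and the EventBridge role are placeholders):
import boto3

events = boto3.client('events')

# Fire when the drift alarm transitions into ALARM state
events.put_rule(
    Name='drift-retraining-rule',
    EventPattern='''{
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"alarmName": ["feature-drift-alarm"], "state": {"value": ["ALARM"]}}
    }'''
)

# Start the SageMaker Pipeline directly as the rule target
events.put_targets(
    Rule='drift-retraining-rule',
    Targets=[{
        'Id': 'start-retraining-pipeline',
        'Arn': 'arn:aws:sagemaker:us-east-1:123456789012:pipeline/retraining-pipeline',
        'RoleArn': 'arn:aws:iam::123456789012:role/EventBridgeStartPipelineRole',
        'SageMakerPipelineParameters': {'PipelineParameterList': []}
    }]
)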
Security in SageMaker
IAM for ML
graph TD
Dev[Data Scientist<br/>IAM User/Role] -->|assume| SR[SageMaker<br/>Execution Role]
SR --> S3[S3<br/>Read training data]
SR --> ECR[ECR<br/>Pull/Push containers]
SR --> CW[CloudWatch<br/>Write logs/metrics]
SR --> KMS[KMS<br/>Encrypt/decrypt]
SR --> SM2[SageMaker<br/>API calls]
Execution Role best practices:
- Least privilege — only permissions the training job needs
- Separate roles for: training, inference, pipeline execution
- Use IAM conditions to restrict to specific S3 buckets
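For the last point, a sketch of an inline policy scoped to one bucket, attached to the execution role via boto3 (role, policy, and bucket names are placeholders):
import json
import boto3

iam = boto3.client('iam')

# Allow the execution role to touch only the one project bucket (names are placeholders)
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-ml-project-bucket",
            "arn:aws:s3:::my-ml-project-bucket/*"
        ]
    }]
}

iam.put_role_policy(
    RoleName='SageMakerExecutionRole-TeamA',
    PolicyName='restrict-to-project-bucket',
    PolicyDocument=json.dumps(policy)
)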
Encryption
| Data | Encryption |
|---|---|
| S3 data (at rest) | SSE-S3, SSE-KMS, SSE-C |
| EBS volumes (training) | KMS CMK |
| Model artefacts (S3) | KMS CMK on S3 |
| Inter-node training traffic | Set encrypt_inter_container_traffic=True on the Estimator (sketch after this table) |
| Data in transit | TLS (HTTPS) — all SageMaker API calls |
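A sketch of where these settings sit on an Estimator — the training image, role, and KMS key identifiers are placeholders:
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,                    # training image (placeholder)
    role=role,
    instance_count=2,
    instance_type='ml.m5.xlarge',
    output_path='s3://bucket/models/',
    volume_kms_key='<kms-key-id>',          # encrypt EBS volumes attached to training instances
    output_kms_key='<kms-key-id>',          # encrypt model artefacts written to S3
    encrypt_inter_container_traffic=True    # TLS between nodes in distributed training
)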
VPC for SageMaker
Run SageMaker workloads inside a VPC for network isolation:
# Training job in VPC
estimator = PyTorch(
...
subnets=['subnet-xxx'], # private subnets
security_group_ids=['sg-xxx'], # security group
)
- Private subnets — no internet access; instances use VPC endpoints to reach S3, ECR, etc.
- VPC Endpoints (PrivateLink) — access AWS services without traversing the internet (see the sketch after this list):
  - com.amazonaws.region.sagemaker.api
  - com.amazonaws.region.sagemaker.runtime
  - com.amazonaws.region.s3
  - com.amazonaws.region.ecr.api
- Security Groups — control inbound/outbound traffic for training instances
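A hedged boto3 sketch of creating one such interface endpoint (IDs are placeholders; S3 is typically reached via a Gateway endpoint instead):
import boto3

ec2 = boto3.client('ec2')

# Interface endpoint so private subnets can reach the SageMaker Runtime API
ec2.create_vpc_endpoint(
    VpcId='vpc-xxx',
    ServiceName='com.amazonaws.us-east-1.sagemaker.runtime',
    VpcEndpointType='Interface',
    SubnetIds=['subnet-xxx'],
    SecurityGroupIds=['sg-xxx'],
    PrivateDnsEnabled=True
)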
SageMaker Role-Based Access Control
| Feature | Tool |
|---|---|
| Control who can create/delete endpoints | IAM policies |
| Control access to specific notebooks | IAM + resource tags |
| Isolate teams in Studio | SageMaker Studio Domains + IAM |
| Audit all API calls | CloudTrail |
Data Privacy
- SageMaker Ground Truth — anonymise data before sending to human labellers
- Differential Privacy — add noise to model training (not natively in SM, use TF Privacy)
- Federated Learning — train across distributed data without centralising it
Cost Optimisation
| Strategy | Saving |
|---|---|
| Spot instances for training | Up to 90% vs on-demand |
| Serverless inference for low traffic | Pay per invocation, no idle cost |
| Multi-Model Endpoints | One endpoint serves thousands of models |
| Right-sizing instances | Use Inference Recommender to find the optimal instance type |
| SageMaker Neo | Smaller, faster model = cheaper inference |
| Auto Scaling endpoints | Scale in during low traffic (see sketch after this table) |
| Lifecycle configurations | Auto-stop idle notebooks |
| S3 Intelligent-Tiering | Auto-move infrequently used data to cheaper tier |
| Graviton instances | Up to ~40% better price/performance for CPU workloads |
| Delete unused endpoints | Running endpoints = ongoing cost even if no traffic |
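A sketch of endpoint auto scaling via Application Auto Scaling — the endpoint/variant names, capacity range, and target of ~70 invocations per instance are placeholders:
import boto3

autoscaling = boto3.client('application-autoscaling')

# Register the endpoint variant as a scalable target (1–4 instances)
autoscaling.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=4
)

# Target-tracking policy: keep roughly 70 invocations per instance
autoscaling.put_scaling_policy(
    PolicyName='invocations-target-tracking',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 70.0,
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)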
Cost Monitoring
- AWS Cost Explorer — track SageMaker spending by service/tag
- AWS Budgets — set alerts when spending exceeds threshold
- Tag everything — project, team, environment tags for cost allocation
- SageMaker Savings Plans — commit to $/hour usage for discounts
CloudWatch for ML
| Metric | What It Tracks |
|---|---|
| Invocations | Number of inference requests |
| InvocationErrors | Failed inference calls |
| ModelLatency | Time model takes to generate prediction |
| OverheadLatency | SageMaker infrastructure overhead |
| CPUUtilization | Training/inference instance CPU |
| MemoryUtilization | RAM usage |
| GPUUtilization | GPU usage |
| DiskUtilization | Storage usage |
Custom Metrics from Training:
# In the training script — print metrics to stdout; SageMaker parses the job logs
# with the metric_definitions regexes below and publishes them to CloudWatch
print(f"train:loss={loss:.4f}")
print(f"validation:accuracy={accuracy:.4f}")
Regex in Estimator to capture custom metrics:
metric_definitions = [
{'Name': 'train:loss', 'Regex': 'train:loss=([0-9\\.]+)'},
{'Name': 'validation:accuracy', 'Regex': 'validation:accuracy=([0-9\\.]+)'}
]
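And attaching them to an estimator so the values appear as CloudWatch metrics (image and role are placeholders, as above):
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge',
    metric_definitions=metric_definitions,   # regexes applied to the job's log stream
    enable_sagemaker_metrics=True
)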
Domain 4 — Exam Scenarios
| Scenario | Answer |
|---|---|
| Detect feature distribution shift post-deploy | SageMaker Model Monitor (Data Quality) |
| Detect accuracy degradation with ground truth | SageMaker Model Monitor (Model Quality) |
| Detect if model is becoming biased over time | SageMaker Model Monitor (Bias Drift) |
| Explain why a specific prediction was made | SageMaker Clarify (local SHAP) |
| Find most important features globally | SageMaker Clarify (global SHAP) |
| Detect bias before training begins | SageMaker Clarify (pre-training bias) |
| Auto-trigger retraining on drift | Model Monitor → CloudWatch Alarm → EventBridge → Pipeline |
| Secure training data in transit between nodes | Enable inter-container traffic encryption |
| SageMaker access to S3 without internet | VPC Endpoints (PrivateLink) |
| Audit all SageMaker API calls | CloudTrail |
| Reduce cost for rarely-used models | Multi-Model Endpoint |
| Reduce endpoint cost during off-hours | Serverless Inference (scales to zero) or Auto Scaling scale-in |
| Encrypt model artefacts in S3 | S3 SSE-KMS with CMK |
| Restrict data scientist to specific S3 bucket | IAM policy with resource condition |
| Monitor training in real-time | CloudWatch metrics + SageMaker Experiments |