Domain 4 — Monitoring, Maintenance & Security (24%)

Focus: SageMaker Model Monitor, Clarify, drift detection, retraining triggers, IAM, VPC, encryption, and cost optimisation.


Why Models Degrade

graph LR
    Deploy[Model Deployed] --> DriftD[Data Drift<br/>Input distribution changes]
    Deploy --> ConceptD[Concept Drift<br/>Relationship changes]
    Deploy --> QualD[Data Quality<br/>Missing / corrupt inputs]
    Deploy --> BiasD[Bias Drift<br/>Fairness changes over time]

    DriftD --> Retrain[Trigger Retraining]
    ConceptD --> Retrain
    QualD --> Alert[Alert + Fix Pipeline]
    BiasD --> Retrain

    style DriftD fill:#fef9c3,stroke:#ca8a04
    style ConceptD fill:#fed7aa,stroke:#ea580c
    style QualD fill:#fce7f3,stroke:#db2777
    style BiasD fill:#f3e8ff,stroke:#9333ea

| Type | What Changes | Example |
| --- | --- | --- |
| Data Drift (Covariate Shift) | Input feature distribution | User age distribution shifts with new market |
| Concept Drift (Label Drift) | Relationship between input and output | Fraud patterns change |
| Data Quality | Data format, missing values, type errors | Upstream pipeline bug sends nulls |
| Model Quality | Prediction accuracy degrades | Ground truth labels show model is wrong |
| Bias Drift | Fairness metrics change | Model becomes biased toward certain group |

SageMaker Model Monitor

Continuously monitors endpoints for quality issues.

Monitor Types

| Monitor | Detects | Baseline From |
| --- | --- | --- |
| Data Quality Monitor | Feature distribution drift, missing values, type changes | Training data statistics + schema |
| Model Quality Monitor | Prediction accuracy degradation | Ground truth labels (if available) |
| Bias Drift Monitor | Bias metrics changing post-deployment | Clarify bias config |
| Feature Attribution Drift | SHAP values changing | Clarify explainability baseline |

How It Works

graph LR
    EP[Endpoint<br/>Live Traffic] --> Cap[Data Capture<br/>Inputs + Outputs → S3]
    Cap --> S3[S3<br/>Captured Data]
    S3 --> MM[Model Monitor<br/>Processing Job]
    MM --> Baseline[Compare vs<br/>Baseline]
    Baseline -->|Violation| CW[CloudWatch<br/>Alarm]
    CW -->|Trigger| EB[EventBridge]
    EB --> RT[Retraining<br/>Pipeline]

Setup Steps

  1. Enable Data Capture on endpoint → logs requests/responses to S3
  2. Create Baseline — run baseline job on training data → generates statistics + constraints
  3. Create Monitoring Schedule — runs monitoring jobs periodically (e.g. hourly)
  4. Set CloudWatch Alarms on violations
  5. EventBridge Rule → trigger retraining pipeline on alarm
# Enable data capture — config is passed to model.deploy(..., data_capture_config=...)
from sagemaker.model_monitor import DataCaptureConfig, DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

data_capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri='s3://bucket/captures/'
)

# Create baseline — generates statistics + constraints from the training data
my_monitor = DefaultModelMonitor(role=role, instance_count=1, instance_type='ml.m5.xlarge')
my_monitor.suggest_baseline(
    baseline_dataset='s3://bucket/training-data/',
    dataset_format=DatasetFormat.csv(),
    output_s3_uri='s3://bucket/baseline/'
)

# Schedule monitoring — compares captured traffic against the baseline hourly
my_monitor.create_monitoring_schedule(
    monitor_schedule_name='my-monitor',
    endpoint_input='my-endpoint',
    statistics=my_monitor.baseline_statistics(),
    constraints=my_monitor.suggested_constraints(),
    schedule_cron_expression='cron(0 * ? * * *)'  # hourly
)

SageMaker Clarify

Detects bias and explains model predictions.

Pre-training Bias (Data Bias)

Run before training to assess whether the training data itself is fair.

Metrics:

| Metric | Meaning |
| --- | --- |
| Class Imbalance (CI) | One group is underrepresented |
| DPL (Difference in Proportions of Labels) | Positive outcome rate differs across groups |
| KL Divergence | Distribution difference between groups |
| Total Variation Distance (TVD) | Statistical distance between distributions |
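
A minimal sketch of a pre-training bias job, assuming a hypothetical tabular dataset with a gender facet column and an approved label (the S3 paths, column names and role are placeholders, not from the original notes):

from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

# Where the data lives and which column is the label (hypothetical names)
bias_data_config = clarify.DataConfig(
    s3_data_input_path='s3://bucket/training-data/train.csv',
    s3_output_path='s3://bucket/clarify/pre-training-bias/',
    label='approved',
    headers=['age', 'income', 'gender', 'approved'],
    dataset_type='text/csv'
)

# Which group to check and what counts as a positive outcome
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],   # positive label value
    facet_name='gender',             # sensitive attribute
    facet_values_or_threshold=[0]    # group to compare against the rest
)

clarify_processor.run_pre_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
    methods='all'                    # or a subset of the metrics above
)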

Post-training Bias (Model Bias)

Run after training to assess whether the model treats groups differently.

Metrics:

| Metric | Meaning |
| --- | --- |
| DPPL (Difference in Positive Proportions in Predicted Labels) | Difference in positive prediction rates across groups |
| Equalised Odds | FPR and TPR difference across groups |
| Accuracy Difference | Model is more accurate for one group |
| Recall Difference | Model catches fewer positives for one group |
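
Post-training bias uses the same processor plus a ModelConfig so Clarify can query the trained model. A hedged sketch, reusing the data and bias configs from the pre-training example above (the model name is a placeholder):

# Clarify stands up a temporary shadow endpoint for the named model during the job
model_config = clarify.ModelConfig(
    model_name='my-model',
    instance_type='ml.m5.xlarge',
    instance_count=1,
    accept_type='text/csv'
)

# How to interpret the model's output (here: a probability thresholded at 0.5)
predicted_label_config = clarify.ModelPredictedLabelConfig(probability_threshold=0.5)

clarify_processor.run_post_training_bias(
    data_config=bias_data_config,
    data_bias_config=bias_config,
    model_config=model_config,
    model_predicted_label_config=predicted_label_config
)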

Explainability — SHAP Values

  • SHAP (SHapley Additive exPlanations) — assigns importance scores to each feature for each prediction
  • Global SHAP — average importance across all predictions
  • Local SHAP — importance for a single prediction
  • Outputs: feature importance chart, partial dependence plots
from sagemaker import clarify

clarify_processor = clarify.SageMakerClarifyProcessor(
    role=role,
    instance_count=1,
    instance_type='ml.m5.xlarge'
)

# SHAPConfig is passed as the explainability_config to
# clarify_processor.run_explainability(), alongside a DataConfig and ModelConfig
shap_config = clarify.SHAPConfig(
    baseline=[[30, 50000]],   # baseline row(s), e.g. [age, income]; an S3 path also works
    num_samples=100,
    agg_method='mean_abs'
)

Retraining Strategies

| Trigger | Method |
| --- | --- |
| Scheduled | Cron job via EventBridge — retrain weekly/monthly |
| Drift detected | Model Monitor alarm → EventBridge → SageMaker Pipeline |
| Accuracy threshold | Ground truth labels accumulate → evaluate → trigger if below threshold |
| New data volume | Lambda checks S3 for N new records → trigger |
| Manual | On-demand retraining |

Retraining Architecture

Model Monitor Alarm
        ↓
EventBridge Rule
        ↓
SageMaker Pipeline trigger
        ↓
ProcessingStep (new data prep) → TrainingStep → EvaluationStep
        ↓
ConditionStep (metrics > threshold?)
        ↓
RegisterModel → Approve → Deploy (Blue/Green)
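
A hedged boto3 sketch of the trigger wiring, assuming a Model Monitor violation alarm named drift-alarm and an existing pipeline called retrain-pipeline (all names, ARNs and the account ID are placeholders):

import boto3

events = boto3.client('events')

# Fire when the drift alarm enters the ALARM state
events.put_rule(
    Name='model-drift-retrain',
    EventPattern='''{
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"alarmName": ["drift-alarm"], "state": {"value": ["ALARM"]}}
    }'''
)

# EventBridge can start a SageMaker Pipeline directly as a target
events.put_targets(
    Rule='model-drift-retrain',
    Targets=[{
        'Id': 'retraining-pipeline',
        'Arn': 'arn:aws:sagemaker:eu-west-1:123456789012:pipeline/retrain-pipeline',
        'RoleArn': 'arn:aws:iam::123456789012:role/EventBridgeStartPipelineRole'
    }]
)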

Security in SageMaker

IAM for ML

graph TD
    Dev[Data Scientist<br/>IAM User/Role] -->|assume| SR[SageMaker<br/>Execution Role]
    SR --> S3[S3<br/>Read training data]
    SR --> ECR[ECR<br/>Pull/Push containers]
    SR --> CW[CloudWatch<br/>Write logs/metrics]
    SR --> KMS[KMS<br/>Encrypt/decrypt]
    SR --> SM2[SageMaker<br/>API calls]

Execution Role best practices:

  • Least privilege — only permissions the training job needs
  • Separate roles for: training, inference, pipeline execution
  • Use IAM conditions to restrict access to specific S3 buckets (see the sketch after this list)
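
A minimal sketch of scoping an execution role to a single project bucket via the Resource element (a Condition block could further restrict by tag or prefix). Role, policy and bucket names are placeholders:

import json
import boto3

iam = boto3.client('iam')

# Inline policy that only allows reading/writing the project's ML bucket
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-ml-bucket",
            "arn:aws:s3:::my-ml-bucket/*"
        ]
    }]
}

iam.put_role_policy(
    RoleName='SageMakerTrainingExecutionRole',
    PolicyName='limit-to-project-bucket',
    PolicyDocument=json.dumps(policy)
)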

Encryption

| Data | Encryption |
| --- | --- |
| S3 data (at rest) | SSE-S3, SSE-KMS, SSE-C |
| EBS volumes (training) | KMS CMK |
| Model artefacts (S3) | KMS CMK on S3 |
| Inter-node training traffic | Set encrypt_inter_container_traffic=True on the estimator (EnableInterContainerTrafficEncryption in the API) |
| Data in transit | TLS (HTTPS) — all SageMaker API calls |

VPC for SageMaker

Run SageMaker workloads inside a VPC for network isolation:

# Training job in VPC
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    # ... entry_point, role, instance settings as usual ...
    subnets=['subnet-xxx'],          # private subnets
    security_group_ids=['sg-xxx'],   # security group
)
  • Private subnets — no internet access; instances use VPC endpoints to reach S3, ECR, etc.
  • VPC Endpoints (PrivateLink) — access AWS services without traversing the internet (see the sketch after this list):
    • com.amazonaws.region.sagemaker.api
    • com.amazonaws.region.sagemaker.runtime
    • com.amazonaws.region.s3
    • com.amazonaws.region.ecr.api
  • Security Groups — control inbound/outbound traffic for training instances
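
For illustration, a hedged boto3 sketch creating an interface endpoint for the SageMaker runtime (IDs and region are placeholders; S3 typically uses a gateway endpoint associated with route tables instead):

import boto3

ec2 = boto3.client('ec2')

# Interface endpoint so instances in private subnets can call InvokeEndpoint
ec2.create_vpc_endpoint(
    VpcEndpointType='Interface',
    VpcId='vpc-xxx',
    ServiceName='com.amazonaws.eu-west-1.sagemaker.runtime',
    SubnetIds=['subnet-xxx'],
    SecurityGroupIds=['sg-xxx'],
    PrivateDnsEnabled=True
)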

SageMaker Role-Based Access Control

| Feature | Tool |
| --- | --- |
| Control who can create/delete endpoints | IAM policies |
| Control access to specific notebooks | IAM + resource tags |
| Isolate teams in Studio | SageMaker Studio Domains + IAM |
| Audit all API calls | CloudTrail |

Data Privacy

  • SageMaker Ground Truth — anonymise data before sending to human labellers
  • Differential Privacy — add noise during training (not native to SageMaker; use e.g. TensorFlow Privacy)
  • Federated Learning — train across distributed data without centralising it

Cost Optimisation

| Strategy | Saving |
| --- | --- |
| Spot instances for training | Up to 90% vs on-demand |
| Serverless inference for low traffic | Pay per invocation, no idle cost |
| Multi-Model Endpoints | One endpoint serves thousands of models |
| Right-sizing instances | Use Inference Recommender to find the optimal instance type |
| SageMaker Neo | Smaller, faster model = cheaper inference |
| Auto Scaling endpoints | Scale in during low traffic |
| Lifecycle configurations | Auto-stop idle notebooks |
| S3 Intelligent-Tiering | Auto-move infrequently accessed data to cheaper tiers |
| Graviton instances | Up to 40% better price/performance for CPU workloads |
| Delete unused endpoints | Running endpoints incur cost even with no traffic |

Cost Monitoring

  • AWS Cost Explorer — track SageMaker spending by service/tag
  • AWS Budgets — set alerts when spending exceeds threshold
  • Tag everything — project, team, environment tags for cost allocation
  • SageMaker Savings Plans — commit to $/hour usage for discounts

CloudWatch for ML

| Metric | What It Tracks |
| --- | --- |
| Invocations | Number of inference requests |
| InvocationErrors | Failed inference calls |
| ModelLatency | Time the model takes to generate a prediction |
| OverheadLatency | SageMaker infrastructure overhead |
| CPUUtilization | Training/inference instance CPU |
| MemoryUtilization | RAM usage |
| GPUUtilization | GPU usage |
| DiskUtilization | Storage usage |

Custom Metrics from Training:

# In the training script — print metrics to stdout; SageMaker scrapes them from
# the job logs (using the metric_definitions regexes below) and publishes them to CloudWatch
print(f"train:loss={loss:.4f}")
print(f"validation:accuracy={accuracy:.4f}")

Regex in Estimator to capture custom metrics:

metric_definitions = [
    {'Name': 'train:loss', 'Regex': 'train:loss=([0-9\\.]+)'},
    {'Name': 'validation:accuracy', 'Regex': 'validation:accuracy=([0-9\\.]+)'}
]
# Passed to the estimator: Estimator(..., metric_definitions=metric_definitions)

Domain 4 — Exam Scenarios

| Scenario | Answer |
| --- | --- |
| Detect feature distribution shift post-deploy | SageMaker Model Monitor (Data Quality) |
| Detect accuracy degradation with ground truth | SageMaker Model Monitor (Model Quality) |
| Detect if model is becoming biased over time | SageMaker Model Monitor (Bias Drift) |
| Explain why a specific prediction was made | SageMaker Clarify (local SHAP) |
| Find most important features globally | SageMaker Clarify (global SHAP) |
| Detect bias before training begins | SageMaker Clarify (pre-training bias) |
| Auto-trigger retraining on drift | Model Monitor → CloudWatch Alarm → EventBridge → Pipeline |
| Secure training data in transit between nodes | Enable inter-container traffic encryption |
| SageMaker access to S3 without internet | VPC Endpoints (PrivateLink) |
| Audit all SageMaker API calls | CloudTrail |
| Reduce cost for rarely-used models | Multi-Model Endpoint |
| Reduce endpoint cost during off-hours | Auto Scaling (scale in), or Serverless Inference to scale to zero |
| Encrypt model artefacts in S3 | S3 SSE-KMS with CMK |
| Restrict data scientist to specific S3 bucket | IAM policy with resource condition |
| Monitor training in real time | CloudWatch metrics + SageMaker Experiments |