ML Fundamentals — AWS ML Associate

Core ML concepts tested across all domains of MLA-C01.


Types of Machine Learning

```mermaid
graph TD
    ML[Machine Learning] --> SL[Supervised Learning<br/>Labeled data]
    ML --> UL[Unsupervised Learning<br/>No labels]
    ML --> RL[Reinforcement Learning<br/>Agent + Environment]
    ML --> SSL[Semi-Supervised<br/>Some labels]

    SL --> CL[Classification<br/>Predict category]
    SL --> RE[Regression<br/>Predict number]
    UL --> CL2[Clustering<br/>Group similar data]
    UL --> DR[Dimensionality Reduction<br/>Compress features]
    UL --> AD[Anomaly Detection<br/>Find outliers]

    style ML fill:#dbeafe,stroke:#3b82f6
    style SL fill:#dcfce7,stroke:#16a34a
    style UL fill:#fef9c3,stroke:#ca8a04
    style RL fill:#fed7aa,stroke:#ea580c
```
| Type | Labels? | Goal | Examples |
|---|---|---|---|
| Supervised | ✅ Yes | Learn input→output mapping | Email spam, house price |
| Unsupervised | ❌ No | Find hidden structure | Customer segmentation, anomaly detection |
| Reinforcement | Rewards | Maximise cumulative reward | Game playing, robotics, recommendation |
| Semi-supervised | Partial | Use few labels + many unlabeled | Fraud detection with rare labeled fraud |

The ML Workflow

```mermaid
graph LR
    P[Problem Definition] --> D[Data Collection]
    D --> E[EDA + Cleaning]
    E --> F[Feature Engineering]
    F --> M[Model Selection + Training]
    M --> V[Evaluation]
    V -->|Not good enough| F
    V -->|Good enough| Dep[Deployment]
    Dep --> Mon[Monitoring]
    Mon -->|Drift detected| M
```

Bias-Variance Trade-off

Total Error = Bias² + Variance + Irreducible Noise
| Regime | Bias | Variance | Result |
|---|---|---|---|
| Underfitting | High | Low | Model too simple, misses patterns |
| Overfitting | Low | High | Model memorises training data, fails on new data |
| Sweet spot | Low | Low | Generalises well |

How to fix:

| Problem | Fix |
|---|---|
| Underfitting | More complex model, more features, reduce regularisation, train longer |
| Overfitting | More data, regularisation (L1/L2), dropout, early stopping, simpler model, cross-validation |

Regularisation

Prevents overfitting by penalising large weights.

| Type | Formula | Effect | Use When |
|---|---|---|---|
| L1 (Lasso) | Loss + λΣ\|w\| | Drives weights to exactly 0 → feature selection | Many irrelevant features |
| L2 (Ridge) | Loss + λΣw² | Shrinks weights, rarely to 0 → keeps all features | Correlated features |
| Elastic Net | L1 + L2 combined | Balance between both | Default when unsure |

λ (lambda) = regularisation strength. Higher λ = more regularisation = simpler model.
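
The penalties in the table above can be sketched in a few lines. This is a minimal illustration, assuming a toy weight vector and an arbitrary base loss value (the numbers are invented for demonstration):

```python
# Sketch: how L1 and L2 penalties are added to a loss.
# The weights and base_loss below are toy values, not from a real model.

def l1_penalty(weights, lam):
    """L1 (Lasso) penalty: lam * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (Ridge) penalty: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0, 3.0]
base_loss = 1.25
lam = 0.1  # higher lambda = stronger penalty = simpler model

print(base_loss + l1_penalty(weights, lam))  # 1.25 + 0.1*5.5  = 1.8
print(base_loss + l2_penalty(weights, lam))  # 1.25 + 0.1*13.25 = 2.575
```

Note how L2 punishes the large weight (3.0) far more than L1 does, which is why L2 shrinks big weights while L1 tends to zero out small ones.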


Evaluation Metrics

Classification

Confusion Matrix:

```
                Predicted Positive  Predicted Negative
Actual Positive      TP                  FN
Actual Negative      FP                  TN
```

| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes |
| Precision | TP / (TP+FP) | Cost of false positives is high (spam filter) |
| Recall (Sensitivity) | TP / (TP+FN) | Cost of false negatives is high (cancer detection) |
| F1 Score | 2 × (P×R)/(P+R) | Imbalanced classes, balance precision/recall |
| AUC-ROC | Area under ROC curve | Overall discriminative ability, threshold-independent |
| Specificity | TN / (TN+FP) | True negative rate |

AUC = 0.5 → no better than random. AUC = 1.0 → perfect.

When to use what:

  • Cancer screening → maximise Recall (catch all positives, FN = missed cancer)
  • Spam filter → maximise Precision (don't block legit emails, FP = legit marked spam)
  • Fraud detection → F1 (balance both, classes are highly imbalanced)
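
The formulas above computed directly from confusion-matrix counts. The counts here are made-up toy numbers, not from any real classifier:

```python
# Sketch: classification metrics from confusion-matrix counts (toy values).

def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

m = classification_metrics(tp=80, fp=10, fn=20, tn=90)
print(m["precision"])  # 80/90  ≈ 0.889
print(m["recall"])     # 80/100 = 0.8
```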

Regression

| Metric | Formula | Notes |
|---|---|---|
| MAE | Mean \|actual − predicted\| | Robust to outliers, interpretable (same units) |
| MSE | Mean (actual − predicted)² | Penalises large errors heavily |
| RMSE | √MSE | Same units as target, penalises outliers |
| R² | 1 − (SS_res / SS_tot) | 0–1, how much variance explained (higher better) |
| MAPE | Mean \|(actual−pred)/actual\| × 100 | Percentage error, intuitive |
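
These regression metrics computed by hand on toy data (the actual/predicted values below are invented for illustration):

```python
import math

# Sketch: MAE, MSE, RMSE and R² on toy actual/predicted values.

def regression_metrics(actual, predicted):
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)   # total sum of squares
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse),
            "r2": 1 - ss_res / ss_tot}

m = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 10.0])
print(m["mae"])   # 0.5
print(m["mse"])   # 0.375
print(m["r2"])    # 0.925
```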

Ranking / Recommendation

| Metric | Use |
|---|---|
| NDCG | Quality of ranked results |
| MAP | Mean Average Precision across queries |
| Hit Rate | Did top-N recommendations include relevant item? |

Cross-Validation

Gives a more reliable performance estimate by not depending on a single train/test split.

K-Fold Cross-Validation

  • Split data into K equal folds
  • Train on K-1 folds, validate on remaining 1
  • Repeat K times, average results
  • Standard K = 5 or 10

Stratified K-Fold

  • Same as K-fold but maintains class proportion in each fold
  • Use for imbalanced classification

Leave-One-Out (LOO)

  • K = N (every sample is a test set once)
  • Very expensive, use only for tiny datasets
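
The K-fold procedure above sketched in plain Python, without sklearn. The round-robin assignment of shuffled indices to folds is one simple way to get near-equal fold sizes:

```python
import random

# Sketch: generate (train, validation) index lists for K-fold CV.

def k_fold_indices(n_samples, k, seed=42):
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin -> near-equal folds
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val

for train, val in k_fold_indices(10, k=5):
    print(len(train), len(val))  # 8 2 on every round
```

Each sample lands in the validation set exactly once across the K rounds, which is the property cross-validation relies on.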

Data Splits

| Split | Typical % | Purpose |
|---|---|---|
| Training | 60–80% | Model learns from this |
| Validation | 10–20% | Tune hyperparameters, early stopping |
| Test | 10–20% | Final unbiased evaluation; touch once |

Never tune on test set. Never look at test set during development.
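
A minimal three-way split, assuming a 70/15/15 ratio (one choice within the typical ranges above):

```python
import random

# Sketch: shuffle once, then slice into train / validation / test.

def split_dataset(data, seed=0):
    data = data[:]                          # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train = int(n * 0.70)
    n_val = int(n * 0.15)
    train = data[:n_train]
    val = data[n_train:n_train + n_val]
    test = data[n_train + n_val:]           # held out; touched once at the end
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```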


Algorithms Cheat Sheet

Supervised — Classification

| Algorithm | Type | Strengths | Weaknesses |
|---|---|---|---|
| Logistic Regression | Linear | Fast, interpretable, probabilistic output | Can't model non-linear boundaries |
| Decision Tree | Tree | Interpretable, handles non-linearity | Overfits easily |
| Random Forest | Ensemble (Bagging) | Robust, reduces variance, handles missing values | Slower, less interpretable |
| XGBoost / GBM | Ensemble (Boosting) | Best performance on tabular data, handles missing values | Needs tuning, can overfit |
| SVM | Kernel | Effective in high dimensions | Slow on large datasets, needs scaling |
| K-NN | Instance-based | Simple, no training | Slow at inference, needs feature scaling |
| Naive Bayes | Probabilistic | Very fast, great for text | Feature independence assumption |

Supervised — Regression

| Algorithm | Notes |
|---|---|
| Linear Regression | Fast, interpretable baseline |
| Ridge / Lasso | Linear + regularisation |
| Random Forest Regressor | Non-linear, robust |
| XGBoost Regressor | Usually best on tabular data |
| Neural Network | Non-linear, needs lots of data |

Unsupervised

| Algorithm | Type | Notes |
|---|---|---|
| K-Means | Clustering | Must specify K, spherical clusters, Euclidean distance |
| DBSCAN | Clustering | Finds arbitrary shapes, handles noise/outliers |
| Hierarchical | Clustering | Dendrogram, no K needed |
| PCA | Dimensionality Reduction | Linear, maximises variance retained |
| t-SNE | Dimensionality Reduction | Non-linear, for visualisation only (not used for training) |
| Autoencoders | Dimensionality Reduction | Neural network-based, non-linear |
| Isolation Forest | Anomaly Detection | Isolates outliers via random splits |

Deep Learning

| Architecture | Best For |
|---|---|
| MLP (Dense layers) | Tabular data |
| CNN | Images, spatial patterns |
| RNN / LSTM / GRU | Sequential data, time series, NLP (older) |
| Transformer | NLP, images (ViT), state of the art |
| Autoencoder | Anomaly detection, compression |
| GAN | Image generation |
| Diffusion Model | High-quality image generation |

Feature Engineering

Handling Missing Values

| Method | When to Use |
|---|---|
| Drop rows | Missing at random, small % |
| Drop column | >50% missing, column not important |
| Mean/Median imputation | Numerical (prefer median for skewed distributions) |
| Mode imputation | Categorical |
| Forward/Backward fill | Time series |
| Model-based imputation | Complex missing patterns |
| Add "missing" indicator column | Missingness itself is informative |

Encoding Categorical Variables

| Method | Use When |
|---|---|
| One-Hot Encoding | Low cardinality (< 10–15 categories) |
| Label Encoding | Ordinal categories (small, medium, large) |
| Target Encoding | High cardinality; guard against target leakage |
| Embedding | High cardinality in deep learning |
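
The two simplest encodings from the table, sketched by hand. The category lists below are made up for illustration:

```python
# Sketch: one-hot for nominal categories, label encoding for ordinal ones.

def one_hot(value, categories):
    """Binary indicator vector, one position per known category."""
    return [1 if value == c else 0 for c in categories]

def label_encode(value, ordered_categories):
    """Integer rank; only meaningful when the order is real (ordinal)."""
    return ordered_categories.index(value)

colours = ["red", "green", "blue"]    # nominal -> one-hot
sizes = ["small", "medium", "large"]  # ordinal -> label encode

print(one_hot("green", colours))      # [0, 1, 0]
print(label_encode("large", sizes))   # 2
```

Label-encoding a nominal feature (like colour) would invent a fake ordering, which is why one-hot is preferred there.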

Scaling / Normalisation

| Method | Formula | Use When |
|---|---|---|
| Min-Max Scaling | (x − min) / (max − min) → [0,1] | Bounded range, neural networks |
| Standardisation (Z-score) | (x − mean) / std | Normal distribution, SVM, PCA |
| Log Transform | log(x) | Skewed, right-tailed distribution |
| Robust Scaler | Uses IQR | Outliers present |

Always fit scaler on training set only, then transform train + test. Never fit on full dataset — leakage.
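
The fit-on-train-only rule in practice, as a minimal z-score sketch on toy numbers:

```python
import statistics

# Sketch: compute mean/std on the TRAINING set only, then reuse those
# exact values to transform both train and test (avoids leakage).

def fit_scaler(train):
    return statistics.mean(train), statistics.pstdev(train)

def transform(values, mean, std):
    return [(x - mean) / std for x in values]

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mean, std = fit_scaler(train)                  # statistics from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)       # test never influences the fit
print(test_scaled[0])  # 0.0 -> test value 25.0 equals the train mean
```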

Feature Selection

| Method | Type |
|---|---|
| Correlation matrix | Filter — remove highly correlated features |
| Chi-squared test | Filter — categorical features vs target |
| L1 regularisation (Lasso) | Embedded — zeroes out unimportant features |
| Feature importance (tree models) | Embedded — from Random Forest / XGBoost |
| Recursive Feature Elimination (RFE) | Wrapper — iteratively removes weakest |
| PCA | Dimensionality reduction — not true selection |

Class Imbalance

| Method | Notes |
|---|---|
| Oversample minority (SMOTE) | Synthetic minority oversampling |
| Undersample majority | Discards data |
| Class weights | Penalise misclassification of minority more |
| Threshold tuning | Adjust decision threshold from 0.5 |
| Anomaly detection framing | Treat minority class as anomaly |
| Collect more minority data | Best solution if possible |
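
Threshold tuning from the table above, sketched with invented probability scores: lowering the cut-off from 0.5 trades some precision for higher recall on the minority class.

```python
# Sketch: adjusting the decision threshold to raise minority-class recall.
# Scores and labels below are toy values, not from a real model.

def predict(scores, threshold):
    return [1 if s >= threshold else 0 for s in scores]

def recall(preds, labels):
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return tp / (tp + fn)

scores = [0.9, 0.6, 0.45, 0.3, 0.2]  # model probabilities for class 1
labels = [1, 1, 1, 0, 0]             # minority class = 1

print(recall(predict(scores, 0.5), labels))  # 2/3 at the default threshold
print(recall(predict(scores, 0.4), labels))  # 1.0 after lowering it
```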

Hyperparameters vs Parameters

| | Parameters | Hyperparameters |
|---|---|---|
| What | Learned from data (weights, biases) | Set before training |
| Who sets | Training algorithm | Data scientist / HPO |
| Examples | Neural net weights, decision tree splits | Learning rate, tree depth, num_estimators, epochs |

Common Hyperparameters

| Hyperparameter | Effect of increasing |
|---|---|
| Learning rate | Faster but unstable; too high = diverges |
| Batch size | More stable gradients but slower; needs more memory |
| Epochs / iterations | More learning; risk of overfitting |
| Tree depth | More complex model; risk of overfitting |
| n_estimators (trees) | Better ensemble; diminishing returns |
| Dropout rate | More regularisation; too high = underfitting |
| L1/L2 lambda | More regularisation; too high = underfitting |

Neural Network Basics

```
Input Layer → [Hidden Layers] → Output Layer
Each neuron: z = Wx + b → a = activation(z)
```

Activation Functions

| Function | Range | Use |
|---|---|---|
| ReLU | [0, ∞) | Default for hidden layers, fast |
| Leaky ReLU | (−∞, ∞) | Prevents dying ReLU |
| Sigmoid | (0, 1) | Binary classification output |
| Softmax | (0, 1), sums to 1 | Multi-class output |
| Tanh | (−1, 1) | Hidden layers, zero-centred |
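
A minimal sketch of these activations plus one neuron's forward pass (z = Wx + b → a = activation(z)); the weights and input are toy values:

```python
import math

# Sketch implementations of common activations and one neuron's forward pass.

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def softmax(zs):
    exps = [math.exp(z - max(zs)) for z in zs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]            # sums to 1

# One neuron: z = w·x + b, then ReLU (toy weights)
w, b, x = [0.5, -1.0], 0.1, [2.0, 0.5]
z = sum(wi * xi for wi, xi in zip(w, x)) + b    # 1.0 - 0.5 + 0.1 = 0.6
print(relu(z))  # 0.6
```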

Optimisers

| Optimiser | Notes |
|---|---|
| SGD | Simple, may be slow, needs LR tuning |
| Adam | Adaptive LR, fast convergence, default choice |
| RMSprop | Good for RNNs |
| AdaGrad | Sparse gradients, NLP |

Loss Functions

| Task | Loss |
|---|---|
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
| Regression | MSE / MAE / Huber |
| Object detection | Focal loss |

Transfer Learning

  • Take a model pre-trained on large dataset → fine-tune on your smaller dataset
  • Fine-tuning: unfreeze some or all layers, train with small learning rate
  • Feature extraction: freeze all layers, only train new classification head
  • Use when: limited labelled data, similar domain to pre-trained task

Ensemble Methods

| Method | How | Examples |
|---|---|---|
| Bagging | Train N models on bootstrap samples, average | Random Forest |
| Boosting | Train models sequentially, each corrects previous errors | XGBoost, AdaBoost, LightGBM |
| Stacking | Train meta-model on predictions of base models | Custom ensembles |
| Voting | Hard: majority vote; Soft: average probabilities | Simple ensembles |
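
Soft voting from the table, sketched by averaging per-class probabilities; the three probability vectors below are invented:

```python
# Sketch: soft voting = element-wise average of base models' class probabilities.

def soft_vote(prob_lists):
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]

# Three models' probabilities for classes [A, B] (toy values)
probs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]
avg = soft_vote(probs)
print(avg)                                   # class A averages 2/3, class B 1/3
print(max(range(2), key=lambda c: avg[c]))   # 0 -> class A wins
```

Hard voting would instead count each model's argmax (here A, B, A), which also picks A; the two schemes can disagree when probabilities are close.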

XGBoost dominates tabular-data scenarios on the AWS exam. When in doubt with structured/tabular data, pick XGBoost.