ML Fundamentals
Core ML concepts tested across all domains of MLA-C01.
Types of Machine Learning
```mermaid
graph TD
    ML[Machine Learning] --> SL[Supervised Learning<br/>Labeled data]
    ML --> UL[Unsupervised Learning<br/>No labels]
    ML --> RL[Reinforcement Learning<br/>Agent + Environment]
    ML --> SSL[Semi-Supervised<br/>Some labels]
    SL --> CL[Classification<br/>Predict category]
    SL --> RE[Regression<br/>Predict number]
    UL --> CL2[Clustering<br/>Group similar data]
    UL --> DR[Dimensionality Reduction<br/>Compress features]
    UL --> AD[Anomaly Detection<br/>Find outliers]
    style ML fill:#dbeafe,stroke:#3b82f6
    style SL fill:#dcfce7,stroke:#16a34a
    style UL fill:#fef9c3,stroke:#ca8a04
    style RL fill:#fed7aa,stroke:#ea580c
```
| Type | Labels? | Goal | Examples |
|---|---|---|---|
| Supervised | ✅ Yes | Learn input→output mapping | Email spam, house price |
| Unsupervised | ❌ No | Find hidden structure | Customer segmentation, anomaly detection |
| Reinforcement | Rewards | Maximize cumulative reward | Game playing, robotics, recommendation |
| Semi-supervised | Partial | Use few labels + many unlabeled | Fraud detection with rare labeled fraud |
The ML Workflow
```mermaid
graph LR
    P[Problem Definition] --> D[Data Collection]
    D --> E[EDA + Cleaning]
    E --> F[Feature Engineering]
    F --> M[Model Selection + Training]
    M --> V[Evaluation]
    V -->|Not good enough| F
    V -->|Good enough| Dep[Deployment]
    Dep --> Mon[Monitoring]
    Mon -->|Drift detected| M
```
Bias-Variance Trade-off
Total Error = Bias² + Variance + Irreducible Noise
| Regime | Bias | Variance | Result |
|---|---|---|---|
| Underfitting | High | Low | Model too simple, misses patterns |
| Overfitting | Low | High | Model memorises training data, fails on new data |
| Sweet spot | Low | Low | Generalises well |
How to fix:
| Problem | Fix |
|---|---|
| Underfitting | More complex model, more features, reduce regularisation, train longer |
| Overfitting | More data, regularisation (L1/L2), dropout, early stopping, simpler model, cross-validation |
Regularisation
Prevents overfitting by penalising large weights.
| Type | Formula | Effect | Use When |
|---|---|---|---|
| L1 (Lasso) | Loss + λΣ\|w\| | Drives weights to exactly 0 → feature selection | Many irrelevant features |
| L2 (Ridge) | Loss + λΣw² | Shrinks weights, rarely to 0 → keeps all features | Correlated features |
| Elastic Net | L1 + L2 combined | Balance between both | Default when unsure |
λ (lambda) = regularisation strength. Higher λ = more regularisation = simpler model.
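A minimal sketch of the L1-vs-L2 effect using scikit-learn, on hypothetical synthetic data where only the first two of five features matter (note that scikit-learn calls λ `alpha`):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first 2 of 5 features actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# L1 drives irrelevant coefficients to exactly zero (feature selection)
lasso = Lasso(alpha=0.1).fit(X, y)
# L2 only shrinks coefficients — they stay small but non-zero
ridge = Ridge(alpha=1.0).fit(X, y)

n_zero_l1 = int(np.sum(lasso.coef_ == 0))
n_zero_l2 = int(np.sum(ridge.coef_ == 0))
```

Inspecting `lasso.coef_` vs `ridge.coef_` shows the table's claim directly: the Lasso zeroes the three irrelevant features, the Ridge merely shrinks them.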
Evaluation Metrics
Classification
Confusion Matrix:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | TP | FN |
| Actual Negative | FP | TN |
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (TP+TN) / Total | Balanced classes |
| Precision | TP / (TP+FP) | Cost of false positives is high (spam filter) |
| Recall (Sensitivity) | TP / (TP+FN) | Cost of false negatives is high (cancer detection) |
| F1 Score | 2 × (P×R)/(P+R) | Imbalanced classes, balance precision/recall |
| AUC-ROC | Area under ROC curve | Overall discriminative ability, threshold-independent |
| Specificity | TN / (TN+FP) | True negative rate |
AUC = 0.5 → no better than random. AUC = 1.0 → perfect.
When to use what:
- Cancer screening → maximise Recall (catch all positives; FN = missed cancer)
- Spam filter → maximise Precision (don't block legit emails; FP = legit marked spam)
- Fraud detection → F1 (balance both; classes are highly imbalanced)
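A worked example of the formulas above, using hypothetical confusion-matrix counts (1000 cases, 100 actual positives):

```python
# Hypothetical screening counts
TP, FP, FN, TN = 80, 10, 20, 890

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)   # of flagged cases, how many were real
recall      = TP / (TP + FN)   # of real cases, how many we caught
f1          = 2 * precision * recall / (precision + recall)
specificity = TN / (TN + FP)   # true negative rate
```

Here accuracy is 0.97 yet recall is only 0.80 — the usual argument for not trusting accuracy alone on imbalanced data.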
Regression
| Metric | Formula | Notes |
|---|---|---|
| MAE | Mean \|actual − predicted\| | Robust to outliers, interpretable (same units as target) |
| MSE | Mean (actual − predicted)² | Penalises large errors heavily |
| RMSE | √MSE | Same units as target, penalises outliers |
| R² | 1 − (SS_res / SS_tot) | Proportion of variance explained; ≤ 1, higher is better (can go negative for very poor fits) |
| MAPE | Mean \|(actual − pred)/actual\| × 100 | Percentage error, intuitive; undefined when actual = 0 |
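The regression metrics in the table, computed with numpy on a small made-up example:

```python
import numpy as np

actual = np.array([3.0, 5.0, 7.0, 9.0])
pred   = np.array([2.5, 5.0, 8.0, 8.0])

mae  = np.mean(np.abs(actual - pred))        # same units as the target
mse  = np.mean((actual - pred) ** 2)         # squares penalise big misses
rmse = np.sqrt(mse)                          # back to target units

ss_res = np.sum((actual - pred) ** 2)        # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                     # variance explained
```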
Ranking / Recommendation
| Metric | Use |
|---|---|
| NDCG | Quality of ranked results |
| MAP | Mean Average Precision across queries |
| Hit Rate | Did top-N recommendations include relevant item? |
Cross-Validation
Reduces overfitting to a specific train/test split.
K-Fold Cross-Validation
1. Split the data into K equal folds
2. Train on K−1 folds, validate on the remaining fold
3. Repeat K times, average the results

Standard K = 5 or 10.
Stratified K-Fold
- Same as K-fold but maintains class proportions in each fold
- Use for imbalanced classification
Leave-One-Out (LOO)
- K = N (every sample is held out exactly once)
- Very expensive; use only for tiny datasets
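The K-fold steps above can be sketched in pure numpy (a toy "model" that predicts the training-set mean, scored with MSE — the data and model are illustrative assumptions):

```python
import numpy as np

# Toy data: 10 targets, K = 5 folds
rng = np.random.default_rng(42)
y = rng.normal(size=10)
K = 5

indices = rng.permutation(len(y))      # shuffle once before splitting
folds = np.array_split(indices, K)     # K roughly equal folds

scores = []
for k in range(K):
    val_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    y_hat = y[train_idx].mean()        # "train" on the K-1 folds
    scores.append(np.mean((y[val_idx] - y_hat) ** 2))

cv_score = float(np.mean(scores))      # average over the K folds
```

In practice scikit-learn's `KFold` / `StratifiedKFold` with `cross_val_score` does the same bookkeeping.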
Data Splits
| Split | Typical % | Purpose |
|---|---|---|
| Training | 60–80% | Model learns from this |
| Validation | 10–20% | Tune hyperparameters, early stopping |
| Test | 10–20% | Final unbiased evaluation — touch once |
Never tune on test set. Never look at test set during development.
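A minimal sketch of a 70/15/15 split via a single shuffled index permutation (the data and percentages are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # hypothetical dataset

# One shuffle, then three contiguous slices
idx = rng.permutation(len(X))
n_train = int(0.70 * len(X))
n_val   = int(0.15 * len(X))

train_idx = idx[:n_train]
val_idx   = idx[n_train:n_train + n_val]
test_idx  = idx[n_train + n_val:]      # held out — touch once at the end
```

For imbalanced classification, prefer a stratified split (e.g. scikit-learn's `train_test_split` with `stratify=y`) so each split keeps the class proportions.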
Algorithms Cheat Sheet
Supervised — Classification
| Algorithm | Type | Strengths | Weaknesses |
|---|---|---|---|
| Logistic Regression | Linear | Fast, interpretable, probabilistic output | Can't model non-linear boundaries |
| Decision Tree | Tree | Interpretable, handles non-linear | Overfits easily |
| Random Forest | Ensemble (Bagging) | Robust, reduces variance, handles missing values | Slower, less interpretable |
| XGBoost / GBM | Ensemble (Boosting) | Best performance on tabular data, handles missing | Needs tuning, can overfit |
| SVM | Kernel | Effective in high dimensions | Slow on large datasets, needs scaling |
| K-NN | Instance-based | Simple, no training | Slow at inference, needs feature scaling |
| Naive Bayes | Probabilistic | Very fast, great for text | Feature independence assumption |
Supervised — Regression
| Algorithm | Notes |
|---|---|
| Linear Regression | Fast, interpretable baseline |
| Ridge / Lasso | Linear + regularisation |
| Random Forest Regressor | Non-linear, robust |
| XGBoost Regressor | Usually best on tabular |
| Neural Network | Non-linear, needs lots of data |
Unsupervised
| Algorithm | Type | Notes |
|---|---|---|
| K-Means | Clustering | Must specify K, spherical clusters, Euclidean distance |
| DBSCAN | Clustering | Finds arbitrary shapes, handles noise/outliers |
| Hierarchical | Clustering | Dendrogram, no K needed |
| PCA | Dimensionality Reduction | Linear, maximises variance retained |
| t-SNE | Dimensionality Reduction | Non-linear, for visualisation only (not used for training) |
| Autoencoders | Dimensionality Reduction | Neural network-based, non-linear |
| Isolation Forest | Anomaly Detection | Isolates outliers via random splits |
Deep Learning
| Architecture | Best For |
|---|---|
| MLP (Dense layers) | Tabular data |
| CNN | Images, spatial patterns |
| RNN / LSTM / GRU | Sequential data, time series, NLP (older) |
| Transformer | NLP, images (ViT), state of the art |
| Autoencoder | Anomaly detection, compression |
| GAN | Image generation |
| Diffusion Model | High-quality image generation |
Feature Engineering
Handling Missing Values
| Method | When to Use |
|---|---|
| Drop rows | Missing at random, small % |
| Drop column | >50% missing, column not important |
| Mean/Median imputation | Numerical — mean if roughly symmetric, median if skewed |
| Mode imputation | Categorical |
| Forward/Backward fill | Time series |
| Model-based imputation | Complex missing patterns |
| Add "missing" indicator column | Missingness itself is informative |
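A pandas sketch combining three rows of the table — median imputation, mode imputation, and an indicator column (the tiny DataFrame is made up for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50_000, np.nan, 72_000, np.nan, 61_000],
    "city":   ["NYC", "LA", None, "NYC", "LA"],
})

# Keep the fact that the value was missing — it may be informative
df["income_missing"] = df["income"].isna().astype(int)

# Median for the numeric column, mode for the categorical one
df["income"] = df["income"].fillna(df["income"].median())
df["city"]   = df["city"].fillna(df["city"].mode()[0])
```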
Encoding Categorical Variables
| Method | Use When |
|---|---|
| One-Hot Encoding | Low cardinality (< 10–15 categories) |
| Label Encoding | Ordinal categories (small, medium, large) |
| Target Encoding | High cardinality — use out-of-fold encoding to avoid target leakage |
| Embedding | High cardinality in deep learning |
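A quick pandas sketch of the first two rows: an explicit ordinal mapping for `size` (order matters, so a plain alphabetical label encoder would get it wrong) and one-hot for the nominal `color` column. The columns and mapping are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"size":  ["small", "large", "medium", "small"],
                   "color": ["red", "blue", "red", "green"]})

# Ordinal (label) encoding — spell the order out explicitly
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_enc"] = df["size"].map(size_order)

# One-hot encoding for the low-cardinality nominal column
df = pd.get_dummies(df, columns=["color"], prefix="color")
```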
Scaling / Normalisation
| Method | Formula | Use When |
|---|---|---|
| Min-Max Scaling | (x − min) / (max − min) → [0,1] | Bounded range, neural networks |
| Standardisation (Z-score) | (x − mean) / std | Normal distribution, SVM, PCA |
| Log Transform | log(x) | Skewed, right-tailed distribution |
| Robust Scaler | Uses IQR | Outliers present |
Always fit scaler on training set only, then transform train + test. Never fit on full dataset — leakage.
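The leakage rule in numpy terms — the statistics come from the training split only, and the same numbers are applied to both splits (synthetic data for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X_train = rng.normal(loc=10, scale=3, size=(80, 2))
X_test  = rng.normal(loc=10, scale=3, size=(20, 2))

# Fit the scaler on the TRAINING data only...
mean = X_train.mean(axis=0)
std  = X_train.std(axis=0)

# ...then apply those same statistics to both splits
X_train_scaled = (X_train - mean) / std
X_test_scaled  = (X_test - mean) / std   # never recompute from test data
```

This is exactly what scikit-learn's `StandardScaler` does with `fit` on train followed by `transform` on train and test.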
Feature Selection
| Method | Type |
|---|---|
| Correlation matrix | Filter — remove highly correlated features |
| Chi-squared test | Filter — categorical features vs target |
| L1 regularisation (Lasso) | Embedded — zeroes out unimportant features |
| Feature importance (tree models) | Embedded — from Random Forest / XGBoost |
| Recursive Feature Elimination (RFE) | Wrapper — iteratively removes weakest |
| PCA | Dimensionality reduction — not true selection |
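A sketch of the correlation-matrix filter: for each pair of features above a threshold (0.9 here — the threshold and data are assumptions), drop the later column of the pair.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=100)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=100)  # near-duplicate of a
df["c"] = rng.normal(size=100)                            # independent

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
df_reduced = df.drop(columns=to_drop)
```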
Class Imbalance
| Method | Notes |
|---|---|
| Oversample minority (SMOTE) | Synthetic minority oversampling |
| Undersample majority | Loses data |
| Class weights | Penalise misclassification of minority more |
| Threshold tuning | Adjust decision threshold from 0.5 |
| Anomaly detection framing | Treat minority class as anomaly |
| Collect more minority data | Best solution if possible |
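The class-weights row, made concrete: the "balanced" heuristic (the same formula behind scikit-learn's `class_weight="balanced"`) weights each class by `n_samples / (n_classes * class_count)`, shown here on a hypothetical 90/10 label split.

```python
from collections import Counter

# Imbalanced labels: 90 negatives, 10 positives
y = [0] * 90 + [1] * 10

counts = Counter(y)
n, k = len(y), len(counts)
class_weight = {c: n / (k * counts[c]) for c in counts}
# The minority class ends up with a 9x larger misclassification penalty
```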
Hyperparameters vs Parameters
| | Parameters | Hyperparameters |
|---|---|---|
| What | Learned from data (weights, biases) | Set before training |
| Who sets | Training algorithm | Data scientist / HPO |
| Examples | Neural net weights, decision tree splits | Learning rate, tree depth, num_estimators, epochs |
Common Hyperparameters
| Hyperparameter | Effect of increasing |
|---|---|
| Learning rate | Faster but unstable; too high = diverges |
| Batch size | More stable gradients but slower; needs more memory |
| Epochs / iterations | More learning; risk of overfitting |
| Tree depth | More complex model; risk of overfitting |
| n_estimators (trees) | Better ensemble; diminishing returns |
| Dropout rate | More regularisation; too high = underfitting |
| L1/L2 lambda | More regularisation; too high = underfitting |
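The learning-rate row in miniature: gradient descent on the toy objective f(w) = w² (gradient 2w, an assumption chosen for simplicity). A moderate rate converges; a rate past the stability limit overshoots further on every step.

```python
# Minimise f(w) = w^2 with gradient descent; the minimum is at w = 0
def gradient_descent(lr, steps=50, w0=1.0):
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w        # gradient of w^2 is 2w
    return w

w_good = gradient_descent(lr=0.1)   # shrinks towards 0 each step
w_bad  = gradient_descent(lr=1.5)   # |w| doubles each step — diverges
```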
Neural Network Basics
Input Layer → [Hidden Layers] → Output Layer
Each neuron: z = Wx + b → a = activation(z)
Activation Functions
| Function | Range | Use |
|---|---|---|
| ReLU | [0, ∞) | Default hidden layers, fast |
| Leaky ReLU | (−∞, ∞) | Prevents dying ReLU |
| Sigmoid | (0, 1) | Binary classification output |
| Softmax | (0, 1), sums to 1 | Multi-class output |
| Tanh | (−1, 1) | Hidden layers, zero-centred |
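A minimal numpy sketch of the neuron equation above (z = Wx + b → a = activation(z)) together with three of the activations from the table; the weights and inputs are made-up numbers:

```python
import numpy as np

def relu(z):    return np.maximum(0, z)
def sigmoid(z): return 1 / (1 + np.exp(-z))
def softmax(z):
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()

# One neuron: z = Wx + b, then an activation
x = np.array([0.5, -1.0, 2.0])
W = np.array([0.2, 0.4, -0.1])
b = 0.1
z = W @ x + b                          # -0.4 for these numbers
a = relu(z)                            # clipped to 0

probs = softmax(np.array([2.0, 1.0, 0.1]))   # valid distribution
```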
Optimisers
| Optimiser | Notes |
|---|---|
| SGD | Simple, may be slow, needs LR tuning |
| Adam | Adaptive LR, fast convergence, default choice |
| RMSprop | Good for RNNs |
| AdaGrad | Sparse gradients, NLP |
Loss Functions
| Task | Loss |
|---|---|
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
| Regression | MSE / MAE / Huber |
| Object detection | Focal loss |
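Binary cross-entropy in numpy, with the standard clipping trick to avoid log(0); the labels and probabilities are made up. Confident correct predictions give a small loss, while a constant 0.5 "coin-flip" gives ln 2 ≈ 0.693 per sample.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    p = np.clip(y_prob, eps, 1 - eps)          # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = np.array([1, 0, 1, 1])
confident = binary_cross_entropy(y_true, np.array([0.99, 0.01, 0.98, 0.97]))
unsure    = binary_cross_entropy(y_true, np.array([0.5, 0.5, 0.5, 0.5]))
```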
Transfer Learning
- Take a model pre-trained on a large dataset → fine-tune on your smaller dataset
- Fine-tuning: unfreeze some or all layers, train with a small learning rate
- Feature extraction: freeze all layers, only train a new classification head
- Use when: limited labelled data, similar domain to the pre-trained task
Ensemble Methods
| Method | How | Examples |
|---|---|---|
| Bagging | Train N models on bootstrap samples, average | Random Forest |
| Boosting | Train models sequentially, each corrects previous errors | XGBoost, AdaBoost, LightGBM |
| Stacking | Train meta-model on predictions of base models | Custom ensembles |
| Voting | Hard: majority vote; Soft: average probabilities | Simple ensembles |
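The voting row, sketched in numpy on the predicted class probabilities of three hypothetical base models (two samples, two classes):

```python
import numpy as np

# Class-probability outputs of three base models (rows = samples)
m1 = np.array([[0.9, 0.1], [0.4, 0.6]])
m2 = np.array([[0.8, 0.2], [0.3, 0.7]])
m3 = np.array([[0.3, 0.7], [0.2, 0.8]])

# Soft voting: average the probabilities, then argmax
avg = (m1 + m2 + m3) / 3
soft_pred = avg.argmax(axis=1)

# Hard voting: each model casts its own argmax, majority wins
votes = np.stack([m.argmax(axis=1) for m in (m1, m2, m3)])
hard_pred = np.array([np.bincount(col).argmax() for col in votes.T])
```

Soft voting uses the models' confidence, so a very confident dissenter can outvote two lukewarm models — hard voting cannot.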
XGBoost dominates tabular-data scenarios on the AWS exam. When in doubt for structured/tabular data, pick XGBoost.