MLSemester 8

Unit 4: Unsupervised Learning & Evaluation

K-Means Clustering, Ensemble Methods (Boosting, Bagging, Random Forests), and model evaluation metrics โ€” accuracy, confusion matrix, precision, recall, F1, ROC-AUC, MAD.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Unsupervised Learning: Clustering: K-means. Ensemble Methods: Boosting, Bagging, Random Forests. Evaluation: Performance measurement of models in terms of accuracy, confusion matrix, precision & recall, F1-score, Receiver Operating Characteristic (ROC) curve and AUC, Median Absolute Deviation (MAD), Distribution of errors.


๐ŸŽฏ PYQ Analysis for Unit 4

High Priority Topics (15 marks questions)

  1. Bagging vs Boosting vs Random Forest โ€” (2024: 15 marks, 2023: 15 marks)
  2. ROC and AUC โ€” (2024: 15 marks, 2023: 7.5 marks, 2022: 7.5 marks)
  3. Confusion Matrix โ€” (2023: 7.5 marks, 2022: 7.5 marks)
  4. K-Means Algorithm with Example โ€” (2022: 7.5 marks)
  5. Random Forest Algorithm โ€” (2022: 7.5 marks)

Medium Priority Topics (Short answers)

  1. Boosting โ€” 2022 (2.5 marks)
  2. Median Absolute Deviation (MAD) โ€” 2023 (2.5 marks), 2022 (2.5 marks)
  3. F1-Score โ€” 2023 (2.5 marks)

Section 1: Unsupervised Learning โ€” Overview

1.1 Definition

Unsupervised Learning is a type of ML where the model is given only input data (X) โ€” no labels (Y). The model tries to find hidden patterns, structures, or groupings on its own.

Input:   Xโ‚, Xโ‚‚, Xโ‚ƒ, ..., Xโ‚˜   (no labels)
           โ”‚
           โ–ผ
     ML Algorithm
           โ”‚
           โ–ผ
  Output: Groups / Patterns / Structure

1.2 Types of Unsupervised Learning

TypeGoalExample
ClusteringGroup similar data pointsCustomer segmentation
AssociationFind co-occurrence rulesMarket basket analysis
Dimensionality ReductionReduce featuresPCA (covered in Unit 2)
Anomaly DetectionFind outliersFraud detection

Section 2: K-Means Clustering

PYQ: Explain the k-Means Algorithm with an example. (2022, 7.5 marks)

2.1 What is Clustering?

Definition:

Clustering is the process of grouping similar data points together (into clusters) such that:

  • Points within the same cluster are as similar as possible.
  • Points in different clusters are as different as possible.

Analogy:

Imagine sorting a basket of mixed fruits into groups โ€” oranges together, apples together, bananas together โ€” without being told the fruit names. You group them by similarity (colour, shape, size).


2.2 What is K-Means?

Definition:

K-Means is the most popular clustering algorithm. It partitions a dataset of m points into k clusters, where each point belongs to the cluster with the nearest mean (centroid).

"K" = number of clusters (chosen by the user before training).


2.3 K-Means Algorithm โ€” Step by Step

Input: Dataset X, number of clusters k

Step 1: Initialize
  โ”€ Randomly select k data points as initial centroids
    ฮผโ‚, ฮผโ‚‚, ..., ฮผโ‚–

Step 2: Assignment (E-step)
  โ”€ For each data point xแตข:
      Assign it to the nearest centroid:
      cแตข = argmin_j  ||xแตข - ฮผโฑผ||ยฒ

Step 3: Update (M-step)
  โ”€ Recompute each centroid as the mean of its assigned points:
      ฮผโฑผ = (1/|Cโฑผ|) ฮฃ xแตข   for all xแตข in cluster j

Step 4: Repeat
  โ”€ Repeat Steps 2 & 3 until centroids stop moving
    (convergence โ€” assignments don't change)

Output: k clusters with final centroids

Flow Diagram:

  Initialize k centroids
         โ”‚
         โ–ผ
  Assign each point to
  nearest centroid
         โ”‚
         โ–ผ
  Recompute centroids
  (mean of cluster)
         โ”‚
         โ–ผ
  Centroids changed?
    /         \
  Yes          No
   โ”‚            โ”‚
   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ–ผ
          โ”‚   DONE โœ“
          โ–ผ
  (loop back to assign)

2.4 Worked Example

Dataset (6 points, 2D), k=2:

Points:  A(1,1), B(1,2), C(2,1), D(5,4), E(6,5), F(5,5)

Iteration 1:

Step 1 โ€” Initialize centroids: Pick A(1,1) and D(5,4) randomly.

ฮผโ‚ = (1,1)    ฮผโ‚‚ = (5,4)

Step 2 โ€” Assign each point:

Point A(1,1): d(ฮผโ‚)=0.00, d(ฮผโ‚‚)=5.00  โ†’ Cluster 1
Point B(1,2): d(ฮผโ‚)=1.00, d(ฮผโ‚‚)=4.24  โ†’ Cluster 1
Point C(2,1): d(ฮผโ‚)=1.00, d(ฮผโ‚‚)=3.61  โ†’ Cluster 1
Point D(5,4): d(ฮผโ‚)=5.00, d(ฮผโ‚‚)=0.00  โ†’ Cluster 2
Point E(6,5): d(ฮผโ‚)=6.40, d(ฮผโ‚‚)=1.41  โ†’ Cluster 2
Point F(5,5): d(ฮผโ‚)=5.66, d(ฮผโ‚‚)=1.00  โ†’ Cluster 2

Step 3 โ€” Recompute centroids:

Cluster 1 = {A, B, C}
  ฮผโ‚ = ((1+1+2)/3, (1+2+1)/3) = (4/3, 4/3) โ‰ˆ (1.33, 1.33)

Cluster 2 = {D, E, F}
  ฮผโ‚‚ = ((5+6+5)/3, (4+5+5)/3) = (16/3, 14/3) โ‰ˆ (5.33, 4.67)

Iteration 2 โ€” Reassign with new centroids:

Distances from new ฮผโ‚=(1.33,1.33) and ฮผโ‚‚=(5.33,4.67):

A(1,1) โ†’ Cluster 1  โœ“
B(1,2) โ†’ Cluster 1  โœ“
C(2,1) โ†’ Cluster 1  โœ“
D(5,4) โ†’ Cluster 2  โœ“
E(6,5) โ†’ Cluster 2  โœ“
F(5,5) โ†’ Cluster 2  โœ“

Assignments unchanged โ†’ Convergence!

Final Clusters:

Cluster 1: {A(1,1), B(1,2), C(2,1)}   centroid โ‰ˆ (1.33, 1.33)
Cluster 2: {D(5,4), E(6,5), F(5,5)}   centroid โ‰ˆ (5.33, 4.67)

2.5 Choosing K โ€” The Elbow Method

The user must specify k before running K-Means. To find the best k:

Elbow Method:

  • Run K-Means for k = 1, 2, 3, ..., n.
  • For each k, compute WCSS (Within-Cluster Sum of Squares):
WCSS = ฮฃ  ฮฃ  ||xแตข - ฮผโฑผ||ยฒ
       j  xแตขโˆˆCโฑผ
  • Plot k vs WCSS. The "elbow" (sharp bend) gives the best k.
WCSS
โ”‚\
โ”‚ \
โ”‚  \
โ”‚   \โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€    โ† elbow here (best k = 3)
โ”‚             (flat)
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ k
  1  2  3  4  5  6

2.6 Advantages and Disadvantages

Advantages:

โœ… Simple and fast โ€” O(nยทkยทi) where i = iterations.
โœ… Scales well to large datasets.
โœ… Easy to implement and interpret.

Disadvantages:

โŒ Must specify k in advance.
โŒ Sensitive to initial centroid placement (different runs โ†’ different results).
โŒ Assumes spherical, equally-sized clusters.
โŒ Sensitive to outliers (outliers distort centroids).
โŒ Fails on non-convex cluster shapes.

Fix for initialization: Use K-Means++ โ€” smarter initialization that spreads centroids apart.

Applications:

  • Customer segmentation
  • Document clustering
  • Image compression
  • Anomaly detection

Section 3: Ensemble Methods

PYQ: Compare and contrast Boosting, Bagging, and Random Forest as Ensemble Methods in Unsupervised Learning. Discuss how these methods combine multiple models and improve the overall performance. Provide insights into scenarios where each ensemble method is most effective. (2024, 15 marks)

3.1 What are Ensemble Methods?

Definition:

Ensemble Methods combine multiple individual models (weak learners) to produce one stronger, more accurate model.

Core Idea: Wisdom of the crowd โ€” many weak models voting together beat one strong model.

     Model 1 โ”€โ”€โ”€โ”€โ”€โ”
     Model 2 โ”€โ”€โ”€โ”€โ”€โ”คโ”€โ”€ Combine โ”€โ”€โ–บ Final Prediction
     Model 3 โ”€โ”€โ”€โ”€โ”€โ”˜

Why do they work?

  • Individual models may make different errors.
  • Combining them cancels out individual errors.
  • Result: lower variance, lower bias, higher accuracy.

Three main strategies:

MethodCore IdeaModels Trained
BaggingTrain models in parallel on random subsetsIndependent
BoostingTrain models sequentially, each fixing previous errorsSequential
Random ForestBagging + random feature selectionIndependent

3.2 Bagging (Bootstrap Aggregating)

Definition:

Bagging creates multiple models by training each on a random bootstrap sample (sample with replacement) of the training data. Final prediction = majority vote (classification) or average (regression).

"Bootstrap" = random sampling with replacement.

Algorithm:

Input: Training dataset D, number of models B

For b = 1 to B:
  1. Draw a bootstrap sample Dแตฆ from D (with replacement).
  2. Train a model Mแตฆ on Dแตฆ.

Prediction for new point X:
  Classification: majority vote of Mโ‚(X), Mโ‚‚(X), ..., Mแดฎ(X)
  Regression:     average of Mโ‚(X), Mโ‚‚(X), ..., Mแดฎ(X)

Diagram:

        Original Dataset (D)
               โ”‚
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ–ผ          โ–ผ          โ–ผ
 Sample Dโ‚  Sample Dโ‚‚  Sample Dโ‚ƒ   (bootstrap samples)
    โ”‚          โ”‚          โ”‚
  Model 1   Model 2   Model 3       (trained independently)
    โ”‚          โ”‚          โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
               โ–ผ
          Majority Vote
               โ”‚
               โ–ผ
         Final Prediction

Out-of-Bag (OOB) Error:

  • Each bootstrap sample leaves out ~37% of training data.
  • These left-out points can be used as a validation set for free.
  • Called Out-of-Bag error โ€” a built-in cross-validation estimate.

Advantages:

โœ… Reduces variance (prevents overfitting).
โœ… Parallel training (fast).
โœ… Works well with high-variance models (e.g., deep decision trees).

Disadvantages:

โŒ Does not reduce bias.
โŒ Less interpretable than a single model.


3.3 Boosting

PYQ: Explain Boosting. (2022, 2.5 marks)

Definition:

Boosting trains models sequentially. Each new model focuses on correcting the mistakes of the previous model by giving higher weight to misclassified samples.

Core Idea: Start weak, get stronger step by step.

Algorithm (general):

Input: Training dataset D, number of models T

1. Initialize equal weights for all training points: wแตข = 1/m

For t = 1 to T:
  2. Train model Mโ‚œ on weighted dataset.
  3. Calculate error: ฮตโ‚œ = ฮฃ wแตข ร— I(Mโ‚œ(xแตข) โ‰  yแตข)
  4. Compute model weight: ฮฑโ‚œ = 0.5 ร— ln[(1-ฮตโ‚œ)/ฮตโ‚œ]
  5. Update sample weights:
       Misclassified: wแตข โ† wแตข ร— e^(ฮฑโ‚œ)   (increase weight)
       Correct:       wแตข โ† wแตข ร— e^(-ฮฑโ‚œ)  (decrease weight)
  6. Normalize weights.

Final prediction:
  ลถ = sign[ ฮฃ ฮฑโ‚œ ร— Mโ‚œ(X) ]

Diagram:

  Dataset
     โ”‚
     โ–ผ
  Model 1 โ”€โ”€โ–บ Errors โ”€โ”€โ–บ Increase weights of errors
     โ”‚
     โ–ผ
  Model 2 โ”€โ”€โ–บ Errors โ”€โ”€โ–บ Increase weights of errors
     โ”‚
     โ–ผ
  Model 3 โ”€โ”€โ–บ ...
     โ”‚
     โ–ผ
  Weighted Vote โ”€โ”€โ–บ Final Prediction

Popular Boosting Algorithms:

AlgorithmDescription
AdaBoostOriginal boosting โ€” adjusts sample weights
Gradient BoostingFits each model to residual errors
XGBoostOptimized gradient boosting with regularization
LightGBMFast, scalable gradient boosting

Advantages:

โœ… Reduces both bias and variance.
โœ… Often achieves best accuracy on tabular data.
โœ… Handles complex patterns.

Disadvantages:

โŒ Sequential โ€” cannot parallelize easily.
โŒ Prone to overfitting if too many rounds.
โŒ Sensitive to noisy data and outliers (amplifies errors).


3.4 Bagging vs Boosting

PYQ: Differentiate between Bagging and Boosting techniques. (2023, 15 marks)

FeatureBaggingBoosting
TrainingParallelSequential
FocusReduces VarianceReduces Bias
WeightsAll samples equalMisclassified get higher weight
Error FixedOverfittingUnderfitting
SpeedFasterSlower
Overfitting RiskLowHigher (needs tuning)
ExampleRandom ForestAdaBoost, XGBoost

3.5 Random Forest

PYQ: Describe the Random Forest algorithm to improve classifier accuracy. (2022, 7.5 marks)

Definition:

Random Forest is an ensemble method that builds many decision trees using:

  1. Bagging โ€” each tree is trained on a bootstrap sample.
  2. Random Feature Selection โ€” at each split, only a random subset of features is considered (not all features).

The final prediction is the majority vote (classification) or average (regression) of all trees.

Algorithm:

Input: Dataset D, number of trees T, features m (subset size)

For t = 1 to T:
  1. Draw bootstrap sample Dโ‚œ from D.
  2. Grow a decision tree on Dโ‚œ:
       At each node, randomly pick m features (m < total features).
       Split on the best among those m features.
  3. Grow tree fully (no pruning).

Prediction:
  Classification: majority vote of all T trees
  Regression: average of all T trees

Diagram:

      Dataset D
          โ”‚
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”
    โ–ผ     โ–ผ     โ–ผ
   Dโ‚    Dโ‚‚    Dโ‚ƒ    โ† bootstrap samples
    โ”‚     โ”‚     โ”‚
   Tโ‚    Tโ‚‚    Tโ‚ƒ    โ† trees (random features at each split)
    โ”‚     โ”‚     โ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”˜
          โ–ผ
     Majority Vote
          โ”‚
          โ–ผ
    Final Prediction

Why random feature selection?

  • If one feature is very strong, all trees in bagging will use it โ†’ trees are correlated.
  • Randomly selecting features at each split decorrelates trees.
  • Decorrelated trees make better ensemble predictions.

Feature Importance in Random Forest:

  • Tracks how often each feature is used for splits and how much it reduces impurity.
  • Gives a natural feature importance ranking.

Advantages:

โœ… Very accurate โ€” one of the best general-purpose algorithms.
โœ… Handles thousands of features without feature selection.
โœ… Provides feature importance scores.
โœ… Robust to overfitting.
โœ… Handles missing values well.
โœ… OOB error is a built-in validation estimate.

Disadvantages:

โŒ Slow to predict (must traverse T trees).
โŒ Not interpretable (black box).
โŒ More memory usage.

Applications:

  • Stock market prediction
  • Medical diagnosis
  • Credit scoring
  • Remote sensing (land cover classification)

3.6 Comparison: Bagging vs Boosting vs Random Forest

PYQ: Compare and contrast Boosting, Bagging, and Random Forest as Ensemble Methods. Provide insights into scenarios where each ensemble method is most effective. (2024, 15 marks)

While Bagging, Boosting, and Random Forest are all ensemble methods, each combines models differently. Here is a side-by-side comparison.

Side-by-Side Comparison

FeatureBaggingBoostingRandom Forest
ExecutionParallelSequentialParallel
FocusReduces varianceReduces biasReduces variance + prevents overfitting
Model DependencyIndependent modelsModels depend on previous onesIndependent decision trees
Sample StrategyBootstrap (with replacement)Re-weighted samples (focus on errors)Bootstrap + random feature subset
Base LearnerAny (often decision trees)Any (often weak learners)Always decision trees
PerformanceStable resultsHigher accuracy but risk of overfittingBalanced trade-off
SpeedFast (can be parallelized)Slower (sequential steps)Moderate
AggregationMajority vote / averageWeighted sumMajority vote / average
Example AlgorithmBagged Decision TreesAdaBoost, XGBoost, Gradient BoostingRandom Forest

How Each Combines Models to Improve Performance

Bagging (Bootstrap Aggregation):

  • Multiple models are trained in parallel on bootstrapped subsets of data.
  • Final prediction is a majority vote (classification) or average (regression).
  • Reduces variance and increases stability of the model.

Boosting:

  • Models are trained sequentially, where each new model focuses on errors made by the previous one.
  • Misclassified data points are re-weighted so the next learner pays more attention to them.
  • Final model is a weighted sum of all weak learners โ€” reduces both bias and variance.

Random Forest:

  • Builds many decision trees using bagging plus a random subset of features at each split.
  • Random feature selection decorrelates the trees, which makes the ensemble stronger.
  • Final prediction is a majority vote (classification) or average (regression).

Use Cases & Effectiveness Scenarios

ScenarioMost Effective MethodWhy
High-variance models (e.g., deep decision trees)BaggingReduces variance and prevents overfitting on unstable learners
High-bias models (e.g., shallow trees, stumps)BoostingIteratively reduces bias by focusing on errors
Large, high-dimensional datasetsRandom ForestRandom feature selection handles many features and resists overfitting
Need for interpretabilityRandom ForestTree structures + feature importance allow some interpretability
Anomaly / outlier detectionBoosting (e.g., AdaBoost, Isolation Forest)Focuses iteratively on hard-to-fit points
Unstructured / noisy dataBaggingAveraging reduces noise and increases reliability
Need for top accuracy on tabular dataBoosting (XGBoost / LightGBM)Typically delivers state-of-the-art results
Need for stable, low-tuning baselineRandom ForestWorks well out of the box, minimal hyperparameter tuning

Summary

  • Bagging = best when the base model is unstable and overfitting is the problem.
  • Boosting = best when the base model is too simple and bias is the problem.
  • Random Forest = best all-rounder โ€” combines the variance-reduction of bagging with the decorrelation of random feature selection, giving balanced, robust performance.

Section 4: Model Evaluation

4.1 Why Evaluate?

A trained model must be tested on unseen data to check if it generalizes well. Evaluation metrics tell us how good the model is and where it fails.


4.2 Confusion Matrix

PYQ: Confusion Matrix. (2023, 7.5 marks)
PYQ: Confusion Matrix. (2022, 7.5 marks)

Definition:

A Confusion Matrix is a table that compares the predicted class labels against the actual class labels for a classification model.

For binary classification (2 classes: Positive and Negative):

                    PREDICTED
                  Positive   Negative
              โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
ACTUAL  Pos   โ”‚    TP    โ”‚    FN    โ”‚
        Neg   โ”‚    FP    โ”‚    TN    โ”‚
              โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
TermFull FormMeaning
TPTrue PositivePredicted Positive, Actually Positive โœ…
TNTrue NegativePredicted Negative, Actually Negative โœ…
FPFalse PositivePredicted Positive, Actually Negative โŒ (Type I error)
FNFalse NegativePredicted Negative, Actually Positive โŒ (Type II error)

Example:

10 emails: 6 spam, 4 not-spam.
Model predicts: 5 spam, 5 not-spam.

              Predicted
              Spam   Not-Spam
Actual Spam  โ”‚ 4  โ”‚   2  โ”‚   (4 correct, 2 missed)
       Not   โ”‚ 1  โ”‚   3  โ”‚   (1 false alarm, 3 correct)

TP=4, FN=2, FP=1, TN=3


4.3 Accuracy

Definition: Percentage of all predictions that were correct.

Formula:

              TP + TN
Accuracy  =  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
             TP + TN + FP + FN

Example:

Accuracy = (4+3) / (4+2+1+3) = 7/10 = 0.70 = 70%

Limitation: Misleading when classes are imbalanced.

Example: 95% of emails are "Not Spam". A model that predicts everything as "Not Spam" gets 95% accuracy but catches zero spam.


4.4 Precision

Definition: Of all the samples predicted as Positive, how many were actually Positive?

Formula:

              TP
Precision = โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
            TP + FP

Example:

Precision = 4 / (4+1) = 4/5 = 0.80 = 80%

High Precision needed when: False Positives are costly.

  • Example: Cancer treatment โ€” don't want to treat healthy people.

4.5 Recall (Sensitivity / True Positive Rate)

Definition: Of all actual Positive samples, how many did the model correctly find?

Formula:

            TP
Recall =  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
          TP + FN

Example:

Recall = 4 / (4+2) = 4/6 โ‰ˆ 0.67 = 67%

High Recall needed when: False Negatives are costly.

  • Example: Disease detection โ€” don't want to miss actual sick patients.

4.6 Precision vs Recall Trade-off

         High Precision โ†โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ High Recall
              โ”‚                                   โ”‚
         Few FP                              Few FN
         (don't cry wolf)              (catch everything)
              โ”‚                                   โ”‚
         Spam filter                       Cancer screening
         (avoid blocking                  (avoid missing
          good emails)                      sick patients)
  • Increasing threshold โ†’ Higher Precision, Lower Recall.
  • Decreasing threshold โ†’ Higher Recall, Lower Precision.

4.7 F1-Score

PYQ: F1-score. (2023, 2.5 marks)

Definition:

F1-Score is the harmonic mean of Precision and Recall. It balances both when you need a single metric.

Formula:

              2 ร— Precision ร— Recall
F1-Score  =  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
               Precision + Recall

Example:

F1 = 2 ร— 0.80 ร— 0.67 / (0.80 + 0.67)
   = 2 ร— 0.536 / 1.47
   = 1.072 / 1.47
   โ‰ˆ 0.729 = 72.9%

When to use F1: When the dataset is imbalanced and you need to balance Precision and Recall.

Summary of all 4 metrics:

MetricFormulaMeaning
Accuracy(TP+TN)/(TP+TN+FP+FN)Overall correctness
PrecisionTP/(TP+FP)Quality of positive predictions
RecallTP/(TP+FN)Coverage of actual positives
F1-Score2ร—Pร—R/(P+R)Balance of Precision and Recall

4.8 ROC Curve and AUC

PYQ: Explain ROC and AUC. (2022, 7.5 marks)
PYQ: ROC and AUC. (2023, 7.5 marks)
PYQ: Describe the Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) as evaluation metrics for binary classification models. How does the ROC curve visualize the trade-off between a true positive rate and a false positive rate? What insights can be derived from the AUC value? (2024, 15 marks)

ROC Curve

Definition:

The ROC Curve (Receiver Operating Characteristic Curve) is a graph that shows the trade-off between:

  • True Positive Rate (TPR / Recall) on the Y-axis.
  • False Positive Rate (FPR) on the X-axis.

As the classification threshold changes from 0 to 1, both TPR and FPR change, tracing out the curve.

Formulas:

            TP
TPR  =  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€   (= Recall = Sensitivity)
        TP + FN

            FP
FPR  =  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€   (= 1 - Specificity)
        FP + TN

How to Plot ROC Curve:

  1. Start with threshold = 1 (predict all Negative) โ†’ TPR=0, FPR=0.
  2. Gradually lower threshold.
  3. More positives are predicted โ†’ TPR increases, FPR also increases.
  4. At threshold = 0 (predict all Positive) โ†’ TPR=1, FPR=1.

Diagram:

TPR
 1 โ”‚         โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
   โ”‚      โ•ญโ”€โ”€โ•ฏ     โ† Good model (curves up-left)
   โ”‚   โ•ญโ”€โ”€โ•ฏ
   โ”‚โ•ญโ”€โ”€โ•ฏ
 0 โ”‚โ•ฏ______________
   0              1   FPR

   Diagonal line = Random classifier (AUC = 0.5)
   Top-left area = Best classifier (AUC = 1.0)

AUC (Area Under the ROC Curve)

Definition:

AUC is the area under the ROC curve. It measures the overall ability of the model to discriminate between classes.

Interpretation:

AUC ValueMeaning
AUC = 1.0Perfect classifier
AUC = 0.9โ€“1.0Excellent
AUC = 0.8โ€“0.9Good
AUC = 0.7โ€“0.8Fair
AUC = 0.6โ€“0.7Poor
AUC = 0.5Random guessing (no better than coin flip)
AUC < 0.5Worse than random

Advantages of ROC-AUC:

  • Works well with imbalanced datasets.
  • Threshold-independent โ€” evaluates model across all thresholds.
  • Single number summary of classifier performance.

4.9 Median Absolute Deviation (MAD)

PYQ: Median Absolute Deviation (MAD). (2022, 2.5 marks)
PYQ: Median Absolute Deviation. (2023, 2.5 marks)

Definition:

MAD is a robust measure of variability in predictions. It measures the median of the absolute differences between each prediction and the median prediction.

Steps:

Step 1: Find the median of all predictions/data:  M = median(y)

Step 2: Calculate absolute deviations:
         |yโ‚ - M|,  |yโ‚‚ - M|,  ...,  |yโ‚™ - M|

Step 3: MAD = median of those absolute deviations

Formula:

MAD = median( |yแตข - median(y)| )

Example:

Data: [2, 3, 5, 7, 11]

Step 1: Median = 5

Step 2: Absolute deviations:
  |2-5| = 3
  |3-5| = 2
  |5-5| = 0
  |7-5| = 2
  |11-5| = 6

  Sorted deviations: [0, 2, 2, 3, 6]

Step 3: MAD = median([0,2,2,3,6]) = 2

Why MAD over Standard Deviation?

  • Standard deviation is sensitive to outliers (uses squared differences).
  • MAD is robust โ€” outliers don't distort it.

Applications:

  • Regression model error analysis
  • Outlier detection
  • Robust statistics

4.10 Distribution of Errors

Definition:

Distribution of errors (also called residuals) refers to analyzing the pattern of prediction errors made by an ML model.

Error / Residual = Actual Value (y) - Predicted Value (ลท)
eแตข = yแตข - ลทแตข

Why Analyze Error Distribution?

  • Tells us if the model is making systematic mistakes or random ones.
  • Helps diagnose model problems.

Common Error Plots:

  1. Residual Plot: Plot errors vs predicted values. Good model = random scatter around zero.
Error
  โ”‚   โ—  โ—
  โ”‚ โ—       โ—    โ† Good (random scatter)
โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Predicted value
  โ”‚   โ—  โ—
  1. Histogram of Errors: Should be normally distributed (bell curve) with mean โ‰ˆ 0.
Count
  โ”‚      โ•ญโ”€โ”€โ”€โ•ฎ
  โ”‚   โ•ญโ”€โ”€โ•ฏ   โ•ฐโ”€โ”€โ•ฎ
  โ”‚ โ”€โ”€โ•ฏ           โ•ฐโ”€โ”€
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Error
          0
  1. Q-Q Plot: Compares error distribution to normal distribution.

Key Properties of a Well-Behaved Error Distribution:

PropertyMeaning
Mean โ‰ˆ 0No systematic bias
Normal shapeErrors are random, not patterned
Constant varianceHomoscedasticity (equal spread)
No autocorrelationErrors are independent of each other

Common Error Metrics for Regression:

MetricFormulaMeaning
MSE(1/m) ฮฃ(yแตข-ลทแตข)ยฒMean Squared Error โ€” penalizes big errors
RMSEโˆšMSERoot MSE โ€” same unit as output
MAE(1/m) ฮฃ|yแตข-ลทแตข|Mean Absolute Error โ€” robust to outliers
MADmedian(|yแตข-median(y)|)Median Absolute Deviation โ€” most robust

Quick Revision Points

K-Means:

  • Choose k โ†’ Initialize centroids โ†’ Assign points โ†’ Update centroids โ†’ Repeat.
  • Distance: Euclidean.
  • Best k: Elbow method (plot k vs WCSS).
  • Problem: sensitive to initialization and outliers. Fix: K-Means++.

Ensemble Methods:

BaggingBoostingRandom Forest
TrainingParallelSequentialParallel
ReducesVarianceBias+VarianceVariance
Key paramB (trees)T, learning rateT, m (features)
ExampleBagged TreesAdaBoost, XGBoostRandom Forest

Evaluation Metrics:

Accuracy  = (TP+TN) / (TP+TN+FP+FN)
Precision = TP / (TP+FP)
Recall    = TP / (TP+FN)
F1-Score  = 2ร—Pร—R / (P+R)
TPR       = TP / (TP+FN)   [= Recall]
FPR       = FP / (FP+TN)
MAD       = median(|yแตข - median(y)|)

ROC-AUC:

  • ROC plots TPR vs FPR at different thresholds.
  • AUC = area under ROC curve.
  • AUC = 1.0 โ†’ perfect; AUC = 0.5 โ†’ random.

Error Distribution:

  • Good model: errors are normally distributed, mean โ‰ˆ 0, random scatter.
  • MAD is more robust than standard deviation for error analysis.

Expected Exam Questions

15-Mark Questions:

  1. Differentiate between Bagging and Boosting techniques. (2023)
  2. Compare and contrast Boosting, Bagging, and Random Forest as Ensemble Methods. Discuss how these methods combine multiple models and improve overall performance. Provide insights into scenarios where each ensemble method is most effective. (2024)
  3. Describe the Receiver Operating Characteristic (ROC) curve and Area Under the Curve (AUC) as evaluation metrics for binary classification models. How does the ROC curve visualize the trade-off between TPR and FPR? What insights can be derived from the AUC value? (2024)

Short Answer Questions (2.5 marks):

  1. Boosting. (2022)
  2. Median Absolute Deviation (MAD). (2022, 2023)
  3. F1-score. (2023)

Mixed (7.5 marks):

  1. Explain the k-Means Algorithm with an example. (2022)
  2. Describe the Random Forest algorithm to improve classifier accuracy. (2022)
  3. ROC and AUC. (2022, 2023)
  4. Confusion Matrix. (2022, 2023)

These notes were compiled by Deepak Modi
Last updated: May 2026

Found an error or want to contribute?

This content is open-source and maintained by the community. Help us improve it!