ML - Semester 8

Unit 4: Unsupervised Learning & Evaluation

K-Means Clustering, Ensemble Methods (Boosting, Bagging, Random Forests), and model evaluation metrics: accuracy, confusion matrix, precision, recall, F1, ROC-AUC, MAD.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Unsupervised Learning: Clustering: K-means. Ensemble Methods: Boosting, Bagging, Random Forests. Evaluation: Performance measurement of models in terms of accuracy, confusion matrix, precision & recall, F1-score, Receiver Operating Characteristic (ROC) curve and AUC, Median Absolute Deviation (MAD), Distribution of errors.


🎯 PYQ Analysis for Unit 4

PYQs will be added after analysis; check back soon.


Section 1: Unsupervised Learning - Overview

1.1 Definition

Unsupervised Learning is a type of ML where the model is given only input data (X), with no labels (Y). The model tries to find hidden patterns, structures, or groupings on its own.

Input:   X₁, X₂, X₃, ..., Xₘ   (no labels)
           │
           ▼
     ML Algorithm
           │
           ▼
  Output: Groups / Patterns / Structure

1.2 Types of Unsupervised Learning

Type                      Goal                        Example
Clustering                Group similar data points   Customer segmentation
Association               Find co-occurrence rules    Market basket analysis
Dimensionality Reduction  Reduce features             PCA (covered in Unit 2)
Anomaly Detection         Find outliers               Fraud detection

Section 2: K-Means Clustering

2.1 What is Clustering?

Definition:

Clustering is the process of grouping similar data points together (into clusters) such that:

  • Points within the same cluster are as similar as possible.
  • Points in different clusters are as different as possible.

Analogy:

Imagine sorting a basket of mixed fruits into groups โ€” oranges together, apples together, bananas together โ€” without being told the fruit names. You group them by similarity (colour, shape, size).


2.2 What is K-Means?

Definition:

K-Means is the most popular clustering algorithm. It partitions a dataset of m points into k clusters, where each point belongs to the cluster with the nearest mean (centroid).

"K" = number of clusters (chosen by the user before training).


2.3 K-Means Algorithm: Step by Step

Input: Dataset X, number of clusters k

Step 1: Initialize
  ─ Randomly select k data points as initial centroids
    μ₁, μ₂, ..., μₖ

Step 2: Assignment (E-step)
  ─ For each data point xᵢ:
      Assign it to the nearest centroid:
      cᵢ = argmin_j  ||xᵢ - μⱼ||²

Step 3: Update (M-step)
  ─ Recompute each centroid as the mean of its assigned points:
      μⱼ = (1/|Cⱼ|) Σ xᵢ   for all xᵢ in cluster j

Step 4: Repeat
  ─ Repeat Steps 2 & 3 until centroids stop moving
    (convergence: assignments no longer change)

Output: k clusters with final centroids

Flow Diagram:

  Initialize k centroids
         │
         ▼
  Assign each point to
  nearest centroid
         │
         ▼
  Recompute centroids
  (mean of cluster)
         │
         ▼
  Centroids changed?
    /         \
  Yes          No
   │            │
   └──────┐     ▼
          │   DONE ✓
          ▼
  (loop back to assign)

2.4 Worked Example

Dataset (6 points, 2D), k=2:

Points:  A(1,1), B(1,2), C(2,1), D(5,4), E(6,5), F(5,5)

Iteration 1:

Step 1 - Initialize centroids: pick A(1,1) and D(5,4) at random.

μ₁ = (1,1)    μ₂ = (5,4)

Step 2 - Assign each point (Euclidean distances):

Point A(1,1): d(μ₁)=0.00, d(μ₂)=5.00  → Cluster 1
Point B(1,2): d(μ₁)=1.00, d(μ₂)=4.47  → Cluster 1
Point C(2,1): d(μ₁)=1.00, d(μ₂)=4.24  → Cluster 1
Point D(5,4): d(μ₁)=5.00, d(μ₂)=0.00  → Cluster 2
Point E(6,5): d(μ₁)=6.40, d(μ₂)=1.41  → Cluster 2
Point F(5,5): d(μ₁)=5.66, d(μ₂)=1.00  → Cluster 2

Step 3 - Recompute centroids:

Cluster 1 = {A, B, C}
  μ₁ = ((1+1+2)/3, (1+2+1)/3) = (4/3, 4/3) ≈ (1.33, 1.33)

Cluster 2 = {D, E, F}
  μ₂ = ((5+6+5)/3, (4+5+5)/3) = (16/3, 14/3) ≈ (5.33, 4.67)

Iteration 2 - Reassign with new centroids:

Distances from new μ₁=(1.33,1.33) and μ₂=(5.33,4.67):

A(1,1) → Cluster 1  ✓
B(1,2) → Cluster 1  ✓
C(2,1) → Cluster 1  ✓
D(5,4) → Cluster 2  ✓
E(6,5) → Cluster 2  ✓
F(5,5) → Cluster 2  ✓

Assignments unchanged → convergence!

Final Clusters:

Cluster 1: {A(1,1), B(1,2), C(2,1)}   centroid ≈ (1.33, 1.33)
Cluster 2: {D(5,4), E(6,5), F(5,5)}   centroid ≈ (5.33, 4.67)
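
For reference, the same loop in a few lines of NumPy. This is a minimal from-scratch sketch (NumPy assumed available; variable names are illustrative), seeding the centroids at A and D exactly as in Step 1 and reproducing the final clusters:

import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1], [5, 4], [6, 5], [5, 5]], dtype=float)
centroids = X[[0, 3]].copy()   # initialize at A(1,1) and D(5,4)

for _ in range(100):           # safety cap on iterations
    # Assignment step: distance from every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centroids, centroids):   # centroids stopped moving
        break
    centroids = new_centroids

print(labels)      # [0 0 0 1 1 1]  ->  {A,B,C} and {D,E,F}
print(centroids)   # approx. [[1.33 1.33] [5.33 4.67]]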

2.5 Choosing K: The Elbow Method

The user must specify k before running K-Means. To find the best k:

Elbow Method:

  • Run K-Means for k = 1, 2, 3, ..., n.
  • For each k, compute WCSS (Within-Cluster Sum of Squares):
WCSS = Σ  Σ  ||xᵢ - μⱼ||²
       j  xᵢ∈Cⱼ
  • Plot k vs WCSS. The "elbow" (sharp bend) gives the best k.
WCSS
│\
│ \
│  \
│   \──────────────    ← elbow here (best k = 3)
│             (flat)
└──────────────────→ k
  1  2  3  4  5  6
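
As a sketch (scikit-learn and matplotlib assumed available; the make_blobs toy data is purely illustrative), WCSS is exposed as the fitted model's inertia_ attribute, so the elbow plot takes only a few lines:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # toy dataset

ks = range(1, 7)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
        for k in ks]           # inertia_ = WCSS for that k

plt.plot(ks, wcss, marker="o")
plt.xlabel("k")
plt.ylabel("WCSS")
plt.show()                     # look for the sharp bend (the elbow)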

2.6 Advantages and Disadvantages

Advantages:

✅ Simple and fast: O(n·k·i), where i = number of iterations.
✅ Scales well to large datasets.
✅ Easy to implement and interpret.

Disadvantages:

โŒ Must specify k in advance.
โŒ Sensitive to initial centroid placement (different runs โ†’ different results).
โŒ Assumes spherical, equally-sized clusters.
โŒ Sensitive to outliers (outliers distort centroids).
โŒ Fails on non-convex cluster shapes.

Fix for initialization: Use K-Means++ โ€” smarter initialization that spreads centroids apart.

Applications:

  • Customer segmentation
  • Document clustering
  • Image compression
  • Anomaly detection

Section 3: Ensemble Methods

3.1 What are Ensemble Methods?

Definition:

Ensemble Methods combine multiple individual models (weak learners) to produce one stronger, more accurate model.

Core Idea: Wisdom of the crowd; many weak models voting together beat one strong model.

     Model 1 ─────┐
     Model 2 ─────┼── Combine ──► Final Prediction
     Model 3 ─────┘

Why do they work?

  • Individual models may make different errors.
  • Combining them cancels out individual errors.
  • Result: lower variance, lower bias, higher accuracy.

Three main strategies:

Method         Core Idea                                               Models Trained
Bagging        Train models in parallel on random subsets              Independent
Boosting       Train models sequentially, each fixing previous errors  Sequential
Random Forest  Bagging + random feature selection                      Independent

3.2 Bagging (Bootstrap Aggregating)

Definition:

Bagging creates multiple models by training each on a random bootstrap sample (sample with replacement) of the training data. Final prediction = majority vote (classification) or average (regression).

"Bootstrap" = random sampling with replacement.

Algorithm:

Input: Training dataset D, number of models B

For b = 1 to B:
  1. Draw a bootstrap sample D_b from D (with replacement).
  2. Train a model M_b on D_b.

Prediction for new point X:
  Classification: majority vote of M₁(X), M₂(X), ..., M_B(X)
  Regression:     average of M₁(X), M₂(X), ..., M_B(X)

Diagram:

        Original Dataset (D)
               │
    ┌──────────┼──────────┐
    ▼          ▼          ▼
 Sample D₁  Sample D₂  Sample D₃   (bootstrap samples)
    │          │          │
  Model 1   Model 2   Model 3       (trained independently)
    │          │          │
    └──────────┼──────────┘
               ▼
          Majority Vote
               │
               ▼
         Final Prediction
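
A hedged scikit-learn sketch of the above (the make_classification toy data is illustrative; the estimator parameter name follows recent scikit-learn versions, where it replaced base_estimator):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)   # toy data

bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # base learner for each bootstrap sample
    n_estimators=50,                      # B = 50 models, trained independently
    oob_score=True,                       # score on each model's left-out points
    random_state=0,
).fit(X, y)

print(bag.oob_score_)   # the Out-of-Bag estimate described just below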

Out-of-Bag (OOB) Error:

  • Each bootstrap sample leaves out ~37% of training data.
  • These left-out points can be used as a validation set for free.
  • Called the Out-of-Bag (OOB) error: a built-in cross-validation estimate.

Advantages:

✅ Reduces variance (prevents overfitting).
✅ Parallel training (fast).
✅ Works well with high-variance models (e.g., deep decision trees).

Disadvantages:

❌ Does not reduce bias.
❌ Less interpretable than a single model.


3.3 Boosting

Definition:

Boosting trains models sequentially. Each new model focuses on correcting the mistakes of the previous model by giving higher weight to misclassified samples.

Core Idea: Start weak, get stronger step by step.

Algorithm (general):

Input: Training dataset D, number of models T

1. Initialize equal weights for all training points: wᵢ = 1/m

For t = 1 to T:
  2. Train model Mₜ on the weighted dataset.
  3. Calculate error: εₜ = Σ wᵢ × I(Mₜ(xᵢ) ≠ yᵢ)
  4. Compute model weight: αₜ = 0.5 × ln[(1-εₜ)/εₜ]
  5. Update sample weights:
       Misclassified: wᵢ ← wᵢ × e^(αₜ)   (increase weight)
       Correct:       wᵢ ← wᵢ × e^(-αₜ)  (decrease weight)
  6. Normalize weights.

Final prediction:
  Ŷ = sign[ Σ αₜ × Mₜ(X) ]

Diagram:

  Dataset
     │
     ▼
  Model 1 ──► Errors ──► Increase weights of errors
     │
     ▼
  Model 2 ──► Errors ──► Increase weights of errors
     │
     ▼
  Model 3 ──► ...
     │
     ▼
  Weighted Vote ──► Final Prediction
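
For reference, AdaBoost with decision stumps in scikit-learn. This is a minimal sketch (toy data assumed; as with bagging, estimator replaced base_estimator in recent versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)   # toy data

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),   # decision stump = weak learner
    n_estimators=50,                                 # T = 50 sequential rounds
    random_state=0,
).fit(X, y)

print(ada.score(X, y))   # accuracy of the weighted-vote ensemble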

Popular Boosting Algorithms:

Algorithm          Description
AdaBoost           Original boosting; adjusts sample weights
Gradient Boosting  Fits each new model to the residual errors
XGBoost            Optimized gradient boosting with regularization
LightGBM           Fast, scalable gradient boosting

Advantages:

✅ Reduces both bias and variance.
✅ Often achieves the best accuracy on tabular data.
✅ Handles complex patterns.

Disadvantages:

❌ Sequential: cannot be parallelized easily.
❌ Prone to overfitting if run for too many rounds.
❌ Sensitive to noisy data and outliers (their errors get amplified).


3.4 Bagging vs Boosting

Feature           Bagging            Boosting
Training          Parallel           Sequential
Focus             Reduces variance   Reduces bias
Weights           All samples equal  Misclassified samples get higher weight
Error fixed       Overfitting        Underfitting
Speed             Faster             Slower
Overfitting risk  Low                Higher (needs tuning)
Example           Random Forest      AdaBoost, XGBoost

3.5 Random Forest

Definition:

Random Forest is an ensemble method that builds many decision trees using:

  1. Bagging: each tree is trained on a bootstrap sample.
  2. Random Feature Selection: at each split, only a random subset of features is considered (not all features).

The final prediction is the majority vote (classification) or average (regression) of all trees.

Algorithm:

Input: Dataset D, number of trees T, features m (subset size)

For t = 1 to T:
  1. Draw a bootstrap sample Dₜ from D.
  2. Grow a decision tree on Dₜ:
       At each node, randomly pick m features (m < total features).
       Split on the best among those m features.
  3. Grow tree fully (no pruning).

Prediction:
  Classification: majority vote of all T trees
  Regression: average of all T trees

Diagram:

      Dataset D
          │
    ┌─────┼─────┐
    ▼     ▼     ▼
   D₁    D₂    D₃    ← bootstrap samples
    │     │     │
   T₁    T₂    T₃    ← trees (random features at each split)
    │     │     │
    └─────┼─────┘
          ▼
     Majority Vote
          │
          ▼
    Final Prediction
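
A scikit-learn sketch (toy data assumed): max_features plays the role of m, and the two printed attributes correspond to the OOB error and the feature importance ranking discussed below:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)   # toy data

rf = RandomForestClassifier(
    n_estimators=100,      # T = 100 trees
    max_features="sqrt",   # m = sqrt(total features) tried at each split
    oob_score=True,
    random_state=0,
).fit(X, y)

print(rf.oob_score_)             # built-in OOB validation accuracy
print(rf.feature_importances_)   # impurity-based importance per feature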

Why random feature selection?

  • If one feature is very strong, all trees in plain bagging will use it → the trees become correlated.
  • Randomly selecting features at each split decorrelates trees.
  • Decorrelated trees make better ensemble predictions.

Feature Importance in Random Forest:

  • Tracks how often each feature is used for splits and how much it reduces impurity.
  • Gives a natural feature importance ranking.

Advantages:

✅ Very accurate: one of the best general-purpose algorithms.
✅ Handles thousands of features without feature selection.
✅ Provides feature importance scores.
✅ Robust to overfitting.
✅ Handles missing values well.
✅ OOB error gives a built-in validation estimate.

Disadvantages:

โŒ Slow to predict (must traverse T trees).
โŒ Not interpretable (black box).
โŒ More memory usage.

Applications:

  • Stock market prediction
  • Medical diagnosis
  • Credit scoring
  • Remote sensing (land cover classification)

Section 4: Model Evaluation

4.1 Why Evaluate?

A trained model must be tested on unseen data to check if it generalizes well. Evaluation metrics tell us how good the model is and where it fails.


4.2 Confusion Matrix

Definition:

A Confusion Matrix is a table that compares the predicted class labels against the actual class labels for a classification model.

For binary classification (2 classes: Positive and Negative):

                    PREDICTED
                  Positive   Negative
              ┌──────────┬──────────┐
ACTUAL  Pos   │    TP    │    FN    │
        Neg   │    FP    │    TN    │
              └──────────┴──────────┘

Term  Full Form       Meaning
TP    True Positive   Predicted Positive, Actually Positive ✅
TN    True Negative   Predicted Negative, Actually Negative ✅
FP    False Positive  Predicted Positive, Actually Negative ❌ (Type I error)
FN    False Negative  Predicted Negative, Actually Positive ❌ (Type II error)

Example:

10 emails: 6 spam, 4 not-spam.
Model predicts: 5 spam, 5 not-spam.

              Predicted
              Spam   Not-Spam
Actual Spam  │ 4  │   2  │   (4 correct, 2 missed)
       Not   │ 1  │   3  │   (1 false alarm, 3 correct)

TP=4, FN=2, FP=1, TN=3
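
The same matrix can be cross-checked with scikit-learn. Note that confusion_matrix orders rows and columns by label value (0 first), so with spam = 1 the layout is [[TN, FP], [FN, TP]], not the textbook layout above:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # 6 spam (1), 4 not-spam (0)
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]   # model: 5 spam, 5 not-spam

print(confusion_matrix(y_true, y_pred))
# [[3 1]
#  [2 4]]   ->  TN=3, FP=1, FN=2, TP=4, matching the table above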


4.3 Accuracy

Definition: Percentage of all predictions that were correct.

Formula:

              TP + TN
Accuracy  =  ─────────────────
             TP + TN + FP + FN

Example:

Accuracy = (4+3) / (4+2+1+3) = 7/10 = 0.70 = 70%

Limitation: Misleading when classes are imbalanced.

Example: 95% of emails are "Not Spam". A model that predicts everything as "Not Spam" gets 95% accuracy but catches zero spam.


4.4 Precision

Definition: Of all the samples predicted as Positive, how many were actually Positive?

Formula:

              TP
Precision = ────────
            TP + FP

Example:

Precision = 4 / (4+1) = 4/5 = 0.80 = 80%

High Precision needed when: False Positives are costly.

  • Example: Cancer treatment: you don't want to treat healthy people.

4.5 Recall (Sensitivity / True Positive Rate)

Definition: Of all actual Positive samples, how many did the model correctly find?

Formula:

            TP
Recall =  ────────
          TP + FN

Example:

Recall = 4 / (4+2) = 4/6 ≈ 0.67 = 67%

High Recall needed when: False Negatives are costly.

  • Example: Disease detection: you don't want to miss actually sick patients.

4.6 Precision vs Recall Trade-off

         High Precision ←──────────────────→ High Recall
              │                                   │
         Few FP                              Few FN
         (don't cry wolf)              (catch everything)
              │                                   │
         Spam filter                       Cancer screening
         (avoid blocking                  (avoid missing
          good emails)                      sick patients)

  • Increasing the threshold → higher Precision, lower Recall.
  • Decreasing the threshold → higher Recall, lower Precision.

4.7 F1-Score

Definition:

F1-Score is the harmonic mean of Precision and Recall. It balances both when you need a single metric.

Formula:

              2 × Precision × Recall
F1-Score  =  ────────────────────────
               Precision + Recall

Example:

F1 = 2 × 0.80 × 0.667 / (0.80 + 0.667)
   = 1.067 / 1.467
   ≈ 0.727 = 72.7%

When to use F1: When the dataset is imbalanced and you need to balance Precision and Recall.

Summary of all 4 metrics:

Metric     Formula                Meaning
Accuracy   (TP+TN)/(TP+TN+FP+FN)  Overall correctness
Precision  TP/(TP+FP)             Quality of positive predictions
Recall     TP/(TP+FN)             Coverage of actual positives
F1-Score   2×P×R/(P+R)            Balance of Precision and Recall
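
A quick cross-check of all four metrics on the spam example (scikit-learn assumed; labels as in the confusion-matrix sketch earlier):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

print(accuracy_score(y_true, y_pred))    # 0.70
print(precision_score(y_true, y_pred))   # 0.80
print(recall_score(y_true, y_pred))      # 0.666...
print(f1_score(y_true, y_pred))          # 0.727...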

4.8 ROC Curve and AUC

ROC Curve

Definition:

The ROC Curve (Receiver Operating Characteristic Curve) is a graph that shows the trade-off between:

  • True Positive Rate (TPR / Recall) on the Y-axis.
  • False Positive Rate (FPR) on the X-axis.

As the classification threshold changes from 0 to 1, both TPR and FPR change, tracing out the curve.

Formulas:

            TP
TPR  =  ────────   (= Recall = Sensitivity)
        TP + FN

            FP
FPR  =  ────────   (= 1 - Specificity)
        FP + TN

How to Plot ROC Curve:

  1. Start with threshold = 1 (predict all Negative) → TPR=0, FPR=0.
  2. Gradually lower the threshold.
  3. More positives are predicted → TPR increases, FPR also increases.
  4. At threshold = 0 (predict all Positive) → TPR=1, FPR=1.

Diagram:

TPR
 1 │         ╭──────────
   │      ╭──╯     ← Good model (curves up-left)
   │   ╭──╯
   │╭──╯
 0 │╯______________
   0              1   FPR

   Diagonal line = Random classifier (AUC = 0.5)
   Top-left area = Best classifier (AUC = 1.0)

AUC (Area Under the ROC Curve)

Definition:

AUC is the area under the ROC curve. It measures the overall ability of the model to discriminate between classes.

Interpretation:

AUC Value  Meaning
1.0        Perfect classifier
0.9 - 1.0  Excellent
0.8 - 0.9  Good
0.7 - 0.8  Fair
0.6 - 0.7  Poor
0.5        Random guessing (no better than a coin flip)
< 0.5      Worse than random

Advantages of ROC-AUC:

  • Works well with imbalanced datasets.
  • Threshold-independent: evaluates the model across all thresholds.
  • Single number summary of classifier performance.
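
A minimal sketch (scikit-learn assumed; the scores below are made-up probabilities): roc_curve sweeps the threshold over predicted scores (it needs scores, not hard 0/1 labels), and roc_auc_score returns the area directly:

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1, 1, 0]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.55]   # hypothetical model scores

fpr, tpr, thresholds = roc_curve(y_true, y_scores)   # one (FPR, TPR) pair per threshold
print(roc_auc_score(y_true, y_scores))               # single-number summary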

4.9 Median Absolute Deviation (MAD)

Definition:

MAD is a robust measure of variability in predictions. It measures the median of the absolute differences between each prediction and the median prediction.

Steps:

Step 1: Find the median of all predictions/data:  M = median(y)

Step 2: Calculate absolute deviations:
         |y₁ - M|,  |y₂ - M|,  ...,  |yₙ - M|

Step 3: MAD = median of those absolute deviations

Formula:

MAD = median( |yᵢ - median(y)| )

Example:

Data: [2, 3, 5, 7, 11]

Step 1: Median = 5

Step 2: Absolute deviations:
  |2-5| = 3
  |3-5| = 2
  |5-5| = 0
  |7-5| = 2
  |11-5| = 6

  Sorted deviations: [0, 2, 2, 3, 6]

Step 3: MAD = median([0,2,2,3,6]) = 2
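
The three steps collapse to one line of NumPy (scipy.stats.median_abs_deviation gives the same value when its scale parameter is left at 1):

import numpy as np

y = np.array([2, 3, 5, 7, 11])
mad = np.median(np.abs(y - np.median(y)))   # Steps 1-3 in one expression
print(mad)   # 2.0, as in the worked example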

Why MAD over Standard Deviation?

  • Standard deviation is sensitive to outliers (uses squared differences).
  • MAD is robust: outliers don't distort it.

Applications:

  • Regression model error analysis
  • Outlier detection
  • Robust statistics

4.10 Distribution of Errors

Definition:

Distribution of errors (also called residuals) refers to analyzing the pattern of prediction errors made by an ML model.

Error / Residual = Actual Value (y) - Predicted Value (ŷ)
eᵢ = yᵢ - ŷᵢ

Why Analyze Error Distribution?

  • Tells us if the model is making systematic mistakes or random ones.
  • Helps diagnose model problems.

Common Error Plots:

  1. Residual Plot: Plot errors vs predicted values. Good model = random scatter around zero.

Error
  │   ●  ●
  │ ●       ●    ← Good (random scatter)
──┼──────────────→ Predicted value
  │   ●  ●

  2. Histogram of Errors: Should be roughly normally distributed (bell curve) with mean ≈ 0.

Count
  │      ╭───╮
  │   ╭──╯   ╰──╮
  │ ──╯           ╰──
  └──────────────────→ Error
          0

  3. Q-Q Plot: Compares the error distribution to a normal distribution.

Key Properties of a Well-Behaved Error Distribution:

Property            Meaning
Mean ≈ 0            No systematic bias
Normal shape        Errors are random, not patterned
Constant variance   Homoscedasticity (equal spread)
No autocorrelation  Errors are independent of each other

Common Error Metrics for Regression:

Metric  Formula                 Meaning
MSE     (1/m) Σ(yᵢ-ŷᵢ)²         Mean Squared Error; penalizes big errors
RMSE    √MSE                    Root MSE; same unit as the output
MAE     (1/m) Σ|yᵢ-ŷᵢ|          Mean Absolute Error; robust to outliers
MAD     median(|yᵢ-median(y)|)  Median Absolute Deviation; most robust
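
As a sketch, the table's four error metrics computed on hypothetical predictions (values invented for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.4, 10.7])   # hypothetical model output
e = y_true - y_pred                             # residuals

mse  = np.mean(e ** 2)
rmse = np.sqrt(mse)
mae  = np.mean(np.abs(e))
mad  = np.median(np.abs(e - np.median(e)))      # MAD of the residuals

print(mse, rmse, mae, mad)
print(e.mean())   # close to 0  ->  no systematic bias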

Quick Revision Points

K-Means:

  • Choose k → initialize centroids → assign points → update centroids → repeat.
  • Distance: Euclidean.
  • Best k: Elbow method (plot k vs WCSS).
  • Problem: sensitive to initialization and outliers. Fix: K-Means++.

Ensemble Methods:

            Bagging       Boosting           Random Forest
Training    Parallel      Sequential         Parallel
Reduces     Variance      Bias + Variance    Variance
Key param   B (trees)     T, learning rate   T, m (features)
Example     Bagged Trees  AdaBoost, XGBoost  Random Forest

Evaluation Metrics:

Accuracy  = (TP+TN) / (TP+TN+FP+FN)
Precision = TP / (TP+FP)
Recall    = TP / (TP+FN)
F1-Score  = 2×P×R / (P+R)
TPR       = TP / (TP+FN)   [= Recall]
FPR       = FP / (FP+TN)
MAD       = median(|yᵢ - median(y)|)

ROC-AUC:

  • ROC plots TPR vs FPR at different thresholds.
  • AUC = area under ROC curve.
  • AUC = 1.0 → perfect; AUC = 0.5 → random.

Error Distribution:

  • Good model: errors are normally distributed, mean ≈ 0, random scatter.
  • MAD is more robust than standard deviation for error analysis.

Expected Exam Questions

PYQs will be added after analysis; check back soon.


These notes were compiled by Deepak Modi
Last updated: May 2026

Found an error or want to contribute?

This content is open-source and maintained by the community. Help us improve it!