Syllabus:
Unsupervised Learning: Clustering: K-means. Ensemble Methods: Boosting, Bagging, Random Forests. Evaluation: Performance measurement of models in terms of accuracy, confusion matrix, precision & recall, F1-score, Receiver Operating Characteristic (ROC) curve and AUC, Median Absolute Deviation (MAD), Distribution of errors.
PYQ Analysis for Unit 4
PYQs will be added after analysis - check back soon.
Section 1: Unsupervised Learning - Overview
1.1 Definition
Unsupervised Learning is a type of ML where the model is given only input data (X) - no labels (Y). The model tries to find hidden patterns, structures, or groupings on its own.
Input: X₁, X₂, X₃, ..., Xₘ (no labels)
            │
            ▼
      ML Algorithm
            │
            ▼
Output: Groups / Patterns / Structure
1.2 Types of Unsupervised Learning
| Type | Goal | Example |
|---|---|---|
| Clustering | Group similar data points | Customer segmentation |
| Association | Find co-occurrence rules | Market basket analysis |
| Dimensionality Reduction | Reduce features | PCA (covered in Unit 2) |
| Anomaly Detection | Find outliers | Fraud detection |
Section 2: K-Means Clustering
2.1 What is Clustering?
Definition:
Clustering is the process of grouping similar data points together (into clusters) such that:
- Points within the same cluster are as similar as possible.
- Points in different clusters are as different as possible.
Analogy:
Imagine sorting a basket of mixed fruits into groups - oranges together, apples together, bananas together - without being told the fruit names. You group them by similarity (colour, shape, size).
2.2 What is K-Means?
Definition:
K-Means is the most popular clustering algorithm. It partitions a dataset of m points into k clusters, where each point belongs to the cluster with the nearest mean (centroid).
"K" = number of clusters (chosen by the user before training).
2.3 K-Means Algorithm โ Step by Step
Input: Dataset X, number of clusters k
Step 1: Initialize
  Randomly select k data points as initial centroids:
  μ₁, μ₂, ..., μₖ
Step 2: Assignment (E-step)
  For each data point xᵢ, assign it to the nearest centroid:
  cᵢ = argminⱼ ||xᵢ - μⱼ||²
Step 3: Update (M-step)
  Recompute each centroid as the mean of its assigned points:
  μⱼ = (1/|Cⱼ|) Σ xᵢ   for all xᵢ in cluster j
Step 4: Repeat
  Repeat Steps 2 & 3 until centroids stop moving
  (convergence - assignments don't change)
Output: k clusters with final centroids
Flow Diagram:
   Initialize k centroids
            │
            ▼
   Assign each point to   ◄──────────────┐
   nearest centroid                      │
            │                            │
            ▼                            │
   Recompute centroids                   │
   (mean of cluster)                     │
            │                            │
            ▼                            │
   Centroids changed? ── Yes (loop back)─┘
            │
            No
            ▼
          DONE
2.4 Worked Example
Dataset (6 points, 2D), k=2:
Points: A(1,1), B(1,2), C(2,1), D(5,4), E(6,5), F(5,5)
Iteration 1:
Step 1 - Initialize centroids: Pick A(1,1) and D(5,4) randomly.
  μ₁ = (1,1)   μ₂ = (5,4)
Step 2 - Assign each point:
  Point A(1,1): d(μ₁)=0.00, d(μ₂)=5.00 → Cluster 1
  Point B(1,2): d(μ₁)=1.00, d(μ₂)=4.47 → Cluster 1
  Point C(2,1): d(μ₁)=1.00, d(μ₂)=4.24 → Cluster 1
  Point D(5,4): d(μ₁)=5.00, d(μ₂)=0.00 → Cluster 2
  Point E(6,5): d(μ₁)=6.40, d(μ₂)=1.41 → Cluster 2
  Point F(5,5): d(μ₁)=5.66, d(μ₂)=1.00 → Cluster 2
Step 3 - Recompute centroids:
  Cluster 1 = {A, B, C}
  μ₁ = ((1+1+2)/3, (1+2+1)/3) = (4/3, 4/3) ≈ (1.33, 1.33)
  Cluster 2 = {D, E, F}
  μ₂ = ((5+6+5)/3, (4+5+5)/3) = (16/3, 14/3) ≈ (5.33, 4.67)
Iteration 2 - Reassign with new centroids:
Distances from new μ₁=(1.33,1.33) and μ₂=(5.33,4.67):
  A(1,1) → Cluster 1 ✓
  B(1,2) → Cluster 1 ✓
  C(2,1) → Cluster 1 ✓
  D(5,4) → Cluster 2 ✓
  E(6,5) → Cluster 2 ✓
  F(5,5) → Cluster 2 ✓
Assignments unchanged → Convergence!
Final Clusters:
  Cluster 1: {A(1,1), B(1,2), C(2,1)}   centroid ≈ (1.33, 1.33)
  Cluster 2: {D(5,4), E(6,5), F(5,5)}   centroid ≈ (5.33, 4.67)
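The iterations above can be reproduced with a short from-scratch NumPy sketch (a minimal illustration, not a production implementation; it assumes no cluster ever becomes empty):

```python
import numpy as np

def kmeans(X, init_centroids, max_iter=100):
    """Plain K-Means: assign each point to its nearest centroid (E-step),
    then move each centroid to the mean of its points (M-step)."""
    centroids = np.asarray(init_centroids, dtype=float)
    for _ in range(max_iter):
        # E-step: distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  for j in range(len(centroids))])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# The six points of the worked example, initialized at A(1,1) and D(5,4)
X = np.array([[1, 1], [1, 2], [2, 1], [5, 4], [6, 5], [5, 5]], dtype=float)
labels, centroids = kmeans(X, init_centroids=[[1, 1], [5, 4]])
print(labels)     # [0 0 0 1 1 1]
print(centroids)  # centroids ≈ (1.33, 1.33) and (5.33, 4.67)
```

The same two clusters and centroids fall out after one update, matching the hand computation.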
2.5 Choosing K โ The Elbow Method
The user must specify k before running K-Means. To find the best k:
Elbow Method:
- Run K-Means for k = 1, 2, 3, ..., n.
- For each k, compute WCSS (Within-Cluster Sum of Squares):
  WCSS = Σⱼ Σ_{xᵢ∈Cⱼ} ||xᵢ - μⱼ||²
- Plot k vs WCSS. The "elbow" (sharp bend) gives the best k.
  WCSS
   │\
   │ \
   │  \
   │   \_____________  ← elbow here (best k = 3)
   │           (flat)
   └───────────────────── k
     1   2   3   4   5   6
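The WCSS of a clustering is easy to compute directly; here is a sketch evaluated on the final clustering of the worked example (the `wcss` helper name is just for illustration):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-Cluster Sum of Squares: total squared distance of each
    point to its own cluster's centroid."""
    return sum(np.sum((X[labels == j] - c) ** 2)
               for j, c in enumerate(centroids))

# Final clustering of the worked example
X = np.array([[1, 1], [1, 2], [2, 1], [5, 4], [6, 5], [5, 5]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
centroids = np.array([[4/3, 4/3], [16/3, 14/3]])
print(wcss(X, labels, centroids))  # 8/3 ≈ 2.67
```

Running this for several values of k and plotting the results gives the elbow curve above.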
2.6 Advantages and Disadvantages
Advantages:
✓ Simple and fast - O(n·k·i), where i = number of iterations.
✓ Scales well to large datasets.
✓ Easy to implement and interpret.
Disadvantages:
✗ Must specify k in advance.
✗ Sensitive to initial centroid placement (different runs → different results).
✗ Assumes spherical, equally-sized clusters.
✗ Sensitive to outliers (outliers distort centroids).
✗ Fails on non-convex cluster shapes.
Fix for initialization: Use K-Means++ - smarter initialization that spreads centroids apart.
Applications:
- Customer segmentation
- Document clustering
- Image compression
- Anomaly detection
Section 3: Ensemble Methods
3.1 What are Ensemble Methods?
Definition:
Ensemble Methods combine multiple individual models (weak learners) to produce one stronger, more accurate model.
Core Idea: Wisdom of the crowd - many weak models voting together beat one strong model.
   Model 1 ──┐
   Model 2 ──┼── Combine ──► Final Prediction
   Model 3 ──┘
Why do they work?
- Individual models may make different errors.
- Combining them cancels out individual errors.
- Result: lower variance (bagging), lower bias (boosting), higher accuracy.
Three main strategies:
| Method | Core Idea | Models Trained |
|---|---|---|
| Bagging | Train models in parallel on random subsets | Independent |
| Boosting | Train models sequentially, each fixing previous errors | Sequential |
| Random Forest | Bagging + random feature selection | Independent |
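The "wisdom of the crowd" effect can be simulated in a few lines. Each simulated "weak model" below is just a classifier that is right with probability 0.7 (a made-up number, not a real learner); majority voting across many independent ones is almost always right:

```python
import random
from collections import Counter

random.seed(42)

def weak_model(true_label):
    """A toy classifier: correct 70% of the time, wrong otherwise."""
    return true_label if random.random() < 0.7 else 1 - true_label

def ensemble_predict(true_label, n_models=101):
    """Majority vote over many independent weak models."""
    votes = Counter(weak_model(true_label) for _ in range(n_models))
    return votes.most_common(1)[0][0]

trials = 1000
single_acc = sum(weak_model(1) == 1 for _ in range(trials)) / trials
ensemble_acc = sum(ensemble_predict(1) == 1 for _ in range(trials)) / trials
print(single_acc, ensemble_acc)  # roughly 0.70 vs nearly 1.00
```

The gain depends on the models making independent errors - which is exactly what bagging and random feature selection try to encourage.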
3.2 Bagging (Bootstrap Aggregating)
Definition:
Bagging creates multiple models by training each on a random bootstrap sample (sample with replacement) of the training data. Final prediction = majority vote (classification) or average (regression).
"Bootstrap" = random sampling with replacement.
Algorithm:
Input: Training dataset D, number of models B
For b = 1 to B:
  1. Draw a bootstrap sample D_b from D (with replacement).
  2. Train a model M_b on D_b.
Prediction for new point X:
  Classification: majority vote of M₁(X), M₂(X), ..., M_B(X)
  Regression: average of M₁(X), M₂(X), ..., M_B(X)
Diagram:
            Original Dataset (D)
                     │
         ┌───────────┼───────────┐
         ▼           ▼           ▼
     Sample D₁   Sample D₂   Sample D₃   (bootstrap samples)
         │           │           │
      Model 1     Model 2     Model 3    (trained independently)
         │           │           │
         └───────────┼───────────┘
                     ▼
               Majority Vote
                     │
                     ▼
              Final Prediction
Out-of-Bag (OOB) Error:
- Each bootstrap sample leaves out ~37% of training data.
- These left-out points can be used as a validation set for free.
- Called Out-of-Bag error - a built-in cross-validation estimate.
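The ~37% figure (it is 1 - 1/e in the limit) can be checked with a quick simulation; the sample size and seed here are arbitrary:

```python
import random

random.seed(0)
n = 10_000  # training set size (arbitrary)
# One bootstrap sample = n draws with replacement
sample = [random.randrange(n) for _ in range(n)]
# Points never drawn are the out-of-bag points
oob_fraction = 1 - len(set(sample)) / n
print(oob_fraction)  # ≈ 0.37
```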
Advantages:
✓ Reduces variance (prevents overfitting).
✓ Parallel training (fast).
✓ Works well with high-variance models (e.g., deep decision trees).
Disadvantages:
✗ Does not reduce bias.
✗ Less interpretable than a single model.
3.3 Boosting
Definition:
Boosting trains models sequentially. Each new model focuses on correcting the mistakes of the previous model by giving higher weight to misclassified samples.
Core Idea: Start weak, get stronger step by step.
Algorithm (general):
Input: Training dataset D, number of models T
1. Initialize equal weights for all training points: wᵢ = 1/m
For t = 1 to T:
  2. Train model M_t on the weighted dataset.
  3. Calculate error: ε_t = Σ wᵢ × I(M_t(xᵢ) ≠ yᵢ)
  4. Compute model weight: α_t = 0.5 × ln[(1-ε_t)/ε_t]
  5. Update sample weights:
     Misclassified: wᵢ ← wᵢ × e^(α_t)   (increase weight)
     Correct:       wᵢ ← wᵢ × e^(-α_t)  (decrease weight)
  6. Normalize weights.
Final prediction:
  Ŷ = sign[ Σ α_t × M_t(X) ]
Diagram:
   Dataset
      │
      ▼
   Model 1 ──► Errors ──► Increase weights of misclassified points
      │
      ▼
   Model 2 ──► Errors ──► Increase weights of misclassified points
      │
      ▼
   Model 3 ──► ...
      │
      ▼
   Weighted Vote ──► Final Prediction
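One round of the weight update (steps 3-5) can be sketched numerically on made-up data; the "model" here is just a fixed list of predictions, not a trained learner:

```python
import math

y      = [+1, +1, -1, -1, +1]   # true labels
y_pred = [+1, +1, -1, +1, -1]   # weak model's predictions (wrong on last two)
w = [1 / len(y)] * len(y)       # step 1: equal weights

# Step 3: weighted error of the weak model
eps = sum(wi for wi, yi, pi in zip(w, y, y_pred) if yi != pi)
# Step 4: model weight
alpha = 0.5 * math.log((1 - eps) / eps)

# Step 5: up-weight mistakes, down-weight correct points, then normalize
w = [wi * math.exp(alpha if yi != pi else -alpha)
     for wi, yi, pi in zip(w, y, y_pred)]
total = sum(w)
w = [wi / total for wi in w]
print(eps, alpha)  # 0.4, ≈0.203
print(w)
```

A classic AdaBoost property shows up: after the update, the misclassified points together carry exactly half of the total weight, so the next model must pay attention to them.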
Popular Boosting Algorithms:
| Algorithm | Description |
|---|---|
| AdaBoost | Original boosting - adjusts sample weights |
| Gradient Boosting | Fits each model to residual errors |
| XGBoost | Optimized gradient boosting with regularization |
| LightGBM | Fast, scalable gradient boosting |
Advantages:
✓ Reduces both bias and variance.
✓ Often achieves the best accuracy on tabular data.
✓ Handles complex patterns.
Disadvantages:
✗ Sequential - cannot parallelize easily.
✗ Prone to overfitting if run for too many rounds.
✗ Sensitive to noisy data and outliers (amplifies their errors).
3.4 Bagging vs Boosting
| Feature | Bagging | Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Focus | Reduces Variance | Reduces Bias |
| Weights | All samples equal | Misclassified get higher weight |
| Error Fixed | Overfitting | Underfitting |
| Speed | Faster | Slower |
| Overfitting Risk | Low | Higher (needs tuning) |
| Example | Random Forest | AdaBoost, XGBoost |
3.5 Random Forest
Definition:
Random Forest is an ensemble method that builds many decision trees using:
- Bagging - each tree is trained on a bootstrap sample.
- Random Feature Selection - at each split, only a random subset of features is considered (not all features).
The final prediction is the majority vote (classification) or average (regression) of all trees.
Algorithm:
Input: Dataset D, number of trees T, features m (subset size)
For t = 1 to T:
  1. Draw bootstrap sample D_t from D.
  2. Grow a decision tree on D_t:
     At each node, randomly pick m features (m < total features).
     Split on the best among those m features.
  3. Grow the tree fully (no pruning).
Prediction:
Classification: majority vote of all T trees
Regression: average of all T trees
Diagram:
           Dataset D
               │
        ┌──────┼──────┐
        ▼      ▼      ▼
       D₁     D₂     D₃     ← bootstrap samples
        │      │      │
       T₁     T₂     T₃     ← trees (random features at each split)
        │      │      │
        └──────┼──────┘
               ▼
         Majority Vote
               │
               ▼
        Final Prediction
Why random feature selection?
- If one feature is very strong, all trees in bagging will use it → trees are correlated.
- Randomly selecting features at each split decorrelates trees.
- Decorrelated trees make better ensemble predictions.
Feature Importance in Random Forest:
- Tracks how often each feature is used for splits and how much it reduces impurity.
- Gives a natural feature importance ranking.
Advantages:
✓ Very accurate - one of the best general-purpose algorithms.
✓ Handles thousands of features without feature selection.
✓ Provides feature importance scores.
✓ Robust to overfitting.
✓ Handles missing values well.
✓ OOB error is a built-in validation estimate.
Disadvantages:
✗ Slow to predict (must traverse T trees).
✗ Not interpretable (black box).
✗ More memory usage.
Applications:
- Stock market prediction
- Medical diagnosis
- Credit scoring
- Remote sensing (land cover classification)
Section 4: Model Evaluation
4.1 Why Evaluate?
A trained model must be tested on unseen data to check if it generalizes well. Evaluation metrics tell us how good the model is and where it fails.
4.2 Confusion Matrix
Definition:
A Confusion Matrix is a table that compares the predicted class labels against the actual class labels for a classification model.
For binary classification (2 classes: Positive and Negative):
                    PREDICTED
                 Positive   Negative
              ┌───────────┬───────────┐
ACTUAL   Pos  │    TP     │    FN     │
         Neg  │    FP     │    TN     │
              └───────────┴───────────┘
| Term | Full Form | Meaning |
|---|---|---|
| TP | True Positive | Predicted Positive, Actually Positive ✓ |
| TN | True Negative | Predicted Negative, Actually Negative ✓ |
| FP | False Positive | Predicted Positive, Actually Negative ✗ (Type I error) |
| FN | False Negative | Predicted Negative, Actually Positive ✗ (Type II error) |
Example:
10 emails: 6 spam, 4 not-spam.
Model predicts: 5 spam, 5 not-spam.
                    Predicted
                 Spam    Not-Spam
Actual  Spam   │   4    │   2    │   (4 correct, 2 missed)
        Not    │   1    │   3    │   (1 false alarm, 3 correct)
TP=4, FN=2, FP=1, TN=3
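The same counts can be recovered from raw label lists; the two lists below are constructed to match the example's numbers:

```python
# 6 actual spam, 4 actual not-spam ("ham"), predictions matching the example
actual    = ["spam"] * 6 + ["ham"] * 4
predicted = ["spam"] * 4 + ["ham"] * 2 + ["spam"] * 1 + ["ham"] * 3

tp = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))
fn = sum(a == "spam" and p == "ham"  for a, p in zip(actual, predicted))
fp = sum(a == "ham"  and p == "spam" for a, p in zip(actual, predicted))
tn = sum(a == "ham"  and p == "ham"  for a, p in zip(actual, predicted))
print(tp, fn, fp, tn)  # 4 2 1 3
```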
4.3 Accuracy
Definition: Percentage of all predictions that were correct.
Formula:
                 TP + TN
Accuracy = ─────────────────────
            TP + TN + FP + FN
Example:
Accuracy = (4+3) / (4+2+1+3) = 7/10 = 0.70 = 70%
Limitation: Misleading when classes are imbalanced.
Example: 95% of emails are "Not Spam". A model that predicts everything as "Not Spam" gets 95% accuracy but catches zero spam.
4.4 Precision
Definition: Of all the samples predicted as Positive, how many were actually Positive?
Formula:
                 TP
Precision = ─────────
             TP + FP
Example:
Precision = 4 / (4+1) = 4/5 = 0.80 = 80%
High Precision needed when: False Positives are costly.
- Example: Cancer treatment - don't want to treat healthy people.
4.5 Recall (Sensitivity / True Positive Rate)
Definition: Of all actual Positive samples, how many did the model correctly find?
Formula:
              TP
Recall = ─────────
          TP + FN
Example:
Recall = 4 / (4+2) = 4/6 ≈ 0.67 = 67%
High Recall needed when: False Negatives are costly.
- Example: Disease detection - don't want to miss actual sick patients.
4.6 Precision vs Recall Trade-off
High Precision ◄──────────────────────► High Recall
      │                                     │
   Few FP                                Few FN
(don't cry wolf)                   (catch everything)
      │                                     │
  Spam filter                       Cancer screening
(avoid blocking                     (avoid missing
 good emails)                        sick patients)
- Increasing threshold → Higher Precision, Lower Recall.
- Decreasing threshold → Higher Recall, Lower Precision.
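A small threshold sweep over made-up classifier scores shows the trade-off directly:

```python
# Hypothetical scores and true labels (made up for illustration)
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]   # 1 = actual positive

def precision_recall(threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    pred = [int(s >= threshold) for s in scores]
    tp = sum(p and l for p, l in zip(pred, labels))
    fp = sum(p and not l for p, l in zip(pred, labels))
    fn = sum((not p) and l for p, l in zip(pred, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.25, 0.50, 0.85):
    print(t, precision_recall(t))  # precision rises, recall falls as t grows
```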
4.7 F1-Score
Definition:
F1-Score is the harmonic mean of Precision and Recall. It balances both when you need a single metric.
Formula:
            2 × Precision × Recall
F1-Score = ────────────────────────
              Precision + Recall
Example:
F1 = 2 × 0.80 × 0.67 / (0.80 + 0.67)
   = 1.072 / 1.47
   ≈ 0.729 = 72.9%
(With the exact Recall of 2/3 instead of the rounded 0.67, F1 = 8/11 ≈ 0.727.)
When to use F1: When the dataset is imbalanced and you need to balance Precision and Recall.
Summary of all 4 metrics:
| Metric | Formula | Meaning |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Overall correctness |
| Precision | TP/(TP+FP) | Quality of positive predictions |
| Recall | TP/(TP+FN) | Coverage of actual positives |
| F1-Score | 2×P×R/(P+R) | Balance of Precision and Recall |
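All four metrics for the running email example (TP=4, FN=2, FP=1, TN=3) take only a few lines:

```python
tp, fn, fp, tn = 4, 2, 1, 3

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy)   # 0.7
print(precision)  # 0.8
print(recall)     # 0.666...
print(f1)         # 0.727... (8/11)
```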
4.8 ROC Curve and AUC
ROC Curve
Definition:
The ROC Curve (Receiver Operating Characteristic Curve) is a graph that shows the trade-off between:
- True Positive Rate (TPR / Recall) on the Y-axis.
- False Positive Rate (FPR) on the X-axis.
As the classification threshold changes from 0 to 1, both TPR and FPR change, tracing out the curve.
Formulas:
           TP
TPR = ───────────   (= Recall = Sensitivity)
        TP + FN

           FP
FPR = ───────────   (= 1 - Specificity)
        FP + TN
How to Plot ROC Curve:
- Start with threshold = 1 (predict all Negative) → TPR=0, FPR=0.
- Gradually lower the threshold.
- More positives are predicted → TPR increases, FPR also increases.
- At threshold = 0 (predict all Positive) → TPR=1, FPR=1.
Diagram:
  TPR
  1 │           ╭──────────
    │       ╭───╯   ← Good model (curves up-left)
    │    ╭──╯
    │ ╭──╯
  0 │╯
    └────────────────────── FPR
    0                    1
Diagonal line = Random classifier (AUC = 0.5)
Top-left area = Best classifier (AUC = 1.0)
AUC (Area Under the ROC Curve)
Definition:
AUC is the area under the ROC curve. It measures the overall ability of the model to discriminate between classes.
Interpretation:
| AUC Value | Meaning |
|---|---|
| AUC = 1.0 | Perfect classifier |
| AUC = 0.9–1.0 | Excellent |
| AUC = 0.8–0.9 | Good |
| AUC = 0.7–0.8 | Fair |
| AUC = 0.6–0.7 | Poor |
| AUC = 0.5 | Random guessing (no better than coin flip) |
| AUC < 0.5 | Worse than random |
Advantages of ROC-AUC:
- Works well with imbalanced datasets.
- Threshold-independent - evaluates the model across all thresholds.
- Single number summary of classifier performance.
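AUC can be computed by tracing the (FPR, TPR) points as the threshold drops and integrating with the trapezoidal rule. A minimal sketch (scores and labels are made up, and ties between scores are not handled):

```python
def roc_auc(scores, labels):
    """Trace the ROC curve point by point (descending scores, assumed
    distinct) and integrate the area with the trapezoidal rule."""
    pos = sum(labels)
    neg = len(labels) - pos
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]          # start: threshold above every score
    for _, label in pairs:
        if label:
            tp += 1                # lowering threshold caught a positive
        else:
            fp += 1                # ...or admitted a false positive
        points.append((fp / neg, tp / pos))
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1,   1,   0,   1,   0,   0]
print(roc_auc(scores, labels))  # 8/9 ≈ 0.889 - a "good" classifier
```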
4.9 Median Absolute Deviation (MAD)
Definition:
MAD is a robust measure of variability in predictions. It measures the median of the absolute differences between each prediction and the median prediction.
Steps:
Step 1: Find the median of all predictions/data: M = median(y)
Step 2: Calculate absolute deviations:
|y₁ - M|, |y₂ - M|, ..., |yₘ - M|
Step 3: MAD = median of those absolute deviations
Formula:
MAD = median( |yᵢ - median(y)| )
Example:
Data: [2, 3, 5, 7, 11]
Step 1: Median = 5
Step 2: Absolute deviations:
|2-5| = 3
|3-5| = 2
|5-5| = 0
|7-5| = 2
|11-5| = 6
Sorted deviations: [0, 2, 2, 3, 6]
Step 3: MAD = median([0,2,2,3,6]) = 2
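The three steps translate directly into Python using only the standard library:

```python
import statistics

def mad(values):
    """Median Absolute Deviation: median of |value - median(values)|."""
    m = statistics.median(values)
    return statistics.median(abs(v - m) for v in values)

print(mad([2, 3, 5, 7, 11]))         # 2, matching the worked example
# Robustness check: one extreme outlier barely moves MAD...
print(mad([2, 3, 5, 7, 11, 1000]))   # 3.5
# ...while the standard deviation explodes
print(statistics.stdev([2, 3, 5, 7, 11, 1000]))
```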
Why MAD over Standard Deviation?
- Standard deviation is sensitive to outliers (uses squared differences).
- MAD is robust - outliers don't distort it.
Applications:
- Regression model error analysis
- Outlier detection
- Robust statistics
4.10 Distribution of Errors
Definition:
Distribution of errors (also called residuals) refers to analyzing the pattern of prediction errors made by an ML model.
Error / Residual = Actual Value (y) - Predicted Value (ŷ)
eᵢ = yᵢ - ŷᵢ
Why Analyze Error Distribution?
- Tells us if the model is making systematic mistakes or random ones.
- Helps diagnose model problems.
Common Error Plots:
- Residual Plot: Plot errors vs predicted values. Good model = random scatter around zero.
  Error
    │   •       •
    │       •        ← Good (random scatter)
  ──┼──────────────── Predicted value
    │     •    •
- Histogram of Errors: Should be normally distributed (bell curve) with mean ≈ 0.
  Count
    │       ╭───╮
    │   ╭───╯   ╰───╮
    │ ──╯           ╰──
    └───────────────────── Error
              0
- Q-Q Plot: Compares error distribution to normal distribution.
Key Properties of a Well-Behaved Error Distribution:
| Property | Meaning |
|---|---|
| Mean ≈ 0 | No systematic bias |
| Normal shape | Errors are random, not patterned |
| Constant variance | Homoscedasticity (equal spread) |
| No autocorrelation | Errors are independent of each other |
Common Error Metrics for Regression:
| Metric | Formula | Meaning |
|---|---|---|
| MSE | (1/m) Σ(yᵢ - ŷᵢ)² | Mean Squared Error - penalizes big errors |
| RMSE | √MSE | Root MSE - same unit as output |
| MAE | (1/m) Σ|yᵢ - ŷᵢ| | Mean Absolute Error - robust to outliers |
| MAD | median(|yᵢ - median(y)|) | Median Absolute Deviation - most robust |
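A quick sketch computing all four metrics on made-up actual/predicted values; note that MAD here is taken over the residuals themselves:

```python
import math
import statistics

# Toy actual vs predicted values (made up)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 7.0]
errors = [t - p for t, p in zip(y_true, y_pred)]  # residuals e = y - ŷ

mse  = sum(e ** 2 for e in errors) / len(errors)
rmse = math.sqrt(mse)
mae  = sum(abs(e) for e in errors) / len(errors)
med  = statistics.median(errors)
mad  = statistics.median(abs(e - med) for e in errors)

print(mse, rmse, mae, mad)  # 1.3125, ≈1.146, 0.875, 0.75
```

MSE is pulled up by the single large residual (2.0), while MAE and MAD stay small - the ranking of robustness the table describes.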
Quick Revision Points
K-Means:
- Choose k → Initialize centroids → Assign points → Update centroids → Repeat.
- Distance: Euclidean.
- Best k: Elbow method (plot k vs WCSS).
- Problem: sensitive to initialization and outliers. Fix: K-Means++.
Ensemble Methods:
| | Bagging | Boosting | Random Forest |
|---|---|---|---|
| Training | Parallel | Sequential | Parallel |
| Reduces | Variance | Bias + Variance | Variance |
| Key param | B (trees) | T, learning rate | T, m (features) |
| Example | Bagged Trees | AdaBoost, XGBoost | Random Forest |
Evaluation Metrics:
Accuracy  = (TP+TN) / (TP+TN+FP+FN)
Precision = TP / (TP+FP)
Recall    = TP / (TP+FN)
F1-Score  = 2×P×R / (P+R)
TPR = TP / (TP+FN)   [= Recall]
FPR = FP / (FP+TN)
MAD = median(|yᵢ - median(y)|)
ROC-AUC:
- ROC plots TPR vs FPR at different thresholds.
- AUC = area under ROC curve.
- AUC = 1.0 → perfect; AUC = 0.5 → random.
Error Distribution:
- Good model: errors are normally distributed, mean ≈ 0, random scatter.
- MAD is more robust than standard deviation for error analysis.
Expected Exam Questions
PYQs will be added after analysis - check back soon.
These notes were compiled by Deepak Modi
Last updated: May 2026