Syllabus:
Supervised Learning: Definition, how it works. Types of Supervised learning algorithms: k-Nearest Neighbours, Naïve Bayes, Decision Trees, Linear Regression, Logistic Regression, Support Vector Machines.
PYQ Analysis for Unit 3
PYQs will be added after analysis; check back soon.
Section 1: Supervised Learning - Overview
1.1 Definition
Supervised Learning is a type of ML where the model is trained on a labeled dataset: every input (X) has a known correct output (Y). The model learns to map X → Y, and then uses that mapping to predict Y for new, unseen inputs.
How It Works:
Training Phase:
Labeled Data (X, Y) ──► ML Algorithm ──► Trained Model

Prediction Phase:
New Input (X_new) ──► Trained Model ──► Predicted Output (Ŷ)
1.2 Two Types of Supervised Learning
| Type | Output | Example |
|---|---|---|
| Classification | Discrete class label | Spam / Not Spam |
| Regression | Continuous numeric value | House price prediction |
1.3 General Workflow
- Collect labeled data.
- Split into Training set and Test set (e.g., 80/20 split).
- Choose an algorithm.
- Train the model on the training set.
- Evaluate on the test set.
- Tune and deploy.
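The workflow above can be sketched end to end in a few lines of Python. This is a minimal illustration using scikit-learn; the Iris dataset, the 3-neighbour classifier, and the 80/20 split are illustrative choices, not part of the syllabus.

```python
# Minimal sketch of the supervised-learning workflow (illustrative dataset and model).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)              # Step 1: labeled data (X, Y)

# Step 2: split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=3)    # Step 3: choose an algorithm
model.fit(X_train, y_train)                    # Step 4: train on the training set

y_pred = model.predict(X_test)                 # Step 5: evaluate on the test set
print("Test accuracy:", accuracy_score(y_test, y_pred))
```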
Section 2: k-Nearest Neighbours (k-NN)
2.1 What is k-NN?
Definition:
k-Nearest Neighbours (k-NN) is a simple, non-parametric supervised learning algorithm. To classify a new data point, it looks at the k closest training points (neighbours) and assigns the majority class among them.
k-NN is called a lazy learner: it does not build an explicit model during training. It memorizes the entire training dataset.
Simple Analogy:
To guess which neighbourhood a house belongs to, you look at the 3 nearest houses (k=3). If 2 of 3 are in "Area A", you classify the house as "Area A".
2.2 How k-NN Works
Algorithm Steps:
1. Store all training data points.
2. For a new input point X_new:
a. Calculate distance from X_new to every training point.
b. Sort distances in ascending order.
c. Select the top k nearest neighbours.
d. Count the class labels of the k neighbours.
e. Assign the majority class as the prediction.
Diagram:
      ●   ●           Legend:
    ●   ●             ● = Class A
      ◆  ← new point  ▲ = Class B
    ▲   ▲             ◆ = query point
      ▲    ▲

k=3 nearest neighbours: ●, ●, ▲
Majority = ● → Predict Class A
2.3 Distance Metrics
The "closeness" between two points is measured by a distance function.
Euclidean Distance (most common):
d(A, B) = √[(a₁-b₁)² + (a₂-b₂)² + ... + (aₙ-bₙ)²]
For 2D:
d(A, B) = √[(a₁-b₁)² + (a₂-b₂)²]
Manhattan Distance:
d(A, B) = |a₁-b₁| + |a₂-b₂| + ... + |aₙ-bₙ|
Minkowski Distance (generalizes both):
d(A, B) = [Σ|aᵢ-bᵢ|ᵖ]^(1/p)
p=1 → Manhattan
p=2 → Euclidean
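These metrics are easy to sketch with NumPy; the two example points below are made up for illustration.

```python
# Sketch of the three distance metrics above for two points given as NumPy arrays.
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))           # √[Σ(aᵢ-bᵢ)²]

def manhattan(a, b):
    return np.sum(np.abs(a - b))                   # Σ|aᵢ-bᵢ|

def minkowski(a, b, p):
    return np.sum(np.abs(a - b) ** p) ** (1 / p)   # p=1 → Manhattan, p=2 → Euclidean

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b))     # 5.0
print(manhattan(a, b))     # 7.0
print(minkowski(a, b, 2))  # 5.0 (matches Euclidean)
```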
2.4 Worked Example
Dataset:
| Point | x₁ | x₂ | Class |
|---|---|---|---|
| A | 1 | 2 | C1 |
| B | 2 | 3 | C1 |
| C | 3 | 1 | C2 |
| D | 5 | 4 | C2 |
Query point: Q = (3, 3), k = 3
Calculate Euclidean distances from Q to each point:
d(Q, A) = √[(3-1)² + (3-2)²] = √[4+1] = √5 ≈ 2.24
d(Q, B) = √[(3-2)² + (3-3)²] = √[1+0] = √1 = 1.00
d(Q, C) = √[(3-3)² + (3-1)²] = √[0+4] = √4 = 2.00
d(Q, D) = √[(3-5)² + (3-4)²] = √[4+1] = √5 ≈ 2.24
Sort and pick k=3 nearest:
1. B → distance 1.00 → Class C1
2. C → distance 2.00 → Class C2
3. A → distance 2.24 → Class C1 (tie broken by order)
Vote: C1: 2, C2: 1 → Predict Class C1
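The same calculation can be scripted in a few lines of plain Python; this sketch reproduces the table above (the tie between A and D is broken by sort order, as in the worked example).

```python
# Reproducing the k-NN worked example: points A–D, query Q=(3,3), k=3.
from collections import Counter
import math

train = {"A": ((1, 2), "C1"), "B": ((2, 3), "C1"),
         "C": ((3, 1), "C2"), "D": ((5, 4), "C2")}
q, k = (3, 3), 3

# Distance from Q to every training point, sorted ascending (ties broken by point name)
dists = sorted((math.dist(q, p), name, label) for name, (p, label) in train.items())
neighbours = dists[:k]                              # B (1.00), C (2.00), A (2.24)
vote = Counter(label for _, _, label in neighbours)

print(neighbours)
print("Prediction:", vote.most_common(1)[0][0])     # C1
```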
2.5 Choosing k
| k value | Effect |
|---|---|
| k=1 | Very sensitive to noise, can overfit |
| k=large | Smoother boundary, may underfit |
| Odd k | Avoids ties in binary classification |
| Best practice | Use cross-validation to find optimal k |
2.6 Advantages and Disadvantages
Advantages:
✅ Simple, easy to understand and implement.
✅ No training phase (lazy learner).
✅ Naturally handles multi-class problems.
✅ Adapts automatically to new training data.
Disadvantages:
❌ Slow at prediction time (computes all distances).
❌ High memory usage (stores entire training set).
❌ Sensitive to irrelevant features and scale.
❌ Struggles with high-dimensional data.
Applications:
- Recommendation systems
- Image recognition
- Medical diagnosis
- Anomaly detection
Section 3: Naïve Bayes
3.1 What is Naïve Bayes?
Definition:
Naïve Bayes is a probabilistic classification algorithm based on Bayes' Theorem. It assumes that all features are independent of each other given the class; this is the "naïve" assumption.
3.2 Bayes' Theorem
Formula:
            P(X | C) × P(C)
P(C | X) = ─────────────────
                 P(X)
Where:
P(C | X) = Posterior: probability of class C given features X
P(X | C) = Likelihood: probability of seeing X given class C
P(C) = Prior: probability of class C in training data
P(X) = Evidence: probability of seeing X (normalizing constant)
For Classification (we compare classes, so P(X) cancels):
P(C | X) ∝ P(X | C) × P(C)
Pick class C that maximizes:
P(C | X₁, X₂, ..., Xₙ)
Naïve independence assumption:
P(X₁, X₂, ..., Xₙ | C) = P(X₁|C) × P(X₂|C) × ... × P(Xₙ|C)
3.3 How Naïve Bayes Works
Steps:
1. Training:
a. Calculate P(C) for each class (prior).
b. For each feature and each class, calculate P(Xᵢ | C) (likelihood).
2. Prediction (for new point X):
a. For each class C, compute:
Score(C) = P(C) × P(X₁|C) × P(X₂|C) × ... × P(Xₙ|C)
b. Predict class with highest Score.
3.4 Worked Example
Dataset - Play Tennis:
| Day | Outlook | Humidity | Wind | Play? |
|---|---|---|---|---|
| 1 | Sunny | High | Weak | No |
| 2 | Sunny | High | Strong | No |
| 3 | Overcast | High | Weak | Yes |
| 4 | Rain | Normal | Weak | Yes |
| 5 | Rain | Normal | Strong | No |
| 6 | Overcast | Normal | Strong | Yes |
| 7 | Sunny | Normal | Weak | Yes |
Predict: Outlook=Sunny, Humidity=Normal, Wind=Weak → Play?
Step 1: Prior probabilities
P(Yes) = 4/7 ≈ 0.571
P(No)  = 3/7 ≈ 0.429
Step 2: Likelihoods
For Yes (4 examples):
P(Outlook=Sunny | Yes) = 1/4 = 0.25
P(Humidity=Normal| Yes) = 3/4 = 0.75
P(Wind=Weak | Yes) = 3/4 = 0.75
For No (3 examples):
P(Outlook=Sunny   | No) = 1/3... see below: 2/3 ≈ 0.67
P(Humidity=Normal | No) = 1/3 ≈ 0.33
P(Wind=Weak       | No) = 1/3 ≈ 0.33
Step 3: Posterior scores
Score(Yes) = P(Yes) × 0.25 × 0.75 × 0.75
           = 0.571 × 0.1406
           ≈ 0.0803
Score(No)  = P(No) × 0.67 × 0.33 × 0.33
           = 0.429 × 0.0729
           ≈ 0.0313
Prediction: Yes (0.0803 > 0.0313) ✅
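These scores can be recomputed directly from the 7-row table above with plain counting; the code below is a small sketch, and its output matches the hand calculation up to rounding.

```python
# Reproducing the Play-Tennis Naïve Bayes example with plain Python counts.
from collections import Counter

data = [  # (Outlook, Humidity, Wind, Play)
    ("Sunny", "High", "Weak", "No"), ("Sunny", "High", "Strong", "No"),
    ("Overcast", "High", "Weak", "Yes"), ("Rain", "Normal", "Weak", "Yes"),
    ("Rain", "Normal", "Strong", "No"), ("Overcast", "Normal", "Strong", "Yes"),
    ("Sunny", "Normal", "Weak", "Yes"),
]
query = ("Sunny", "Normal", "Weak")
class_counts = Counter(row[-1] for row in data)

scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(data)                      # prior P(C)
    for i, value in enumerate(query):            # likelihoods P(Xi | C)
        n_match = sum(1 for row in data if row[-1] == c and row[i] == value)
        score *= n_match / n_c
    scores[c] = score

print(scores)                                    # {'No': ≈0.032, 'Yes': ≈0.080}
print("Prediction:", max(scores, key=scores.get))  # Yes
```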
3.5 Laplace Smoothing
Problem: If any P(Xᵢ|C) = 0 (feature value never seen with that class in training), the whole product = 0.
Solution - Laplace Smoothing (add-1 smoothing):

             count(Xᵢ, C) + 1
P(Xᵢ | C) = ───────────────────
             count(C) + |vocab|
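A tiny numeric sketch of the smoothed estimate. The counts below are hypothetical (a feature value never observed with class No), chosen only to show how add-1 smoothing avoids a zero probability.

```python
# Sketch of add-1 (Laplace) smoothing for one conditional probability.
count_xi_c = 0   # hypothetical count(Xi=value, C=No): value never seen with class No
count_c    = 3   # count(C=No)
n_values   = 3   # |vocab|: number of possible values of this feature

p_unsmoothed = count_xi_c / count_c                      # 0.0 → would kill the whole product
p_smoothed   = (count_xi_c + 1) / (count_c + n_values)   # 1/6 ≈ 0.167, never zero
print(p_unsmoothed, p_smoothed)
```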
3.6 Types of Naïve Bayes
| Type | Feature Distribution | Use Case |
|---|---|---|
| Gaussian NB | Continuous, normal distribution | Iris flower classification |
| Multinomial NB | Discrete counts | Text classification |
| Bernoulli NB | Binary features | Spam detection |
3.7 Advantages and Disadvantages
Advantages:
✅ Very fast: simple multiplication of probabilities.
✅ Works well with small datasets.
✅ Handles high-dimensional data (text classification).
✅ Naturally handles multi-class problems.
Disadvantages:
❌ Naïve independence assumption is rarely true in practice.
❌ Poor estimator of probabilities (scores, not calibrated).
❌ Struggles with feature interactions.
Applications:
- Spam filtering
- Sentiment analysis
- Document classification
- Medical diagnosis
Section 4: Decision Trees
4.1 What is a Decision Tree?
Definition:
A Decision Tree is a tree-structured model where:
- Each internal node represents a test on a feature.
- Each branch represents the outcome of the test.
- Each leaf node represents a class label (classification) or a value (regression).
Simple Analogy:
Think of a decision tree like a flowchart or a 20-questions game β at each step you ask a yes/no question and follow the appropriate branch until you reach an answer.
4.2 Structure of a Decision Tree
                 [Outlook?]                        ← Root Node
               /      |      \
          Sunny    Overcast   Rain
             /        |         \
     [Humidity?]   [Yes ✅]    [Wind?]             ← Internal Nodes
       /     \                 /     \
    High    Normal         Strong    Weak
     /         \              |        |
 [No ❌]    [Yes ✅]       [No ❌]   [Yes ✅]       ← Leaf Nodes
4.3 How to Build a Decision Tree β Key Concepts
Entropy (Measure of Impurity)
Definition: Entropy measures the impurity or disorder in a dataset. A pure node (all same class) has entropy = 0.
Formula:
          c
H(S) = - Σ pᵢ × log₂(pᵢ)
         i=1
Where:
S = dataset
c = number of classes
pᵢ = proportion of class i in S
Examples:
Pure node (all Yes): H = -(1×log₂1) = 0
50-50 split: H = -(0.5×log₂0.5 + 0.5×log₂0.5) = 1
Information Gain (Choosing the Best Feature)
Definition: Information Gain measures how much a feature reduces entropy (disorder). The feature with the highest information gain is chosen as the splitting node.
Formula:
IG(S, A) = H(S) - Σ [ |Sᵥ|/|S| × H(Sᵥ) ]
                 v∈A
Where:
A = feature being tested
Sᵥ = subset where feature A = value v
Gini Impurity (Alternative to Entropy)
Formula:
            c
Gini = 1 -  Σ pᵢ²
           i=1
Pure node: Gini = 0
50-50: Gini = 1 - (0.5² + 0.5²) = 0.5
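Both impurity measures are easy to compute from a list of class labels; a minimal Python sketch (the label lists are illustrative):

```python
# Entropy and Gini impurity for a list of class labels.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(entropy(["Yes"] * 4))          # 0.0  (pure node)
print(entropy(["Yes", "No"] * 2))    # 1.0  (50-50 split)
print(gini(["Yes", "No"] * 2))       # 0.5
```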
4.4 ID3 Algorithm
ID3 (Iterative Dichotomiser 3) is the classic decision tree building algorithm using Information Gain.
Steps:
1. If all examples have the same class → return a leaf node with that class.
2. If no features are left → return a leaf with the majority class.
3. Else:
a. Calculate Information Gain for each feature.
b. Select feature with highest IG as root.
c. For each value of that feature:
- Create a sub-branch.
- Recursively apply steps 1-3 on the subset (see the sketch below).
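A compact Python sketch of this recursion, assuming categorical features and rows stored as dicts. The helper names (`entropy`, `info_gain`, `id3`) are hypothetical, not a library API; the four example rows are taken from the Play-Tennis table in Section 3.4.

```python
# Minimal ID3 sketch for categorical features (rows are dicts, target is the label key).
import math
from collections import Counter

def entropy(labels):  # same impurity measure as in Section 4.3
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature, target):
    gain = entropy([r[target] for r in rows])
    for v in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def id3(rows, features, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                      # Step 1: pure node → leaf
        return labels[0]
    if not features:                               # Step 2: no features left → majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: info_gain(rows, f, target))   # Steps 3a-3b
    tree = {best: {}}
    for v in set(r[best] for r in rows):           # Step 3c: one branch per value, recurse
        subset = [r for r in rows if r[best] == v]
        tree[best][v] = id3(subset, [f for f in features if f != best], target)
    return tree

rows = [
    {"Outlook": "Sunny",    "Humidity": "High",   "Wind": "Weak",   "Play": "No"},
    {"Outlook": "Overcast", "Humidity": "High",   "Wind": "Weak",   "Play": "Yes"},
    {"Outlook": "Rain",     "Humidity": "Normal", "Wind": "Strong", "Play": "No"},
    {"Outlook": "Rain",     "Humidity": "Normal", "Wind": "Weak",   "Play": "Yes"},
]
print(id3(rows, ["Outlook", "Humidity", "Wind"], "Play"))
```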
4.5 Worked Example - Entropy & Info Gain
Dataset (14 examples, 9 Yes, 5 No):
Outlook: Sunny(5): 2Yes, 3No
Overcast(4): 4Yes, 0No
Rain(5): 3Yes, 2No
Step 1: Entropy of full dataset S:
H(S) = -(9/14)×log₂(9/14) - (5/14)×log₂(5/14)
     = -(0.643×(-0.637)) - (0.357×(-1.485))
     = 0.410 + 0.530
     = 0.940
Step 2: Entropy of each subset for Outlook:
H(Sunny) = -(2/5)log₂(2/5) - (3/5)log₂(3/5)
         = -(0.4×(-1.322)) - (0.6×(-0.737))
         = 0.529 + 0.442 = 0.971
H(Overcast) = -(4/4)log₂(4/4) = -(1×0) = 0.000 (pure node)
H(Rain) = -(3/5)log₂(3/5) - (2/5)log₂(2/5)
        = 0.971
Step 3: Information Gain for Outlook:
IG(S, Outlook) = H(S) - [5/14×H(Sunny) + 4/14×H(Overcast) + 5/14×H(Rain)]
               = 0.940 - [5/14×0.971 + 4/14×0 + 5/14×0.971]
               = 0.940 - [0.347 + 0 + 0.347]
               = 0.940 - 0.694
               = 0.246
If Outlook has the highest IG among all features → it becomes the root node.
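The same numbers can be checked in a few lines of Python; the tiny difference (0.247 vs 0.246) comes from carrying full precision instead of the rounded intermediates above.

```python
# Recomputing H(S) and IG(S, Outlook) for the 14-example dataset (9 Yes, 5 No).
import math

def entropy_from_counts(yes, no):
    H = 0.0
    for c in (yes, no):
        if c:                          # 0·log2(0) is treated as 0
            p = c / (yes + no)
            H -= p * math.log2(p)
    return H

H_S = entropy_from_counts(9, 5)                                     # ≈ 0.940
subsets = {"Sunny": (2, 3), "Overcast": (4, 0), "Rain": (3, 2)}

weighted = sum((y + n) / 14 * entropy_from_counts(y, n) for y, n in subsets.values())
IG_outlook = H_S - weighted
print(round(H_S, 3), round(IG_outlook, 3))   # 0.94 0.247 (0.246 with rounded intermediates)
```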
4.6 Overfitting and Pruning
- A decision tree can grow too deep and overfit (memorize training data).
- Pruning cuts unnecessary branches:
- Pre-pruning: Stop growing when improvement is small.
- Post-pruning: Grow full tree, then remove low-value branches.
4.7 Advantages and Disadvantages
Advantages:
✅ Easy to understand and visualize.
✅ No need for feature scaling.
✅ Handles both numerical and categorical data.
✅ Implicitly performs feature selection.
Disadvantages:
❌ Prone to overfitting (especially deep trees).
❌ Unstable: small changes in data can change the tree.
❌ Biased toward features with more values.
Applications:
- Medical diagnosis
- Credit risk scoring
- Fraud detection
- Customer segmentation
Section 5: Linear Regression
5.1 What is Linear Regression?
Definition:
Linear Regression is a supervised learning algorithm used for regression (predicting continuous values). It models the relationship between input features (X) and the output (Y) as a straight line (linear equation).
5.2 The Linear Equation
Simple Linear Regression (one feature):
Ŷ = w₀ + w₁X
Where:
Ŷ = predicted value
X = input feature
w₀ = intercept (bias): where the line crosses the Y-axis
w₁ = slope (weight): how much Y changes per unit of X
Multiple Linear Regression (n features):
Ŷ = w₀ + w₁X₁ + w₂X₂ + ... + wₙXₙ
Or in matrix form:
Ŷ = Xw
Diagram:
Y
│           /
│          /    ← Regression Line (Ŷ = w₀ + w₁X)
│         /  ●
│        /●
│      ●/
│    ● /
│     /●
└──────────────── X
5.3 Cost Function - Mean Squared Error (MSE)
We measure how well the line fits using Mean Squared Error:
        1    m
MSE = ─── ×  Σ (yᵢ - ŷᵢ)²
        m   i=1
Where:
yᵢ = actual value
ŷᵢ = predicted value
m = number of samples
The goal of training is to minimize MSE by finding the best values of w₀ and w₁.
5.4 Finding Optimal Weights
Method 1: Ordinary Least Squares (Closed-form)
Analytical solution:
w = (XᵀX)⁻¹ Xᵀy
Best for small datasets with few features.
Method 2: Gradient Descent (Iterative)
Repeat until convergence:
w₀ := w₀ - α × ∂MSE/∂w₀
w₁ := w₁ - α × ∂MSE/∂w₁
Where α = learning rate
Partial derivatives:
∂MSE/∂w₀ = (-2/m) Σ(yᵢ - ŷᵢ)
∂MSE/∂w₁ = (-2/m) Σ(yᵢ - ŷᵢ)×xᵢ
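A minimal gradient-descent sketch using these update rules. The learning rate and iteration count are illustrative choices; on the small dataset of the worked example in 5.5 below they converge to the same line as the closed-form solution.

```python
# Gradient descent for simple linear regression (illustrative hyperparameters).
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0])     # hours studied (from the worked example below)
y = np.array([50.0, 60.0, 70.0, 80.0]) # marks

w0, w1, alpha, m = 0.0, 0.0, 0.05, len(X)
for _ in range(10000):
    y_hat = w0 + w1 * X
    grad_w0 = (-2 / m) * np.sum(y - y_hat)         # ∂MSE/∂w0
    grad_w1 = (-2 / m) * np.sum((y - y_hat) * X)   # ∂MSE/∂w1
    w0 -= alpha * grad_w0
    w1 -= alpha * grad_w1

print(round(w0, 2), round(w1, 2))   # converges to ≈ 40.0 and 10.0
```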
5.5 Worked Example
Data:
| X (hours studied) | Y (marks) |
|---|---|
| 1 | 50 |
| 2 | 60 |
| 3 | 70 |
| 4 | 80 |
Using formulas:
n = 4
ΣX  = 1+2+3+4 = 10
ΣY  = 50+60+70+80 = 260
ΣXY = (1×50)+(2×60)+(3×70)+(4×80) = 50+120+210+320 = 700
ΣX² = 1+4+9+16 = 30
w₁ = (nΣXY - ΣXΣY) / (nΣX² - (ΣX)²)
   = (4×700 - 10×260) / (4×30 - 100)
   = (2800 - 2600) / (120 - 100)
   = 200 / 20
   = 10
w₀ = (ΣY - w₁ΣX) / n
   = (260 - 10×10) / 4
   = (260 - 100) / 4
   = 40
Equation: Ŷ = 40 + 10X
Predict for X=5: Ŷ = 40 + 10×5 = 90 marks
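The same result follows from a few lines of NumPy using the closed-form sums above:

```python
# Verifying the worked example with the closed-form least-squares formulas.
import numpy as np

X = np.array([1, 2, 3, 4], dtype=float)      # hours studied
y = np.array([50, 60, 70, 80], dtype=float)  # marks
n = len(X)

w1 = (n * np.sum(X * y) - np.sum(X) * np.sum(y)) / (n * np.sum(X**2) - np.sum(X)**2)
w0 = (np.sum(y) - w1 * np.sum(X)) / n
print(w0, w1)        # 40.0 10.0
print(w0 + w1 * 5)   # 90.0 marks for X=5
```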
5.6 Advantages and Disadvantages
Advantages:
✅ Simple and fast.
✅ Easily interpretable (slope tells effect of each feature).
✅ Works well when the relationship is truly linear.
Disadvantages:
❌ Assumes a linear relationship (fails on non-linear data).
❌ Sensitive to outliers.
❌ Assumes features are independent (multicollinearity hurts it).
Section 6: Logistic Regression
6.1 What is Logistic Regression?
Definition:
Logistic Regression is a supervised learning algorithm used for binary classification (output is 0 or 1). Despite the name, it is a classification algorithm, not regression.
It models the probability that an input belongs to a class using the sigmoid function.
6.2 The Sigmoid Function
            1
σ(z) = ──────────
        1 + e^(-z)
Where:
z = w₀ + w₁X₁ + w₂X₂ + ... = linear combination of inputs
Output: always between 0 and 1 (interpreted as a probability)
6.3 Decision Rule
     ⎧ 1 (Class 1)  if P(Y=1|X) ≥ 0.5   [i.e., z ≥ 0]
Ŷ = ⎨
     ⎩ 0 (Class 0)  if P(Y=1|X) < 0.5   [i.e., z < 0]
The decision boundary is the line where z = 0.
6.4 Cost Function - Log Loss (Binary Cross-Entropy)
J(w) = -(1/m) Σ [yᵢ log(ŷᵢ) + (1-yᵢ) log(1-ŷᵢ)]
Where:
yᵢ = actual label (0 or 1)
ŷᵢ = predicted probability
Weights are updated using gradient descent to minimize J(w).
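A small NumPy sketch of the sigmoid, the 0.5-threshold decision rule, and the log-loss above; the weights and data points are made-up numbers for illustration only.

```python
# Sigmoid, decision rule, and binary cross-entropy (illustrative weights and data).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w0, w1 = -4.0, 1.0                       # hypothetical trained weights
X = np.array([1.0, 3.0, 5.0, 7.0])
y = np.array([0, 0, 1, 1])               # true labels

p = sigmoid(w0 + w1 * X)                 # P(Y=1 | X)
y_hat = (p >= 0.5).astype(int)           # decision rule: threshold at 0.5 (i.e. z >= 0)

log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(p.round(3), y_hat, round(log_loss, 3))
```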
6.5 Linear Regression vs Logistic Regression
| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Task | Regression | Classification |
| Output | Continuous value | Probability (0 to 1) |
| Function | Linear (y = wx + b) | Sigmoid σ(z) |
| Loss | MSE | Log Loss |
| Decision | No threshold | Threshold at 0.5 |
| Example | Predict house price | Predict spam or not |
6.6 Advantages and Disadvantages
Advantages:
✅ Probabilistic output (useful for confidence scores).
✅ Fast to train and predict.
✅ Easy to interpret with feature weights.
✅ Works well for linearly separable data.
Disadvantages:
❌ Assumes a linear decision boundary.
❌ Fails on complex, non-linear problems.
❌ Sensitive to outliers and correlated features.
Applications:
- Email spam detection
- Disease prediction (diabetes, cancer risk)
- Credit approval
- Customer churn prediction
Section 7: Support Vector Machines (SVM)
7.1 What is SVM?
Definition:
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the best hyperplane (decision boundary) that maximally separates the two classes.
Key Idea: Don't just find any line that separates the classes; find the one with the largest margin (gap) between the classes.
7.2 Key Concepts
Hyperplane
A hyperplane is a decision boundary that separates two classes.
- In 2D: a line.
- In 3D: a plane.
- In n-D: a hyperplane.
Equation:
w·x + b = 0
Where:
w = weight vector (normal to hyperplane)
x = input feature vector
b = bias
Support Vectors
Support Vectors are the data points closest to the hyperplane (from each class). They are the critical points: they "support" or define the hyperplane.
Margin
The margin is the total distance between the two parallel boundary lines (one touching each class's closest points).
               Margin
            ◄─────────►
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ← Upper boundary (w·x + b = +1)
        ●
    ●       ●
─────────────────────────   ← Optimal Hyperplane (w·x + b = 0)
    ▲       ▲
        ▲
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   ← Lower boundary (w·x + b = -1)
● = Class +1    ▲ = Class -1
Margin width:
        2
M = ─────
    ||w||
SVM maximizes M → minimizes ||w||.
7.3 Hard Margin vs Soft Margin
| | Hard Margin SVM | Soft Margin SVM |
|---|---|---|
| Assumption | Data is perfectly separable | Allows some misclassifications |
| Constraint | All points outside margin | Some points can be inside margin |
| Parameter | None | C (penalty parameter) |
| Sensitivity | Very sensitive to outliers | More robust |
C parameter (regularization):
- High C → narrow margin, fewer errors on training data (risk of overfitting).
- Low C → wide margin, allows more errors (better generalization).
7.4 The Kernel Trick
Problem: Data is not always linearly separable in the original feature space.
Solution: The kernel trick maps data to a higher-dimensional space where it becomes separable, without explicitly computing the transformation.
Original 2D space                Higher-dimensional space
(not separable)                  (separable)

  ● ▲ ● ▲ ●                           ▲  ▲  ▲
  ▲ ● ▲ ● ▲        ─────►         ───────────────
  ● ▲ ● ▲ ●                         ●  ●  ●  ●
Common Kernels:
| Kernel | Formula | Use Case |
|---|---|---|
| Linear | K(x,y) = xᵀy | Linearly separable data |
| Polynomial | K(x,y) = (xᵀy + c)ᵈ | Polynomial boundaries |
| RBF / Gaussian | K(x,y) = exp(-γ‖x-y‖²) | Most common; complex non-linear boundaries |
| Sigmoid | K(x,y) = tanh(αxᵀy + c) | Neural network-like |
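A minimal scikit-learn sketch of a soft-margin SVM with the RBF kernel; the moons dataset and the C/gamma values are illustrative assumptions, not part of the notes.

```python
# Soft-margin SVM with an RBF kernel on a non-linearly separable toy dataset.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVM is sensitive to feature scale, so standardize before fitting
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)

print("Test accuracy:", clf.score(X_test, y_test))
print("Support vectors per class:", clf.named_steps["svc"].n_support_)
```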
7.5 SVM for Multi-Class
SVM is inherently binary. Two strategies to extend it:
- One-vs-One (OvO): Train one SVM per pair of classes.
- One-vs-All (OvA): Train one SVM per class vs all others.
7.6 Advantages and Disadvantages
Advantages:
✅ Works well with high-dimensional data.
✅ Effective when number of features > number of samples.
✅ Memory efficient (only support vectors matter).
✅ Kernel trick handles non-linear problems.
✅ Robust to overfitting (especially with soft margin).
Disadvantages:
❌ Slow on large datasets (quadratic programming).
❌ Sensitive to feature scaling (needs normalization).
❌ Choosing the right kernel and C is tricky.
❌ Hard to interpret compared to Decision Trees.
Applications:
- Image classification
- Text and document categorization
- Bioinformatics (protein classification)
- Face detection
Quick Revision Points
Algorithms at a Glance:
| Algorithm | Type | Key Idea | Key Formula |
|---|---|---|---|
| k-NN | Classification | Majority vote of k nearest neighbours | Euclidean distance |
| Naïve Bayes | Classification | Bayes theorem + independence assumption | P(C\|X) ∝ P(X\|C)P(C) |
| Decision Tree | Both | Split on highest info gain | IG = H(S) - ΣH(Sᵥ) |
| Linear Regression | Regression | Fit a line to minimize error | Ŷ = w₀ + w₁X |
| Logistic Regression | Classification | Sigmoid maps to probability | σ(z) = 1/(1+e^(-z)) |
| SVM | Both | Maximize margin between classes | Maximize 2/‖w‖ |
Key Formulas:
Euclidean Distance: d = √[Σ(aᵢ-bᵢ)²]
Entropy: H = -Σ pᵢ log₂(pᵢ)
Information Gain: IG = H(S) - Σ |Sᵥ|/|S| × H(Sᵥ)
Sigmoid: σ(z) = 1 / (1 + e^(-z))
SVM Margin: M = 2 / ||w||
Linear Regression: Ŷ = w₀ + w₁X
MSE: (1/m) Σ(yᵢ - ŷᵢ)²
Classification vs Regression:
| Algorithm | Classification | Regression |
|---|---|---|
| k-NN | β | β |
| NaΓ―ve Bayes | β | β |
| Decision Tree | β | β |
| Linear Regression | β | β |
| Logistic Regression | β | β |
| SVM | β | β (SVR) |
Expected Exam Questions
PYQs will be added after analysis; check back soon.
These notes were compiled by Deepak Modi
Last updated: May 2026