Syllabus:
Supervised Learning: Definition, how it works. Types of Supervised learning algorithms: k-Nearest Neighbours, NaΓ―ve Bayes, Decision Trees, Linear Regression, Logistic Regression, Support Vector Machines.
π― PYQ Analysis for Unit 3
High Priority Topics (15 marks questions)
- Decision Trees (Algorithm + Types) β (2024: 15 marks, 2023: 15 marks)
- Naive Bayes (Assumptions, Working, Why "Naive") β (2023: 15 marks, 2022: 7.5 marks)
- Support Vector Machine (Goal + Margin) β (2022: 15 marks)
- Supervised Learning (Definition + Working Principle) β (2024: 15 marks)
- Importance of Labelled Data + Decision Trees β (2024: 15 marks)
- Regression (with Example) β (2022: 7.5 marks)
Medium Priority Topics (Short answers)
- Support Vectors β 2023 (2.5 marks)
- K-NN as Lazy Learner β 2023 (2.5 marks)
- Underfitting β 2023 (2.5 marks)
- Logistic Regression β 2023 (2.5 marks)
- Linear Regression β 2022 (2.5 marks)
Section 1: Supervised Learning β Overview
PYQ: Define Supervised Learning and explain its working principle. Provide a step-by-step example of how a supervised learning algorithm processes training data and makes predictions. (2024, 15 marks)
PYQ: Discuss the importance of labelled data in Supervised Learning and its role in training predictive models. Also, explain Decision Trees by taking a suitable example. (2024, 15 marks)
PYQ: Underfitting. (2023, 2.5 marks)
1.1 Definition
Supervised Learning is a type of ML where the model is trained on a labeled dataset β every input (X) has a known correct output (Y). The model learns to map X β Y, and then uses that mapping to predict Y for new, unseen inputs.
How It Works:
Training Phase:
ββββββββββββββ
Labeled Data (X, Y) βββΊ ML Algorithm βββΊ Trained Model
Prediction Phase:
βββββββββββββββββ
New Input (X_new) βββΊ Trained Model βββΊ Predicted Output (ΕΆ)
1.2 Two Types of Supervised Learning
| Type | Output | Example |
|---|---|---|
| Classification | Discrete class label | Spam / Not Spam |
| Regression | Continuous numeric value | House price prediction |
1.3 General Workflow
- Collect labeled data.
- Split into Training set and Test set (e.g., 80/20 split).
- Choose an algorithm.
- Train the model on the training set.
- Evaluate on the test set.
- Tune and deploy.
1.4 Working Principle of Supervised Learning
PYQ: Explain the working principle of Supervised Learning. (2024, part of 15 marks)
The general workflow of a supervised learning algorithm consists of the following steps:
| Step | Description |
|---|---|
| 1. Data Collection | Gather labelled data (features + correct outputs) |
| 2. Data Preprocessing | Clean and prepare the data (handle missing values, encode categories, normalize if needed) |
| 3. Model Selection | Choose a suitable algorithm (e.g., Linear Regression, Decision Tree, SVM) |
| 4. Training | Feed the inputβoutput pairs to the model so it can learn the mapping between them |
| 5. Evaluation | Test the model on unseen data using metrics like accuracy, precision, recall, RMSE |
| 6. Prediction | Use the trained model to make predictions on completely new data |
1.5 Step-by-Step Example β House Price Prediction
PYQ: Provide a step-by-step example of how a supervised learning algorithm processes training data and makes predictions. (2024, part of 15 marks)
Let's walk through a complete supervised learning workflow with a regression example β predicting the price of a house based on its size (in square feet).
Step 1: Dataset (Labelled)
| House Size (sqft) | Price ($) |
|---|---|
| 1000 | 200,000 |
| 1500 | 300,000 |
| 2000 | 400,000 |
| 2500 | 500,000 |
Here:
- Feature (X): House Size
- Label (Y): Price
Step 2: Choose Model
Use Linear Regression, assuming a relationship of the form:
Price = m Γ Size + c
Step 3: Training
The algorithm computes the best-fit line by finding the optimal values of m (slope) and c (intercept) that minimize the prediction error.
For this dataset, it learns:
Price = 200 Γ Size
(That is, m = 200 and c = 0.)
Step 4: Prediction
For a new house with Size = 1800 sqft, the model predicts:
Price = 200 Γ 1800 = $360,000
Step 5: Evaluation
The trained model is tested on unseen data using error metrics:
- Mean Squared Error (MSE)
- RΒ² Score (for regression)
- Accuracy / Precision / Recall (for classification)
Key Algorithms in Supervised Learning
| Algorithm | Use Case |
|---|---|
| Linear Regression | Predict continuous values |
| Logistic Regression | Binary classification |
| Decision Tree | Both classification and regression |
| K-Nearest Neighbors (KNN) | Classification |
| Support Vector Machine (SVM) | Classification with clear margins |
| NaΓ―ve Bayes | Text classification |
Benefits of Supervised Learning
- β Clear objective due to labelled data
- β Accurate predictions with sufficient training data
- β Easy to evaluate using performance metrics
Limitations
- β Requires a large amount of labelled data (expensive to obtain)
- β Cannot handle classes/situations not seen during training
- β May overfit or underfit depending on model complexity
1.6 Importance of Labelled Data in Supervised Learning
PYQ: Discuss the importance of labelled data in Supervised Learning and its role in training predictive models. (2024, part of 15 marks)
Supervised learning cannot exist without labelled data β that is, every input must come with a corresponding correct output. Examples:
- Predicting house prices β label = price
- Classifying emails β label = spam / ham
Role of Labelled Data in Supervised Learning
| Role of Labelled Data | Explanation |
|---|---|
| Guides Model Learning | Helps the algorithm understand the mapping between input features and the desired output |
| Error Measurement | Enables computation of loss / error during training, which is then minimized through optimization |
| Model Validation | Used in test / validation sets to evaluate model accuracy and generalization |
| Improves Accuracy | More and more diverse labelled examples help the model learn patterns better |
| Supports Fine-Tuning | Allows parameter tuning based on labelled feedback during cross-validation |
Without labelled data, supervised learning models cannot learn predictive tasks β labels are the foundation of this learning paradigm.
Role of Labelled Data in Building Predictive Models
When building a predictive model, labelled data is essential in every phase:
- Training Phase β the model learns patterns from features paired with their correct outputs.
- Testing Phase β labelled test data is used to validate how well the model generalizes to unseen examples.
- Performance Metrics β labels enable calculation of accuracy, precision, recall, F1-score, etc.
- Supervised Feedback Loop β the model updates its parameters based on how close (or far) its predictions are from the actual labels.
Bottom line: Labelled data is the teacher in supervised learning β without it, the model has nothing to learn from and nothing to be measured against.
Section 2: k-Nearest Neighbours (k-NN)
PYQ: K-NN is called Lazy Learner. Explain it. (2023, 2.5 marks)
2.1 What is k-NN?
Definition:
k-Nearest Neighbours (k-NN) is a simple, non-parametric supervised learning algorithm. To classify a new data point, it looks at the k closest training points (neighbours) and assigns the majority class among them.
k-NN is called a lazy learner β it does not build an explicit model during training. It memorizes the entire training dataset.
Simple Analogy:
To guess which neighbourhood a house belongs to, you look at the 3 nearest houses (k=3). If 2 of 3 are in "Area A", you classify the house as "Area A".
2.2 How k-NN Works
Algorithm Steps:
1. Store all training data points.
2. For a new input point X_new:
a. Calculate distance from X_new to every training point.
b. Sort distances in ascending order.
c. Select the top k nearest neighbours.
d. Count the class labels of the k neighbours.
e. Assign the majority class as the prediction.
Diagram:
β β Legend:
β β β = Class A
β¦ β new point β = Class B
β β β¦ = query point
β β
k=3 nearest neighbours: β, β, β
Majority = β β Predict Class A
2.3 Distance Metrics
The "closeness" between two points is measured by a distance function.
Euclidean Distance (most common):
d(A, B) = β[(aβ-bβ)Β² + (aβ-bβ)Β² + ... + (aβ-bβ)Β²]
For 2D:
d(A, B) = β[(aβ-bβ)Β² + (aβ-bβ)Β²]
Manhattan Distance:
d(A, B) = |aβ-bβ| + |aβ-bβ| + ... + |aβ-bβ|
Minkowski Distance (generalizes both):
d(A, B) = [Ξ£|aα΅’-bα΅’|α΅]^(1/p)
p=1 β Manhattan
p=2 β Euclidean
2.4 Worked Example
Dataset:
| Point | xβ | xβ | Class |
|---|---|---|---|
| A | 1 | 2 | C1 |
| B | 2 | 3 | C1 |
| C | 3 | 1 | C2 |
| D | 5 | 4 | C2 |
Query point: Q = (3, 3), k = 3
Calculate Euclidean distances from Q to each point:
d(Q, A) = β[(3-1)Β² + (3-2)Β²] = β[4+1] = β5 β 2.24
d(Q, B) = β[(3-2)Β² + (3-3)Β²] = β[1+0] = β1 = 1.00
d(Q, C) = β[(3-3)Β² + (3-1)Β²] = β[0+4] = β4 = 2.00
d(Q, D) = β[(3-5)Β² + (3-4)Β²] = β[4+1] = β5 β 2.24
Sort and pick k=3 nearest:
1. B β distance 1.00 β Class C1
2. C β distance 2.00 β Class C2
3. A β distance 2.24 β Class C1 (tie broken by order)
Vote: C1: 2, C2: 1 β Predict Class C1
2.5 Choosing k
| k value | Effect |
|---|---|
| k=1 | Very sensitive to noise, can overfit |
| k=large | Smoother boundary, may underfit |
| Odd k | Avoids ties in binary classification |
| Best practice | Use cross-validation to find optimal k |
2.6 Advantages and Disadvantages
Advantages:
β
Simple, easy to understand and implement.
β
No training phase (lazy learner).
β
Naturally handles multi-class problems.
β
Adapts automatically to new training data.
Disadvantages:
β Slow at prediction time (computes all distances).
β High memory usage (stores entire training set).
β Sensitive to irrelevant features and scale.
β Struggles with high-dimensional data.
Applications:
- Recommendation systems
- Image recognition
- Medical diagnosis
- Anomaly detection
Section 3: NaΓ―ve Bayes
PYQ: Explain Naive Bayes Classifier. (2022, 7.5 marks)
PYQ: What are the basic assumptions used in Naive Bayes classifier? Why is it called Naive Bayes? Also discuss the working of Naive Bayes algorithm. (2023, 15 marks)
3.1 What is NaΓ―ve Bayes?
Definition:
NaΓ―ve Bayes is a probabilistic classification algorithm based on Bayes' Theorem. It assumes that all features are independent of each other given the class β this is the "naΓ―ve" assumption.
3.2 Why is it Called "NaΓ―ve Bayes"?
PYQ: Why is it called Naive Bayes? (2023, part of 15 marks)
The name NaΓ―ve Bayes has two parts β each with a clear meaning:
"NaΓ―ve" β because of the independence assumption
It is called naΓ―ve because it makes a strong (unrealistic) assumption that every pair of features is independent of each other, given the class label. In the real world, features are usually correlated β so this assumption is almost always violated, yet the algorithm still works surprisingly well in practice.
Example: To identify a fruit as an apple, the model checks features like colour = red, shape = round, taste = sweet β and assumes these features are completely independent of each other. In reality, "red" and "sweet" might be related, but the NaΓ―ve Bayes classifier ignores that link.
"Bayes" β because it uses Bayes' Theorem
It is called Bayes because the algorithm is built directly on the principle of Bayes' Theorem β a fundamental result in probability that lets us compute the probability of a hypothesis given some observed evidence.
3.3 Basic Assumptions of NaΓ―ve Bayes
PYQ: What are the basic assumptions used in Naive Bayes classifier? (2023, part of 15 marks)
The NaΓ―ve Bayes classifier makes the following fundamental assumptions about the data:
1. Feature Independence (the "naΓ―ve" assumption)
The features of the data are conditionally independent of each other, given the class label. Knowing the value of one feature gives no information about another feature within the same class.
2. Continuous Features are Normally Distributed
If a feature is continuous (numeric), it is assumed to follow a normal (Gaussian) distribution within each class. This is the basis for Gaussian NaΓ―ve Bayes.
3. Discrete Features Follow Multinomial Distribution
If a feature is discrete (e.g., word counts in text classification), it is assumed to follow a multinomial distribution within each class. This is the basis for Multinomial NaΓ―ve Bayes.
4. Features are Equally Important
All features are assumed to contribute equally to the prediction of the class label β there is no built-in feature weighting.
5. No Missing Data
The dataset is assumed to contain no missing values. Missing data must be imputed or removed before training.
3.4 Bayes' Theorem
Formula:
P(X | C) Γ P(C)
P(C | X) = ββββββββββββββββββ
P(X)
Where:
P(C | X) = Posterior β probability of class C given features X
P(X | C) = Likelihood β probability of seeing X given class C
P(C) = Prior β probability of class C in training data
P(X) = Evidence β probability of seeing X (normalizing constant)
For Classification (we compare classes, so P(X) cancels):
P(C | X) β P(X | C) Γ P(C)
Pick class C that maximizes:
P(C | Xβ, Xβ, ..., Xβ)
NaΓ―ve independence assumption:
P(Xβ, Xβ, ..., Xβ | C) = P(Xβ|C) Γ P(Xβ|C) Γ ... Γ P(Xβ|C)
3.5 How NaΓ―ve Bayes Works
Steps:
1. Training:
a. Calculate P(C) for each class (prior).
b. For each feature and each class, calculate P(Xα΅’ | C) (likelihood).
2. Prediction (for new point X):
a. For each class C, compute:
Score(C) = P(C) Γ P(Xβ|C) Γ P(Xβ|C) Γ ... Γ P(Xβ|C)
b. Predict class with highest Score.
3.6 Worked Example
Dataset β Play Tennis:
| Day | Outlook | Humidity | Wind | Play? |
|---|---|---|---|---|
| 1 | Sunny | High | Weak | No |
| 2 | Sunny | High | Strong | No |
| 3 | Overcast | High | Weak | Yes |
| 4 | Rain | Normal | Weak | Yes |
| 5 | Rain | Normal | Strong | No |
| 6 | Overcast | Normal | Strong | Yes |
| 7 | Sunny | Normal | Weak | Yes |
Predict: Outlook=Sunny, Humidity=Normal, Wind=Weak β Play?
Step 1: Prior probabilities
P(Yes) = 4/7 β 0.571
P(No) = 3/7 β 0.429
Step 2: Likelihoods
For Yes (4 examples):
P(Outlook=Sunny | Yes) = 1/4 = 0.25
P(Humidity=Normal| Yes) = 3/4 = 0.75
P(Wind=Weak | Yes) = 3/4 = 0.75
For No (3 examples):
P(Outlook=Sunny | No) = 2/3 β 0.67
P(Humidity=Normal| No) = 1/3 β 0.33
P(Wind=Weak | No) = 1/3 β 0.33
Step 3: Posterior scores
Score(Yes) = P(Yes) Γ 0.25 Γ 0.75 Γ 0.75
= 0.571 Γ 0.1406
β 0.0803
Score(No) = P(No) Γ 0.67 Γ 0.33 Γ 0.33
= 0.429 Γ 0.0729
β 0.0313
Prediction: Yes (0.0803 > 0.0313) β
3.7 Laplace Smoothing
Problem: If any P(Xα΅’|C) = 0 (feature never seen in training), the whole product = 0.
Solution β Laplace Smoothing (add-1 smoothing):
count(Xα΅’, C) + 1
P(Xα΅’ | C) = βββββββββββββββββββββ
count(C) + |vocab|
3.8 Types of NaΓ―ve Bayes
| Type | Feature Distribution | Use Case |
|---|---|---|
| Gaussian NB | Continuous, normal distribution | Iris flower classification |
| Multinomial NB | Discrete counts | Text classification |
| Bernoulli NB | Binary features | Spam detection |
3.9 Advantages and Disadvantages
Advantages:
β
Very fast β simple multiplication of probabilities.
β
Works well with small datasets.
β
Handles high-dimensional data (text classification).
β
Naturally handles multi-class problems.
Disadvantages:
β NaΓ―ve independence assumption is rarely true in practice.
β Poor estimator of probabilities (scores, not calibrated).
β Struggles with feature interactions.
Applications:
- Spam filtering
- Sentiment analysis
- Document classification
- Medical diagnosis
Section 4: Decision Trees
PYQ: Explain Decision Trees algorithm in ML and discuss its types. (2023, 15 marks)
PYQ: Explain Decision Trees by taking a suitable example. (2024, part of 15 marks)
4.1 What is a Decision Tree?
Definition:
A Decision Tree is a tree-structured model where:
- Each internal node represents a test on a feature.
- Each branch represents the outcome of the test.
- Each leaf node represents a class label (classification) or a value (regression).
Simple Analogy:
Think of a decision tree like a flowchart or a 20-questions game β at each step you ask a yes/no question and follow the appropriate branch until you reach an answer.
4.2 Structure of a Decision Tree
[Outlook?] β Root Node
/ | \
Sunny/ Overcast\ Rain\
/ | \
[Humidity?] [Yes β] [Wind?] β Internal Nodes
/ \ / \
High Normal Strong Weak
/ \ | |
[No β] [Yes β] [No β] [Yes β] β Leaf Nodes
4.3 How to Build a Decision Tree β Key Concepts
Entropy (Measure of Impurity)
Definition: Entropy measures the impurity or disorder in a dataset. A pure node (all same class) has entropy = 0.
Formula:
c
H(S) = - Ξ£ pα΅’ Γ logβ(pα΅’)
i=1
Where:
S = dataset
c = number of classes
pα΅’ = proportion of class i in S
Examples:
Pure node (all Yes): H = -(1Γlogβ1) = 0
50-50 split: H = -(0.5Γlogβ0.5 + 0.5Γlogβ0.5) = 1
Information Gain (Choosing the Best Feature)
Definition: Information Gain measures how much a feature reduces entropy (disorder). The feature with the highest information gain is chosen as the splitting node.
Formula:
IG(S, A) = H(S) - Ξ£ [ |Sα΅₯|/|S| Γ H(Sα΅₯) ]
vβA
Where:
A = feature being tested
Sα΅₯ = subset where feature A = value v
Gini Impurity (Alternative to Entropy)
Formula:
c
Gini = 1 - Ξ£ pα΅’Β²
i=1
Pure node: Gini = 0
50-50: Gini = 1 - (0.5Β² + 0.5Β²) = 0.5
4.4 ID3 Algorithm
ID3 (Iterative Dichotomiser 3) is the classic decision tree building algorithm using Information Gain.
Steps:
1. If all examples have same class β return leaf node with that class.
2. If no features left β return leaf with majority class.
3. Else:
a. Calculate Information Gain for each feature.
b. Select feature with highest IG as root.
c. For each value of that feature:
- Create a sub-branch.
- Recursively apply steps 1β3 on the subset.
4.5 Worked Example β Entropy & Info Gain
Dataset (14 examples, 9 Yes, 5 No):
Outlook: Sunny(5): 2Yes, 3No
Overcast(4): 4Yes, 0No
Rain(5): 3Yes, 2No
Step 1: Entropy of full dataset S:
H(S) = -(9/14)Γlogβ(9/14) - (5/14)Γlogβ(5/14)
= -(0.643Γ(-0.637)) - (0.357Γ(-1.485))
= 0.410 + 0.530
= 0.940
Step 2: Entropy of each subset for Outlook:
H(Sunny) = -(2/5)logβ(2/5) - (3/5)logβ(3/5)
= -(0.4Γ(-1.322)) - (0.6Γ(-0.737))
= 0.529 + 0.442 = 0.971
H(Overcast) = -(4/4)logβ(4/4) = -(1Γ0) = 0.000 (pure node)
H(Rain) = -(3/5)logβ(3/5) - (2/5)logβ(2/5)
= 0.971
Step 3: Information Gain for Outlook:
IG(S, Outlook) = H(S) - [5/14ΓH(Sunny) + 4/14ΓH(Overcast) + 5/14ΓH(Rain)]
= 0.940 - [5/14Γ0.971 + 4/14Γ0 + 5/14Γ0.971]
= 0.940 - [0.347 + 0 + 0.347]
= 0.940 - 0.694
= 0.246
If Outlook has the highest IG among all features β it becomes the root node.
4.6 Types of Decision Tree Algorithm
PYQ: Explain Decision Trees algorithm in ML and discuss its types. (2023, 15 marks)
Based on the type of target variable they predict, decision trees fall into two main categories:
(1) Classification Tree
A Classification Tree is used to classify data into discrete classes or categories.
- Target variable: categorical (e.g., Yes/No, Spam/Not Spam, Cat/Dog/Bird).
- Splitting criterion: Information Gain (entropy) or Gini Impurity β picks the feature that produces the purest class split.
- Leaf node output: a class label (the majority class in that leaf).
- Examples:
- Email β Spam or Not Spam
- Tumour β Malignant or Benign
- Loan applicant β Approve or Reject
[Outlook?]
/ | \
Sunny Overcast Rain
β β β
[No] [Yes] [Yes] β leaf = class label
(2) Regression Tree
A Regression Tree is used to predict a continuous numerical value instead of a class.
- Target variable: continuous (e.g., price, temperature, age).
- Splitting criterion: Variance reduction (or Mean Squared Error) β picks the feature that minimizes prediction error in the children.
- Leaf node output: a numerical value (typically the mean of all training samples in that leaf).
- Examples:
- Predict house price from size, location, rooms.
- Predict daily temperature from humidity, pressure.
- Predict student marks from study hours.
[Size > 1500 sqft?]
/ \
Yes No
β β
[Price β 80L] [Size > 800 sqft?]
/ \
Yes No
β β
[Price β 50L] [Price β 25L]
Classification Tree vs Regression Tree
| Aspect | Classification Tree | Regression Tree |
|---|---|---|
| Target variable | Categorical (discrete) | Continuous (numerical) |
| Output at leaf | Class label (majority vote) | Numerical value (mean / median) |
| Splitting criterion | Information Gain / Gini Impurity | Variance Reduction / MSE |
| Use case | Spam detection, tumour classification | House price, temperature prediction |
| Evaluation metric | Accuracy, Precision, Recall, F1 | MSE, RMSE, MAE, RΒ² |
The combined family is also called CART (Classification and Regression Trees) β a single algorithm that supports both modes depending on the target type.
4.7 Overfitting and Pruning
- A decision tree can grow too deep and overfit (memorize training data).
- Pruning cuts unnecessary branches:
- Pre-pruning: Stop growing when improvement is small.
- Post-pruning: Grow full tree, then remove low-value branches.
4.8 Advantages and Disadvantages
Advantages:
β
Easy to understand and visualize.
β
No need for feature scaling.
β
Handles both numerical and categorical data.
β
Implicitly performs feature selection.
Disadvantages:
β Prone to overfitting (especially deep trees).
β Unstable β small changes in data can change the tree.
β Biased toward features with more values.
Applications:
- Medical diagnosis
- Credit risk scoring
- Fraud detection
- Customer segmentation
Section 5: Linear Regression
PYQ: Explain Linear Regression. (2022, 2.5 marks)
PYQ: Explain regression with an example. (2022, 7.5 marks)
5.1 What is Linear Regression?
Definition:
Linear Regression is a supervised learning algorithm used for regression (predicting continuous values). It models the relationship between input features (X) and the output (Y) as a straight line (linear equation).
5.2 The Linear Equation
Simple Linear Regression (one feature):
ΕΆ = wβ + wβX
Where:
ΕΆ = predicted value
X = input feature
wβ = intercept (bias) β where the line crosses Y-axis
wβ = slope (weight) β how much Y changes per unit of X
Multiple Linear Regression (n features):
ΕΆ = wβ + wβXβ + wβXβ + ... + wβXβ
Or in matrix form:
ΕΆ = Xw
Diagram:
Y
β /
β / β Regression Line (ΕΆ = wβ + wβX)
β / β
β /β
β / β
β / β
β/β
ββββββββββββββ X
5.3 Cost Function β Mean Squared Error (MSE)
We measure how well the line fits using Mean Squared Error:
1 m
MSE = βββ Ξ£ (yα΅’ - Ε·α΅’)Β²
m i=1
Where:
yα΅’ = actual value
Ε·α΅’ = predicted value
m = number of samples
The goal of training is to minimize MSE by finding the best values of wβ and wβ.
5.4 Finding Optimal Weights
Method 1: Ordinary Least Squares (Closed-form)
Analytical solution:
w = (Xα΅X)β»ΒΉ Xα΅y
Best for small datasets with few features.
Method 2: Gradient Descent (Iterative)
Repeat until convergence:
wβ := wβ - Ξ± Γ βMSE/βwβ
wβ := wβ - Ξ± Γ βMSE/βwβ
Where Ξ± = learning rate
Partial derivatives:
βMSE/βwβ = (-2/m) Ξ£(yα΅’ - Ε·α΅’)
βMSE/βwβ = (-2/m) Ξ£(yα΅’ - Ε·α΅’)Γxα΅’
5.5 Worked Example
Data:
| X (hours studied) | Y (marks) |
|---|---|
| 1 | 50 |
| 2 | 60 |
| 3 | 70 |
| 4 | 80 |
Using formulas:
n = 4
Ξ£X = 1+2+3+4 = 10
Ξ£Y = 50+60+70+80 = 260
Ξ£XY = (1Γ50)+(2Γ60)+(3Γ70)+(4Γ80) = 50+120+210+320 = 700
Ξ£XΒ² = 1+4+9+16 = 30
wβ = (nΞ£XY - Ξ£XΞ£Y) / (nΞ£XΒ² - (Ξ£X)Β²)
= (4Γ700 - 10Γ260) / (4Γ30 - 100)
= (2800 - 2600) / (120 - 100)
= 200 / 20
= 10
wβ = (Ξ£Y - wβΞ£X) / n
= (260 - 10Γ10) / 4
= (260 - 100) / 4
= 40
Equation: ΕΆ = 40 + 10X
Predict for X=5: ΕΆ = 40 + 10Γ5 = 90 marks
5.6 Advantages and Disadvantages
Advantages:
β
Simple and fast.
β
Easily interpretable (slope tells effect of each feature).
β
Works well when relationship is truly linear.
Disadvantages:
β Assumes a linear relationship (fails on non-linear data).
β Sensitive to outliers.
β Assumes features are independent (multicollinearity hurts it).
Section 6: Logistic Regression
PYQ: Explain Logistic Regression. (2023, 2.5 marks)
6.1 What is Logistic Regression?
Definition:
Logistic Regression is a supervised learning algorithm used for binary classification (output is 0 or 1). Despite the name, it is a classification algorithm, not regression.
It models the probability that an input belongs to a class using the sigmoid function.
6.2 The Sigmoid Function
1
Ο(z) = ββββββββββ
1 + e^(-z)
Where:
z = wβ + wβXβ + wβXβ + ... = linear combination of inputs
Output: always between 0 and 1 (interpreted as probability)
6.3 Decision Rule
β 1 (Class 1) if P(Y=1|X) β₯ 0.5 [i.e., z β₯ 0]
ΕΆ = β
β 0 (Class 0) if P(Y=1|X) < 0.5 [i.e., z < 0]
The decision boundary is the line where z = 0.
6.4 Cost Function β Log Loss (Binary Cross-Entropy)
J(w) = -(1/m) Ξ£ [yα΅’ log(Ε·α΅’) + (1-yα΅’) log(1-Ε·α΅’)]
Where:
yα΅’ = actual label (0 or 1)
Ε·α΅’ = predicted probability
Weights are updated using gradient descent to minimize J(w).
6.5 Linear Regression vs Logistic Regression
| Feature | Linear Regression | Logistic Regression |
|---|---|---|
| Task | Regression | Classification |
| Output | Continuous value | Probability (0β1) |
| Function | Linear (y = wx + b) | Sigmoid (Ο(z)) |
| Loss | MSE | Log Loss |
| Decision | No threshold | Threshold at 0.5 |
| Example | Predict house price | Predict spam or not |
6.6 Advantages and Disadvantages
Advantages:
β
Probabilistic output (useful for confidence scores).
β
Fast to train and predict.
β
Easy to interpret with feature weights.
β
Works well for linearly separable data.
Disadvantages:
β Assumes linear decision boundary.
β Fails on complex, non-linear problems.
β Sensitive to outliers and correlated features.
Applications:
- Email spam detection
- Disease prediction (diabetes, cancer risk)
- Credit approval
- Customer churn prediction
Section 7: Support Vector Machines (SVM)
PYQ: What is the goal of the Support Vector Machine (SVM)? Also discuss the margin. (2022, 15 marks)
PYQ: Support Vectors. (2023, 2.5 marks)
7.1 What is SVM?
Definition:
A Support Vector Machine (SVM) is a supervised learning algorithm that finds the best hyperplane (decision boundary) that maximally separates the two classes.
Key Idea: Don't just find any line that separates classes β find the one with the largest margin (gap) between classes.
7.2 Key Concepts
Hyperplane
A hyperplane is a decision boundary that separates two classes.
- In 2D: a line.
- In 3D: a plane.
- In n-D: a hyperplane.
Equation:
wΒ·x + b = 0
Where:
w = weight vector (normal to hyperplane)
x = input feature vector
b = bias
Support Vectors
Support Vectors are the data points closest to the hyperplane (from each class). They are the critical points β they "support" or define the hyperplane.
Margin
The margin is the total distance between the two parallel boundary lines (one touching each class's closest points).
Margin
ββββββββ
βββββββββββββββββββ β Upper boundary (wΒ·x + b = +1)
β
β β β
ββββββββββββββββββββ β Optimal Hyperplane (wΒ·x + b = 0)
β β β
β
βββββββββββββββββββ β Lower boundary (wΒ·x + b = -1)
β = Class +1 β = Class -1
Margin width:
2
M = βββββ
||w||
SVM maximizes M β minimizes ||w||.
7.3 Hard Margin vs Soft Margin
| Hard Margin SVM | Soft Margin SVM | |
|---|---|---|
| Assumption | Data is perfectly separable | Allows some misclassifications |
| Constraint | All points outside margin | Some points can be inside margin |
| Parameter | None | C (penalty parameter) |
| Sensitivity | Very sensitive to outliers | More robust |
C parameter (regularization):
- High C β Small margin, fewer errors on training data (risk of overfit).
- Low C β Wide margin, allows more errors (better generalization).
7.4 The Kernel Trick
Problem: Data is not always linearly separable in the original feature space.
Solution: The kernel trick maps data to a higher-dimensional space where it becomes separable, without explicitly computing the transformation.
Original 2D space Higher-dimensional space
(not separable) (separable)
ββ β β β β
β β β βββββΊ βββββββββ
ββ β β β β
Common Kernels:
| Kernel | Formula | Use Case |
|---|---|---|
| Linear | K(x,y) = xα΅y | Linearly separable data |
| Polynomial | K(x,y) = (xα΅y + c)α΅ | Polynomial boundaries |
| RBF / Gaussian | K(x,y) = exp(-Ξ³ | |
| Sigmoid | K(x,y) = tanh(Ξ±xα΅y + c) | Neural network-like |
7.5 SVM for Multi-Class
SVM is inherently binary. Two strategies to extend it:
- One-vs-One (OvO): Train one SVM per pair of classes.
- One-vs-All (OvA): Train one SVM per class vs all others.
7.6 Advantages and Disadvantages
Advantages:
β
Works well with high-dimensional data.
β
Effective when number of features > number of samples.
β
Memory efficient (only support vectors matter).
β
Kernel trick handles non-linear problems.
β
Robust to overfitting (especially with soft margin).
Disadvantages:
β Slow on large datasets (quadratic programming).
β Sensitive to feature scaling (needs normalization).
β Choosing the right kernel and C is tricky.
β Hard to interpret compared to Decision Trees.
Applications:
- Image classification
- Text and document categorization
- Bioinformatics (protein classification)
- Face detection
Quick Revision Points
Algorithms at a Glance:
| Algorithm | Type | Key Idea | Key Formula |
|---|---|---|---|
| k-NN | Classification | Majority vote of k nearest neighbours | Euclidean distance |
| NaΓ―ve Bayes | Classification | Bayes theorem + independence assumption | P(C|X) β P(X|C)P(C) |
| Decision Tree | Both | Split on highest info gain | IG = H(S) - Ξ£H(Sα΅₯) |
| Linear Regression | Regression | Fit a line to minimize error | ΕΆ = wβ + wβX |
| Logistic Regression | Classification | Sigmoid maps to probability | Ο(z) = 1/(1+e^-z) |
| SVM | Both | Maximize margin between classes | Maximize 2/||w|| |
Key Formulas:
Euclidean Distance: d = β[Ξ£(aα΅’-bα΅’)Β²]
Entropy: H = -Ξ£ pα΅’ logβ(pα΅’)
Information Gain: IG = H(S) - Ξ£|Sα΅₯|/|S| H(Sα΅₯)
Sigmoid: Ο(z) = 1 / (1 + e^(-z))
SVM Margin: M = 2 / ||w||
Linear Regression: ΕΆ = wβ + wβX
MSE: (1/m) Ξ£(yα΅’ - Ε·α΅’)Β²
Classification vs Regression:
| Algorithm | Classification | Regression |
|---|---|---|
| k-NN | β | β |
| NaΓ―ve Bayes | β | β |
| Decision Tree | β | β |
| Linear Regression | β | β |
| Logistic Regression | β | β |
| SVM | β | β (SVR) |
Expected Exam Questions
15-Mark Questions:
- What is the goal of the Support Vector Machine (SVM)? Also discuss the margin. (2022)
- Explain Decision Trees algorithm in ML and discuss its types. (2023)
- What are the basic assumptions used in Naive Bayes classifier? Why is it called Naive Bayes? Also discuss the working of Naive Bayes algorithm. (2023)
- Define Supervised Learning and explain its working principle. Provide a step-by-step example of how a supervised learning algorithm processes training data and makes predictions. (2024)
- Discuss the importance of labelled data in Supervised Learning and its role in training predictive models. Also, explain Decision Trees by taking a suitable example. (2024)
Short Answer Questions (2.5 marks):
- Explain Linear Regression. (2022)
- Support Vectors. (2023)
- K-NN is called Lazy Learner. Explain it. (2023)
- Underfitting. (2023)
- Logistic Regression. (2023)
Mixed (7.5 marks):
- Explain Naive Bayes Classifier. (2022)
- Explain regression with an example. (2022)
These notes were compiled by Deepak Modi
Last updated: May 2026