Syllabus:
Dimensionality Reduction: Definition, Row vector and Column vector, how to represent a dataset, how to represent a dataset as a Matrix, Data preprocessing in Machine Learning: Feature Normalization, Mean of a data matrix, Column Standardization, Co-variance of a Data Matrix, Principal Component Analysis for Dimensionality reduction.
🎯 PYQ Analysis for Unit 2
PYQs will be added after analysis - check back soon.
Section 1: Dimensionality Reduction
1.1 What is Dimensionality?
In Machine Learning, dimensionality refers to the number of features (columns / attributes) in a dataset.
Example:
A dataset of houses may have features:
- Size (sq ft), No. of Rooms, Location, Price, Age, Floor, Parking, Garden, ...
If there are 100 features, we say the dataset has 100 dimensions.
1.2 What is Dimensionality Reduction?
Definition:
Dimensionality Reduction is the process of reducing the number of features (dimensions) in a dataset while retaining as much important information as possible.
Instead of working with 100 features, we may reduce to 10 features that capture most of the information.
Simple Analogy:
Imagine you have a 3D sculpture. To describe it to someone, you could draw a 2D shadow (projection) of it. The shadow is simpler but still represents the key shape of the object. Dimensionality reduction works the same way - it projects high-dimensional data into a lower dimension without losing the main structure.
1.3 Why is Dimensionality Reduction Needed?
The Curse of Dimensionality:
- As features increase, the data becomes sparse in the high-dimensional space.
- ML models need exponentially more data to learn as dimensions grow.
- Beyond a certain point, adding more features hurts model performance.
Problems with Too Many Dimensions:
- Overfitting → model memorizes training data, fails on new data.
- High Computation Cost → more features = more memory and time.
- Irrelevant Features → some features add noise, not signal.
- Visualization → we can only visualize 2D or 3D data directly.
- Multicollinearity → many features may be correlated (redundant).
Benefits of Dimensionality Reduction:
✅ Reduces training time.
✅ Removes noise and redundant features.
✅ Helps prevent overfitting.
✅ Makes data easier to visualize.
✅ Improves model accuracy in many cases.
Two Main Approaches:
| Approach | Technique | Description |
|---|---|---|
| Feature Selection | Filter, Wrapper, Embedded | Select a subset of original features |
| Feature Extraction | PCA, LDA, Autoencoders | Create new (fewer) features from originals |
Section 2: Vectors and Dataset Representation
2.1 Row Vector
Definition:
A Row Vector is a matrix with only one row and multiple columns. It represents a single data point (one example/sample) with all its features.
Notation:
Row vector: x = [x₁ x₂ x₃ ... xₙ]
Shape: (1 × n) → 1 row, n columns
Example:
One student's exam record:
x = [85 72 90 65 78]
M1 M2 M3 M4 M5
Here, each value is a score in a subject → one row = one student.
2.2 Column Vector
Definition:
A Column Vector is a matrix with only one column and multiple rows. It represents a single feature across all data points (all values of one attribute).
Notation:
Column vector:
    ⎡ x₁ ⎤
    ⎢ x₂ ⎥
x = ⎢ x₃ ⎥
    ⎢ .. ⎥
    ⎣ xₙ ⎦
Shape: (n × 1) → n rows, 1 column
Example:
All students' scores in subject Math (M1):
    ⎡ 85 ⎤
    ⎢ 72 ⎥
x = ⎢ 90 ⎥
    ⎢ 65 ⎥
    ⎣ 78 ⎦
Here, each value is one student's Math score → one column = one feature.
Comparison:
| | Row Vector | Column Vector |
|---|---|---|
| Shape | 1 × n | n × 1 |
| Represents | One data point (sample) | One feature (attribute) |
| Example | [85, 72, 90] → one student | [85, 72, 65] → all students' Math marks |
2.3 How to Represent a Dataset
A dataset is a collection of data points (examples / samples).
For ML:
- Each row = one data point (one example / observation).
- Each column = one feature (attribute / variable).
Example Dataset (5 students, 3 subjects):
| Student | Math | Physics | Chemistry |
|---|---|---|---|
| S1 | 85 | 78 | 90 |
| S2 | 72 | 65 | 80 |
| S3 | 90 | 88 | 92 |
| S4 | 60 | 70 | 65 |
| S5 | 78 | 82 | 75 |
- Rows (m = 5): number of samples
- Columns (n = 3): number of features
2.4 How to Represent a Dataset as a Matrix
The dataset is stored as a 2D matrix X of shape (m × n):
- m = number of rows (data points / samples)
- n = number of columns (features / dimensions)
Matrix Notation:
         Feature 1  Feature 2  Feature 3
S1 →  ⎡   x₁₁        x₁₂        x₁₃   ⎤
S2 →  ⎢   x₂₁        x₂₂        x₂₃   ⎥
S3 →  ⎢   x₃₁        x₃₂        x₃₃   ⎥
S4 →  ⎢   x₄₁        x₄₂        x₄₃   ⎥
S5 →  ⎣   x₅₁        x₅₂        x₅₃   ⎦

X = m × n matrix (5 × 3 in this example)
For the student example:
    ⎡ 85  78  90 ⎤
    ⎢ 72  65  80 ⎥
X = ⎢ 90  88  92 ⎥
    ⎢ 60  70  65 ⎥
    ⎣ 78  82  75 ⎦
Shape: 5 × 3 (5 students, 3 features)
How to access elements:
- xᵢⱼ = element in row i, column j
- x₁₂ = 78 (Student 1's Physics score)
- x₃₁ = 90 (Student 3's Math score)
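To make this concrete, here is a minimal sketch (assuming NumPy is available) that stores the student dataset as an m × n matrix and pulls out row vectors, column vectors, and single elements. Note that NumPy is 0-indexed, so the notes' x₁₂ becomes X[0, 1] in code.

```python
import numpy as np

# Dataset from Section 2.3: 5 students (rows) x 3 subjects (columns)
X = np.array([
    [85, 78, 90],   # S1
    [72, 65, 80],   # S2
    [90, 88, 92],   # S3
    [60, 70, 65],   # S4
    [78, 82, 75],   # S5
])

print(X.shape)        # (5, 3) -> m = 5 samples, n = 3 features

row_vector = X[0, :]  # one sample (S1) with all its features: [85 78 90]
col_vector = X[:, 0]  # one feature (Math) across all samples: [85 72 90 60 78]

# NumPy is 0-indexed, so x_12 from the notes (Student 1, Physics) is X[0, 1]
print(X[0, 1])        # 78
print(X[2, 0])        # 90 (Student 3's Math score)
```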
Special Cases:
| Case | Shape | Name |
|---|---|---|
| m = 1 | 1 × n | Row vector (one sample) |
| n = 1 | m × 1 | Column vector (one feature) |
| m = n | n × n | Square matrix |
Section 3: Data Preprocessing in Machine Learning
3.1 Why Preprocess Data?
Real-world data is raw, messy, and inconsistent. Before feeding it to an ML algorithm, it must be cleaned and transformed. This is called Data Preprocessing.
Problems in Raw Data:
- Missing values
- Inconsistent formats
- Features on very different scales (e.g., age: 25, salary: 50000)
- Redundant features
Preprocessing Steps Covered Here:
- Computing the Mean of a data matrix
- Feature Normalization
- Column Standardization
- Computing the Covariance matrix
3.2 Mean of a Data Matrix
Definition:
The mean (average) of a data matrix is calculated column-wise (per feature). It gives the average value of each feature across all data points.
Formula:
For feature j:
μⱼ = ( Σᵢ₌₁ᵐ xᵢⱼ ) / m
Where:
xᵢⱼ = value in row i, column j
m = number of data points (rows)
μⱼ = mean of feature j
Example:
Dataset (3 students, 2 features):
    ⎡ 80  60 ⎤
X = ⎢ 90  70 ⎥
    ⎣ 70  80 ⎦
Calculate mean of each column:
μ₁ (Math mean) = (80 + 90 + 70) / 3 = 240 / 3 = 80
μ₂ (Physics mean) = (60 + 70 + 80) / 3 = 210 / 3 = 70
Mean vector:
μ = [80, 70]
Use of Mean:
- Used for centering data (subtracting mean from each value).
- Used in normalization, standardization, and PCA.
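A quick sketch of the column-wise mean and mean-centering described above, assuming NumPy:

```python
import numpy as np

X = np.array([[80, 60],
              [90, 70],
              [70, 80]], dtype=float)

mu = X.mean(axis=0)              # axis=0 averages down each column (one mean per feature)
print(mu)                        # [80. 70.]  -> the mean vector mu = [80, 70]

X_centered = X - mu              # centering: subtract each feature's mean from its column
print(X_centered.mean(axis=0))   # [0. 0.] -> centered data has zero mean per feature
```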
3.3 Feature Normalization
Definition:
Feature Normalization (also called Min-Max Scaling) rescales each feature so that its values fall in the range [0, 1] (or sometimes [-1, 1]).
Why needed?
- If Math scores range from 0-100 and Age ranges from 18-25, the algorithm may give more weight to Math just because the values are larger.
- Normalization puts all features on the same scale so no feature dominates.
Formula (Min-Max Normalization):
x'ᵢⱼ = (xᵢⱼ - min(xⱼ)) / (max(xⱼ) - min(xⱼ))
Where:
xᵢⱼ = original value
min(xⱼ) = minimum value of feature j
max(xⱼ) = maximum value of feature j
x'ᵢⱼ = normalized value (between 0 and 1)
Example:
Feature values (Math): [60, 70, 80, 90, 100]
min = 60, max = 100
Normalize 60: (60 - 60) / (100 - 60) = 0/40 = 0.00
Normalize 70: (70 - 60) / (100 - 60) = 10/40 = 0.25
Normalize 80: (80 - 60) / (100 - 60) = 20/40 = 0.50
Normalize 90: (90 - 60) / (100 - 60) = 30/40 = 0.75
Normalize 100: (100 - 60)/ (100 - 60) = 40/40 = 1.00
Result: [0.00, 0.25, 0.50, 0.75, 1.00]
Properties:
- Output is always in [0, 1].
- Sensitive to outliers (extreme values distort the range).
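The same min-max calculation as a small sketch, assuming NumPy:

```python
import numpy as np

x = np.array([60, 70, 80, 90, 100], dtype=float)   # Math scores from the example

x_norm = (x - x.min()) / (x.max() - x.min())       # (x - min) / (max - min)
print(x_norm)                                      # [0.   0.25 0.5  0.75 1.  ]

# A single extreme value (outlier) would stretch max - min and squash all other values.
```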
3.4 Column Standardization (Z-Score Normalization)
Definition:
Column Standardization (Z-Score Standardization) transforms each feature so that it has a mean of 0 and a standard deviation of 1.
Why used instead of normalization?
- Normalization is sensitive to outliers.
- Standardization is less affected by outliers, because it does not depend only on the extreme min and max values.
- Expected by many scale-sensitive algorithms: PCA, SVM, KNN, Logistic Regression.
Formulas:
Step 1 - Calculate the Mean (μⱼ):
μⱼ = ( Σᵢ₌₁ᵐ xᵢⱼ ) / m
Step 2 - Calculate the Standard Deviation (σⱼ):
σⱼ = √[ (1/m) Σᵢ₌₁ᵐ (xᵢⱼ - μⱼ)² ]
Step 3 - Standardize:
z'ᵢⱼ = (xᵢⱼ - μⱼ) / σⱼ
Example:
Feature (Math scores): [60, 70, 80, 90, 100]
Mean: μ = (60+70+80+90+100) / 5 = 400 / 5 = 80
Variance:
σ² = [(60-80)² + (70-80)² + (80-80)² + (90-80)² + (100-80)²] / 5
   = [400 + 100 + 0 + 100 + 400] / 5
   = 1000 / 5
   = 200
Std Dev: σ = √200 ≈ 14.14
Standardize:
z(60) = (60 - 80) / 14.14 = -20/14.14 ≈ -1.41
z(70) = (70 - 80) / 14.14 = -10/14.14 ≈ -0.71
z(80) = (80 - 80) / 14.14 = 0/14.14 = 0.00
z(90) = (90 - 80) / 14.14 = 10/14.14 ≈ +0.71
z(100) = (100 - 80) / 14.14 = 20/14.14 ≈ +1.41
Result: [-1.41, -0.71, 0.00, +0.71, +1.41]
Properties:
- Mean of standardized column = 0.
- Standard deviation of standardized column = 1.
- Values can be negative (unlike normalization).
- Less sensitive to outliers than min-max normalization (though extreme values still affect the mean and standard deviation).
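The same standardization, verified numerically (a sketch assuming NumPy; np.std defaults to the population standard deviation used in the worked example):

```python
import numpy as np

x = np.array([60, 70, 80, 90, 100], dtype=float)

mu = x.mean()             # 80.0
sigma = x.std()           # population std (ddof=0): sqrt(200) ~ 14.14
z = (x - mu) / sigma

print(z.round(2))                                # [-1.41 -0.71  0.    0.71  1.41]
print(round(z.mean(), 10), round(z.std(), 10))   # 0.0 and 1.0, as expected
```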
Normalization vs Standardization:
| | Normalization | Standardization |
|---|---|---|
| Range | [0, 1] | (-∞, +∞) |
| Mean | Not fixed | Always 0 |
| Std Dev | Not fixed | Always 1 |
| Outlier Sensitivity | High | Lower |
| Formula | (x - min)/(max - min) | (x - μ)/σ |
| Use When | A bounded output range is needed | Features should have mean = 0, σ = 1 |
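Both transforms are available as ready-made scalers if scikit-learn is installed; a minimal sketch (output formatting may differ slightly across versions):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Scalers expect a 2-D array of shape (samples, features)
x = np.array([[60], [70], [80], [90], [100]], dtype=float)

print(MinMaxScaler().fit_transform(x).ravel())     # [0.   0.25 0.5  0.75 1.  ]
print(StandardScaler().fit_transform(x).ravel())   # approx [-1.41 -0.71  0.    0.71  1.41]
```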
3.5 Covariance of a Data Matrix
Definition:
Covariance measures how much two features change together. It tells us the relationship (direction) between two variables.
Interpretation:
| Covariance | Meaning |
|---|---|
| Positive | Both features increase together |
| Negative | One increases, other decreases |
| Zero | No linear relationship between them |
Formula (Covariance between feature j and feature k):
Cov(j,k) = (1/m) Σᵢ₌₁ᵐ (xᵢⱼ - μⱼ)(xᵢₖ - μₖ)
Where:
xᵢⱼ = value of feature j in sample i
μⱼ = mean of feature j
m = number of samples
Note: Some formulas use (m-1) in the denominator for an unbiased estimate.
3.6 Covariance Matrix
When a dataset has n features, the Covariance Matrix is an n × n matrix where element (j, k) is the covariance between feature j and feature k.
Structure:
          F1        F2        F3
F1 → ⎡ Cov(1,1)  Cov(1,2)  Cov(1,3) ⎤
F2 → ⎢ Cov(2,1)  Cov(2,2)  Cov(2,3) ⎥
F3 → ⎣ Cov(3,1)  Cov(3,2)  Cov(3,3) ⎦
Diagonal elements: Cov(j,j) = Variance of feature j
Off-diagonal elements: Covariance between two different features
Properties:
- Covariance matrix is always symmetric: Cov(j,k) = Cov(k,j)
- Diagonal = variance of each feature
- Off-diagonal = covariance between pairs of features
3.7 Worked Example β Covariance Matrix
Dataset (3 samples, 2 features):
       F1  F2
S1 → ⎡ 2   4 ⎤
S2 → ⎢ 4   6 ⎥
S3 → ⎣ 6   8 ⎦
Step 1: Calculate means:
μ₁ = (2 + 4 + 6) / 3 = 12/3 = 4
μ₂ = (4 + 6 + 8) / 3 = 18/3 = 6
Step 2: Center the data (subtract mean):
      F1 - μ₁     F2 - μ₂
S1:   2-4 = -2    4-6 = -2
S2:   4-4 =  0    6-6 =  0
S3:   6-4 = +2    8-6 = +2
Step 3: Calculate covariances:
Cov(F1,F1) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = [4 + 0 + 4] / 3 = 8/3 ≈ 2.67
Cov(F2,F2) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = 8/3 ≈ 2.67
Cov(F1,F2) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = [4 + 0 + 4] / 3 = 8/3 ≈ 2.67
Covariance Matrix:
Σ = ⎡ 2.67  2.67 ⎤
    ⎣ 2.67  2.67 ⎦
Interpretation: Cov(F1,F2) = 2.67 > 0 → F1 and F2 increase together (positive relationship).
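The worked covariance matrix can be re-derived in a few lines of NumPy (a sketch; note that np.cov defaults to the unbiased 1/(m-1) denominator, so its result differs from the 1/m version above by a factor of m/(m-1)):

```python
import numpy as np

X = np.array([[2, 4],
              [4, 6],
              [6, 8]], dtype=float)

Xc = X - X.mean(axis=0)                  # center each feature
cov_biased = (Xc.T @ Xc) / X.shape[0]    # 1/m convention used in the worked example
print(cov_biased)                        # [[2.67 2.67] [2.67 2.67]] (approximately)

# Library version: rowvar=False treats columns as features; default denominator is 1/(m-1)
print(np.cov(X, rowvar=False))           # [[4. 4.] [4. 4.]]
```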
Section 4: Principal Component Analysis (PCA)
4.1 What is PCA?
Definition:
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into a new set of features called Principal Components (PCs). These components:
- Are uncorrelated with each other.
- Are ordered so that the first PC captures the most variance, the second captures the next most, and so on.
The goal is to keep the most informative components and discard the rest, reducing dimensionality without losing much information.
Simple Analogy:
Imagine looking at a 3D object from different angles. From some angles, you see the object's full shape clearly. PCA finds the best angle to view the data so that as much information as possible is visible in fewer dimensions.
4.2 Key Concepts in PCA
| Term | Meaning |
|---|---|
| Principal Components (PCs) | New axes in the transformed space |
| Eigenvectors | Direction of each principal component |
| Eigenvalues | Amount of variance captured by each PC |
| Variance | How spread out the data is along a direction |
| Explained Variance Ratio | % of total variance captured by each PC |
4.3 Intuition Behind PCA
Why eigenvectors and eigenvalues?
- Eigenvectors point in the directions of maximum variance in the data.
- Eigenvalues tell how much variance each eigenvector captures.
- The first eigenvector (PC1) captures the most variance.
- The second eigenvector (PC2) is perpendicular to PC1 and captures the next most.
Diagram:
PC2 (less variance)
 ↑
 |      ·   ·    ·   ·     ·
 |   ·    ·   ·    ·    ·    ·
 |       ·    ·   ·   ·
 +----------------------------→ PC1 (most variance)
Data is elongated along PC1.
If we keep only PC1, we preserve most information.
4.4 Steps of PCA
Step-by-Step Process:
Step 1: Standardize the Data
          ↓
Step 2: Compute the Covariance Matrix
          ↓
Step 3: Compute Eigenvalues and Eigenvectors
          ↓
Step 4: Sort by Eigenvalue (Descending)
          ↓
Step 5: Select Top k Principal Components
          ↓
Step 6: Project Data onto the New Feature Space
4.5 Detailed Explanation of Each Step
Step 1: Standardize the Data
- Apply column standardization (mean=0, Ο=1) to all features.
- This ensures no feature dominates due to its scale.
z'ᵢⱼ = (xᵢⱼ - μⱼ) / σⱼ
Step 2: Compute the Covariance Matrix
- Calculate the n × n covariance matrix of the standardized data.
Σ = (1/m) XᵀX   (after mean-centering)
- Diagonal: variance of each feature.
- Off-diagonal: how features are correlated.
Step 3: Compute Eigenvalues and Eigenvectors
- Solve the characteristic equation:
det(Σ - λI) = 0
Where:
λ = eigenvalue
I = identity matrix
- For each eigenvalue λ, solve (Σ - λI)v = 0 for the eigenvector v.
- Eigenvalue → amount of variance the component explains.
- Eigenvector → direction of the component.
Step 4: Sort by Eigenvalue (Descending)
- Sort eigenvalues from largest to smallest.
- The corresponding eigenvectors are sorted in the same order.
λ₁ ≥ λ₂ ≥ λ₃ ≥ ... ≥ λₙ
 ↓     ↓     ↓
PC1   PC2   PC3   ... (ordered by importance)
- PC1 explains the most variance.
- PC2 explains the second most, and so on.
Step 5: Select Top k Principal Components
- Decide how many components k to keep.
- Common rule: keep enough PCs to explain ≥ 95% of total variance.
Explained Variance = ( Σᵢ₌₁ᵏ λᵢ / Σᵢ₌₁ⁿ λᵢ ) × 100%
Example:
Eigenvalues: λ₁ = 5.0, λ₂ = 3.0, λ₃ = 1.5, λ₄ = 0.5
Total = 10.0
PC1 explains: 5.0/10.0 = 50%
PC2 explains: 3.0/10.0 = 30%
PC1+PC2 combined: 80%
PC1+PC2+PC3: 95%
→ Keep 3 components to explain 95% of variance.
→ Reduce from 4 features to 3 features.
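A short sketch (assuming NumPy) of turning those eigenvalues into explained-variance ratios and a cumulative sum for picking k:

```python
import numpy as np

eigenvalues = np.array([5.0, 3.0, 1.5, 0.5])   # already sorted in descending order

ratio = eigenvalues / eigenvalues.sum()        # per-PC explained variance ratio
cumulative = np.cumsum(ratio)                  # running total

print(ratio)        # [0.5  0.3  0.15 0.05]
print(cumulative)   # [0.5  0.8  0.95 1.  ]
k = 3               # smallest k whose cumulative ratio reaches the 95% threshold
```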
Step 6: Project Data onto New Feature Space
- Create a projection matrix W from the top k eigenvectors.
- Multiply the original data matrix by W:
Z = X × W
Where:
X = original data matrix (m × n)
W = top k eigenvectors (n × k)
Z = projected data (m × k)
- Z is the reduced dataset with only k dimensions.
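Putting the six steps together, here is a from-scratch sketch in NumPy. The function name pca and its arguments are illustrative (not from the notes), and it uses the same 1/m covariance convention as this unit:

```python
import numpy as np

def pca(X, k):
    """Reduce X (m x n) to k dimensions; return projected data and variance ratios."""
    # Step 1: column standardization (mean 0, std 1 per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step 2: covariance matrix of the standardized data (1/m convention)
    m = X.shape[0]
    cov = (X_std.T @ X_std) / m

    # Step 3: eigenvalues and eigenvectors (eigh suits symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Step 4: sort both from largest to smallest eigenvalue
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Step 5: projection matrix W = top k eigenvectors (n x k)
    W = eigvecs[:, :k]

    # Step 6: project the standardized data onto the new axes (m x k)
    Z = X_std @ W
    return Z, eigvals / eigvals.sum()
```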
4.6 PCA Example (Worked)
Dataset (4 samples, 2 features):
F1 F2
2 4
4 6
6 8
8 10
Step 1: Standardize
μ₁ = 5, σ₁ ≈ 2.24
μ₂ = 7, σ₂ ≈ 2.24
Standardized:
 z₁     z₂
-1.34 -1.34
-0.45 -0.45
0.45 0.45
1.34 1.34
Step 2: Covariance Matrix
Σ = ⎡ 1.0  1.0 ⎤
    ⎣ 1.0  1.0 ⎦
Step 3: Eigenvalues
det(Σ - λI) = 0
(1-λ)² - 1 = 0
λ² - 2λ = 0
λ(λ - 2) = 0
λ₁ = 2, λ₂ = 0
Step 4: Eigenvectors
For λ₁ = 2: v₁ = [1/√2,  1/√2] = [ 0.71,  0.71]
For λ₂ = 0: v₂ = [1/√2, -1/√2] = [ 0.71, -0.71]
Step 5: Select PC1 only (explains 100% of variance)
W = [0.71, 0.71]ᵀ
Step 6: Project
Z = Standardized data × W
Z₁ = (-1.34)(0.71) + (-1.34)(0.71) = -1.90
Z₂ = (-0.45)(0.71) + (-0.45)(0.71) = -0.64
Z₃ = ( 0.45)(0.71) + ( 0.45)(0.71) =  0.64
Z₄ = ( 1.34)(0.71) + ( 1.34)(0.71) =  1.90
Reduced dataset: [-1.90, -0.64, 0.64, 1.90]
(2 features → 1 feature, no information lost here)
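The worked example can be checked numerically (a sketch assuming NumPy). Eigenvector signs are arbitrary, so the projected values may come out negated, and the exact middle values are ±0.63 rather than the ±0.64 obtained above from rounded intermediates:

```python
import numpy as np

X = np.array([[2, 4], [4, 6], [6, 8], [8, 10]], dtype=float)

X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # Step 1: standardize
cov = (X_std.T @ X_std) / X.shape[0]           # Step 2: [[1. 1.] [1. 1.]]
eigvals, eigvecs = np.linalg.eigh(cov)         # Step 3: eigenpairs (ascending order)
v1 = eigvecs[:, np.argmax(eigvals)]            # Steps 4-5: eigenvector of the largest eigenvalue
Z = X_std @ v1                                 # Step 6: project onto PC1

print(eigvals.round(2))   # [0. 2.]
print(Z.round(2))         # approx [-1.9 -0.63  0.63  1.9] (possibly negated)
```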
4.7 Advantages and Disadvantages of PCA
Advantages:
✅ Reduces training time and memory.
✅ Removes correlated (redundant) features.
✅ Reduces overfitting.
✅ Enables 2D/3D visualization of high-dimensional data.
✅ Noise reduction (low-variance components are often mostly noise).
Disadvantages:
❌ Principal components are hard to interpret (they are not the original features).
❌ Sensitive to outliers (which skew the covariance matrix).
❌ Assumes linear relationships between features.
❌ Requires standardized data (preprocessing needed).
❌ Some information is lost (how much depends on the chosen k).
4.8 When to Use PCA
- When you have many features (high-dimensional data).
- When many features are correlated with each other.
- When you want to visualize data in 2D or 3D.
- When training is too slow due to many features.
- When model is overfitting due to too many features.
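In practice, PCA is usually run through a library. The sketch below assumes scikit-learn is installed; the toy data, random seed, and the 0.95 threshold are illustrative, not from the notes. Passing a fraction to n_components makes PCA keep just enough components to reach that share of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                          # toy data: 100 samples, 10 features
X[:, 5:] = X[:, :5] + 0.1 * rng.normal(size=(100, 5))   # last 5 features nearly duplicate the first 5

X_std = StandardScaler().fit_transform(X)               # Step 1: standardize

pca = PCA(n_components=0.95)                            # keep enough PCs for >= 95% variance
Z = pca.fit_transform(X_std)

print(Z.shape)                                  # about (100, 5): redundant features collapse
print(pca.explained_variance_ratio_.round(3))   # variance share of each kept component
```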
Quick Revision Points
Core Definitions:
- Dimensionality = number of features in dataset.
- Dimensionality Reduction = reduce features while keeping information.
- Row Vector = 1 × n → represents one sample.
- Column Vector = n × 1 → represents one feature.
- Dataset Matrix = m × n → m samples, n features.
Preprocessing Formulas:
| Technique | Formula | Output |
|---|---|---|
| Mean | μⱼ = Σ xᵢⱼ / m | Average per feature |
| Normalization | (x - min) / (max - min) | [0, 1] |
| Standardization | (x - μ) / σ | Mean = 0, σ = 1 |
| Covariance | Cov(j,k) = Σ (xᵢⱼ - μⱼ)(xᵢₖ - μₖ) / m | Direction of the relationship |
PCA β 6 Steps:
- Standardize data
- Compute covariance matrix
- Compute eigenvalues & eigenvectors
- Sort by eigenvalue (descending)
- Select top k components
- Project data onto new space
Key PCA Facts:
- Eigenvector = direction of PC.
- Eigenvalue = variance explained by PC.
- PC1 = most variance, PC2 = second most, etc.
- PCs are always perpendicular (orthogonal) to each other.
- Explained variance ratio = λᵢ / Σλ × 100%.
Expected Exam Questions
PYQs will be added after analysis - check back soon.
These notes were compiled by Deepak Modi
Last updated: May 2026