ML — Semester 8

Unit 2: Dimensionality Reduction

Dataset representation (vectors, matrices), data preprocessing (normalization, standardization, covariance), and PCA for dimensionality reduction.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Dimensionality Reduction: Definition, Row vector and Column vector, how to represent a dataset, how to represent a dataset as a Matrix, Data preprocessing in Machine Learning: Feature Normalization, Mean of a data matrix, Column Standardization, Co-variance of a Data Matrix, Principal Component Analysis for Dimensionality reduction.


🎯 PYQ Analysis for Unit 2

PYQs will be added after analysis — check back soon.


Section 1: Dimensionality Reduction

1.1 What is Dimensionality?

In Machine Learning, dimensionality refers to the number of features (columns / attributes) in a dataset.

Example:

A dataset of houses may have features:

  • Size (sq ft), No. of Rooms, Location, Price, Age, Floor, Parking, Garden, ...

If there are 100 features, we say the dataset has 100 dimensions.


1.2 What is Dimensionality Reduction?

Definition:

Dimensionality Reduction is the process of reducing the number of features (dimensions) in a dataset while retaining as much important information as possible.

Instead of working with 100 features, we may reduce to 10 features that capture most of the information.

Simple Analogy:

Imagine you have a 3D sculpture. To describe it to someone, you could draw a 2D shadow (projection) of it. The shadow is simpler but still represents the key shape of the object. Dimensionality reduction works the same way — it projects high-dimensional data into a lower dimension without losing the main structure.


1.3 Why is Dimensionality Reduction Needed?

The Curse of Dimensionality:

  • As features increase, the data becomes sparse in the high-dimensional space.
  • ML models need exponentially more data to learn as dimensions grow.
  • Beyond a certain point, adding more features hurts model performance.

Problems with Too Many Dimensions:

  1. Overfitting — model memorizes training data, fails on new data.
  2. High Computation Cost — more features = more memory and time.
  3. Irrelevant Features — some features add noise, not signal.
  4. Visualization — we can only visualize 2D or 3D data directly.
  5. Multicollinearity — many features may be correlated (redundant).

Benefits of Dimensionality Reduction:

✅ Reduces training time.
✅ Removes noise and redundant features.
✅ Prevents overfitting.
✅ Makes data easier to visualize.
✅ Improves model accuracy in many cases.

Two Main Approaches:

Approach            Technique                   Description
Feature Selection   Filter, Wrapper, Embedded   Select a subset of original features
Feature Extraction  PCA, LDA, Autoencoders      Create new (fewer) features from originals

Section 2: Vectors and Dataset Representation

2.1 Row Vector

Definition:

A Row Vector is a matrix with only one row and multiple columns. It represents a single data point (one example/sample) with all its features.

Notation:

Row vector:  x = [x₁  x₂  x₃  ...  xₙ]

Shape: (1 × n)  → 1 row, n columns

Example:

One student's exam record:

x = [85  72  90  65  78]
     M1  M2  M3  M4  M5

Here, each value is a score in a subject — one row = one student.


2.2 Column Vector

Definition:

A Column Vector is a matrix with only one column and multiple rows. It represents a single feature across all data points (all values of one attribute).

Notation:

Column vector:
        ┌ x₁ ┐
        │ x₂ │
   x =  │ x₃ │
        │ .. │
        └ xₙ ┘

Shape: (n × 1)  → n rows, 1 column

Example:

All students' scores in subject Math (M1):

     ┌ 85 ┐
     │ 72 │
x =  │ 90 │
     │ 65 │
     └ 78 ┘

Here, each value is one student's Math score — one column = one feature.

Comparison:

            Row Vector                   Column Vector
Shape       1 × n                        n × 1
Represents  One data point (sample)      One feature (attribute)
Example     [85, 72, 90] — one student   [85, 72, 65] — all students' math marks
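The two shapes can be checked directly in code. A minimal sketch using NumPy (an assumption — the notes themselves don't prescribe any library):

```python
import numpy as np

# One student's scores across 5 subjects: a row vector, shape (1, n).
row = np.array([[85, 72, 90, 65, 78]])

# All five students' Math scores: a column vector, shape (m, 1).
col = np.array([[85], [72], [90], [65], [78]])

print(row.shape)   # (1, 5)  -> 1 row, n columns
print(col.shape)   # (5, 1)  -> n rows, 1 column
```

The double brackets matter: `np.array([85, 72, 90])` would be a flat 1-D array with shape `(3,)`, not a row or column vector in the matrix sense.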

2.3 How to Represent a Dataset

A dataset is a collection of data points (examples / samples).

For ML:

  • Each row = one data point (one example / observation).
  • Each column = one feature (attribute / variable).

Example Dataset (5 students, 3 subjects):

Student   Math   Physics   Chemistry
S1        85     78        90
S2        72     65        80
S3        90     88        92
S4        60     70        65
S5        78     82        75
  • Rows (m = 5): number of samples
  • Columns (n = 3): number of features

2.4 How to Represent a Dataset as a Matrix

The dataset is stored as a 2D matrix X of shape (m × n):

  • m = number of rows (data points / samples)
  • n = number of columns (features / dimensions)

Matrix Notation:

       Feature 1   Feature 2   Feature 3
           ↓           ↓           ↓
       ┌                             ┐
 S1 →  │  x₁₁       x₁₂       x₁₃   │
 S2 →  │  x₂₁       x₂₂       x₂₃   │
 S3 →  │  x₃₁       x₃₂       x₃₃   │
 S4 →  │  x₄₁       x₄₂       x₄₃   │
 S5 →  │  x₅₁       x₅₂       x₅₃   │
       └                             ┘

  X  =  m × n matrix
       (5 × 3 in this example)

For the student example:

        ┌ 85   78   90 ┐
        │ 72   65   80 │
   X =  │ 90   88   92 │
        │ 60   70   65 │
        └ 78   82   75 ┘

Shape: 5 × 3   (5 students, 3 features)

How to access elements:

  • xᵢⱼ = element in row i, column j
  • x₁₂ = 78 (Student 1's Physics score)
  • x₃₁ = 90 (Student 3's Math score)
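The student matrix and its element access can be sketched in NumPy (an assumed library choice; note that NumPy indexing is 0-based, so x₁₂ in the notation above becomes `X[0, 1]`):

```python
import numpy as np

# The 5 x 3 student dataset from the text: rows = samples, columns = features.
X = np.array([
    [85, 78, 90],   # S1
    [72, 65, 80],   # S2
    [90, 88, 92],   # S3
    [60, 70, 65],   # S4
    [78, 82, 75],   # S5
])

print(X.shape)    # (5, 3) -> m = 5 samples, n = 3 features
print(X[0, 1])    # 78 (Student 1's Physics score, i.e. x₁₂ in 1-based notation)
print(X[2, 0])    # 90 (Student 3's Math score, i.e. x₃₁)
```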

Special Cases:

Case    Shape   Name
m = 1   1 × n   Row vector (one sample)
n = 1   m × 1   Column vector (one feature)
m = n   n × n   Square matrix

Section 3: Data Preprocessing in Machine Learning

3.1 Why Preprocess Data?

Real-world data is raw, messy, and inconsistent. Before feeding it to an ML algorithm, it must be cleaned and transformed. This is called Data Preprocessing.

Problems in Raw Data:

  • Missing values
  • Inconsistent formats
  • Features on very different scales (e.g., age: 25, salary: 50000)
  • Redundant features

Preprocessing Steps Covered Here:

  1. Computing the Mean of a data matrix
  2. Feature Normalization
  3. Column Standardization
  4. Computing the Covariance matrix

3.2 Mean of a Data Matrix

Definition:

The mean (average) of a data matrix is calculated column-wise (per feature). It gives the average value of each feature across all data points.

Formula:

For feature j:

         m
         Σ  xᵢⱼ
        i=1
μⱼ =   ─────────
           m

Where:
  xᵢⱼ = value in row i, column j
  m   = number of data points (rows)
  μⱼ  = mean of feature j

Example:

Dataset (3 students, 2 features):

        ┌ 80   60 ┐
   X =  │ 90   70 │
        └ 70   80 ┘

Calculate mean of each column:

μ₁ (Math mean)    = (80 + 90 + 70) / 3 = 240 / 3 = 80

μ₂ (Physics mean) = (60 + 70 + 80) / 3 = 210 / 3 = 70

Mean vector:

μ = [80,  70]

Use of Mean:

  • Used for centering data (subtracting mean from each value).
  • Used in normalization, standardization, and PCA.
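The column-wise mean and the centering step can be verified on the example above; a small sketch assuming NumPy:

```python
import numpy as np

# The 3 x 2 matrix from the example above.
X = np.array([[80, 60],
              [90, 70],
              [70, 80]], dtype=float)

# Column-wise (per-feature) mean: axis=0 averages down the rows.
mu = X.mean(axis=0)
print(mu)                 # [80. 70.]

# Centering: subtract the mean vector from every row.
Xc = X - mu
print(Xc.mean(axis=0))    # [0. 0.] -> centered data has zero mean per column
```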

3.3 Feature Normalization

Definition:

Feature Normalization (also called Min-Max Scaling) rescales each feature so that its values fall in the range [0, 1] (or sometimes [-1, 1]).

Why needed?

  • If Math scores range from 0–100 and Age ranges from 18–25, the algorithm may give more weight to Math just because the values are larger.
  • Normalization puts all features on the same scale so no feature dominates.

Formula (Min-Max Normalization):

         xᵢⱼ - min(xⱼ)
x'ᵢⱼ =  ─────────────────
         max(xⱼ) - min(xⱼ)

Where:
  xᵢⱼ     = original value
  min(xⱼ) = minimum value of feature j
  max(xⱼ) = maximum value of feature j
  x'ᵢⱼ    = normalized value (between 0 and 1)

Example:

Feature values (Math): [60, 70, 80, 90, 100]

min = 60,  max = 100

Normalize 60:  (60 - 60) / (100 - 60) = 0/40  = 0.00
Normalize 70:  (70 - 60) / (100 - 60) = 10/40 = 0.25
Normalize 80:  (80 - 60) / (100 - 60) = 20/40 = 0.50
Normalize 90:  (90 - 60) / (100 - 60) = 30/40 = 0.75
Normalize 100: (100 - 60)/ (100 - 60) = 40/40 = 1.00

Result: [0.00, 0.25, 0.50, 0.75, 1.00]

Properties:

  • Output is always in [0, 1].
  • Sensitive to outliers (extreme values distort the range).
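The min-max calculation above is one vectorized expression in code. A sketch assuming NumPy:

```python
import numpy as np

# Math scores from the example above.
x = np.array([60, 70, 80, 90, 100], dtype=float)

# Min-max scaling: (x - min) / (max - min) maps the feature into [0, 1].
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)   # [0.   0.25 0.5  0.75 1.  ]
```

Replacing 100 with an outlier like 1000 shows the sensitivity: all other values would be squeezed into [0, 0.032].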

3.4 Column Standardization (Z-Score Normalization)

Definition:

Column Standardization (Z-Score Standardization) transforms each feature so that it has a mean of 0 and a standard deviation of 1.

Why used instead of normalization?

  • Normalization is sensitive to outliers (one extreme value compresses all other values into a narrow range).
  • Standardization is less affected — an outlier still shifts μ and σ somewhat, but it does not squash the rest of the data the way min-max scaling does.
  • Preferred or required by many algorithms: PCA, SVM, KNN, Logistic Regression.

Formulas:

Step 1 — Calculate Mean (μⱼ):

         Σ xᵢⱼ
μⱼ =   ─────────
           m

Step 2 — Calculate Standard Deviation (σⱼ):

            ______________________
           /  1   m
σⱼ =      /  ─── Σ (xᵢⱼ - μⱼ)²
        \/    m  i=1

Step 3 — Standardize:

         xᵢⱼ - μⱼ
z'ᵢⱼ =  ──────────
            σⱼ

Example:

Feature (Math scores): [60, 70, 80, 90, 100]

Mean:  μ = (60+70+80+90+100) / 5 = 400 / 5 = 80

Variance:
  σ² = [(60-80)² + (70-80)² + (80-80)² + (90-80)² + (100-80)²] / 5
     = [400 + 100 + 0 + 100 + 400] / 5
     = 1000 / 5
     = 200

Std Dev:  σ = √200 ≈ 14.14

Standardize:
  z(60)  = (60 - 80)  / 14.14 = -20/14.14 ≈ -1.41
  z(70)  = (70 - 80)  / 14.14 = -10/14.14 ≈ -0.71
  z(80)  = (80 - 80)  / 14.14 =   0/14.14 =  0.00
  z(90)  = (90 - 80)  / 14.14 =  10/14.14 ≈ +0.71
  z(100) = (100 - 80) / 14.14 =  20/14.14 ≈ +1.41

Result: [-1.41, -0.71, 0.00, +0.71, +1.41]

Properties:

  • Mean of standardized column = 0.
  • Standard deviation of standardized column = 1.
  • Values can be negative (unlike normalization).
  • Less sensitive to outliers than min-max normalization (though extreme values still affect μ and σ).
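The worked example can be reproduced in a few lines. A sketch assuming NumPy (`np.std` divides by m by default, matching the population formula used above):

```python
import numpy as np

x = np.array([60, 70, 80, 90, 100], dtype=float)

mu = x.mean()        # 80.0
sigma = x.std()      # population std (divides by m): sqrt(200) ~ 14.14
z = (x - mu) / sigma

print(np.round(z, 2))             # [-1.41 -0.71  0.    0.71  1.41]
# After standardization the column has mean ~0 and standard deviation 1:
print(round(float(z.std()), 6))   # 1.0
```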

Normalization vs Standardization:

                     Normalization              Standardization
Range                [0, 1]                     (-∞, +∞)
Mean                 Not fixed                  Always 0
Std Dev              Not fixed                  Always 1
Outlier Sensitivity  High                       Low
Formula              (x - min)/(max - min)      (x - μ)/σ
Use When             Bounded output needed      Mean = 0, σ = 1 needed

3.5 Covariance of a Data Matrix

Definition:

Covariance measures how much two features change together. It tells us the relationship (direction) between two variables.

Interpretation:

Covariance   Meaning
Positive     Both features increase together
Negative     One increases, the other decreases
Zero         No linear relationship between them

Formula (Covariance between feature j and feature k):

           1   m
Cov(j,k) = ─── Σ  (xᵢⱼ - μⱼ)(xᵢₖ - μₖ)
           m  i=1

Where:
  xᵢⱼ = value of feature j in sample i
  μⱼ  = mean of feature j
  m   = number of samples

Note: Some formulas use (m-1) in the denominator for an unbiased estimate.


3.6 Covariance Matrix

When a dataset has n features, the Covariance Matrix is an n × n matrix where element (j, k) is the covariance between feature j and feature k.

Structure:

           F1        F2        F3
        ┌                           ┐
  F1 →  │Cov(1,1) Cov(1,2) Cov(1,3) │
  F2 →  │Cov(2,1) Cov(2,2) Cov(2,3) │
  F3 →  │Cov(3,1) Cov(3,2) Cov(3,3) │
        └                           ┘

Diagonal elements: Cov(j,j) = Variance of feature j
Off-diagonal elements: Covariance between two different features

Properties:

  • Covariance matrix is always symmetric: Cov(j,k) = Cov(k,j)
  • Diagonal = variance of each feature
  • Off-diagonal = covariance between pairs of features

3.7 Worked Example β€” Covariance Matrix

Dataset (3 samples, 2 features):

        F1   F2
        ┌        ┐
  S1 →  │ 2    4 │
  S2 →  │ 4    6 │
  S3 →  │ 6    8 │
        └        ┘

Step 1: Calculate means:

μ₁ = (2 + 4 + 6) / 3 = 12/3 = 4
μ₂ = (4 + 6 + 8) / 3 = 18/3 = 6

Step 2: Center the data (subtract mean):

        F1-μ₁   F2-μ₂
  S1:   2-4=-2  4-6=-2
  S2:   4-4= 0  6-6= 0
  S3:   6-4=+2  8-6=+2

Step 3: Calculate covariances:

Cov(F1,F1) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = [4 + 0 + 4] / 3 = 8/3 ≈ 2.67

Cov(F2,F2) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = 8/3 ≈ 2.67

Cov(F1,F2) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = [4 + 0 + 4] / 3 = 8/3 ≈ 2.67

Covariance Matrix:

        ┌ 2.67   2.67 ┐
  Σ  =  └ 2.67   2.67 ┘

Interpretation: Cov(F1,F2) = 2.67 > 0 → F1 and F2 increase together (positive relationship).
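The same covariance matrix falls out of the centered-data formula in a couple of lines. A sketch assuming NumPy (note that `np.cov` divides by m−1 by default; `bias=True` matches the 1/m convention used in these notes):

```python
import numpy as np

# The 3 x 2 dataset from the worked example.
X = np.array([[2, 4],
              [4, 6],
              [6, 8]], dtype=float)

m = X.shape[0]
Xc = X - X.mean(axis=0)        # center each column

# Population covariance matrix: (1/m) * Xc^T Xc.
cov = (Xc.T @ Xc) / m
print(np.round(cov, 2))        # [[2.67 2.67]
                               #  [2.67 2.67]]

# np.cov expects variables as rows, hence X.T; bias=True divides by m.
print(np.allclose(cov, np.cov(X.T, bias=True)))   # True
```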


Section 4: Principal Component Analysis (PCA)

4.1 What is PCA?

Definition:

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into a new set of features called Principal Components (PCs). These components:

  • Are uncorrelated with each other.
  • Are ordered so that the first PC captures the most variance, the second captures the next most, and so on.

The goal is to keep the most informative components and discard the rest, reducing dimensionality without losing much information.

Simple Analogy:

Imagine looking at a 3D object from different angles. From some angles, you see the object's full shape clearly. PCA finds the best angle to view the data so that as much information as possible is visible in fewer dimensions.


4.2 Key Concepts in PCA

Term                        Meaning
Principal Components (PCs)  New axes in the transformed space
Eigenvectors                Direction of each principal component
Eigenvalues                 Amount of variance captured by each PC
Variance                    How spread out the data is along a direction
Explained Variance Ratio    % of total variance captured by each PC

4.3 Intuition Behind PCA

Why eigenvectors and eigenvalues?

  • Eigenvectors point in the directions of maximum variance in the data.
  • Eigenvalues tell how much variance each eigenvector captures.
  • The first eigenvector (PC1) captures the most variance.
  • The second eigenvector (PC2) is perpendicular to PC1 and captures the next most.

Diagram:

  PC2 (less variance)
   ↑
   β”‚         ● ●
   β”‚       ●   ●
   β”‚         ●   ●
   β”‚       ●   ●
   └───────────────────→ PC1 (most variance)

Data is elongated along PC1.
If we keep only PC1, we preserve most information.

4.4 Steps of PCA

Step-by-Step Process:

   ┌──────────────────────────────┐
   │ Step 1: Standardize the Data │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 2: Compute Covariance   │
   │         Matrix               │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 3: Compute Eigenvalues  │
   │         and Eigenvectors     │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 4: Sort by Eigenvalue   │
   │         (Descending)         │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 5: Select Top k         │
   │         Principal Components │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 6: Project Data onto    │
   │         New Feature Space    │
   └──────────────────────────────┘

4.5 Detailed Explanation of Each Step

Step 1: Standardize the Data

  • Apply column standardization (mean = 0, σ = 1) to all features.
  • This ensures no feature dominates due to its scale.

         xᵢⱼ - μⱼ
z'ᵢⱼ =  ──────────
            σⱼ

Step 2: Compute the Covariance Matrix

  • Calculate the n × n covariance matrix of the standardized data.

        1
Σ  =   ───  Xᵀ X     (after mean-centering)
        m

  • Diagonal: variance of each feature.
  • Off-diagonal: how features are correlated.

Step 3: Compute Eigenvalues and Eigenvectors

  • Solve the characteristic equation:

det(Σ - λI) = 0

Where:
  λ = eigenvalue
  I = identity matrix

  • For each eigenvalue λ, solve (Σ - λI)v = 0 for eigenvector v.
  • Eigenvalue → amount of variance the component explains.
  • Eigenvector → direction of the component.


Step 4: Sort by Eigenvalue (Descending)

  • Sort eigenvalues from largest to smallest.
  • The corresponding eigenvectors are sorted in the same order.

λ₁ ≥ λ₂ ≥ λ₃ ≥ ... ≥ λₙ
 ↓    ↓    ↓
PC1  PC2  PC3  ...  (ordered by importance)

  • PC1 explains the most variance.
  • PC2 explains the second most, and so on.

Step 5: Select Top k Principal Components

  • Decide how many components k to keep.
  • Common rule: keep enough PCs to explain ≥ 95% of total variance.

                      Σᵢ₌₁ᵏ λᵢ
Explained Variance =  ────────  × 100%
                      Σᵢ₌₁ⁿ λᵢ

Example:

Eigenvalues: λ₁ = 5.0,  λ₂ = 3.0,  λ₃ = 1.5,  λ₄ = 0.5
Total = 10.0

PC1 explains: 5.0/10.0 = 50%
PC2 explains: 3.0/10.0 = 30%
PC1+PC2 combined: 80%
PC1+PC2+PC3: 95%

→ Keep 3 components to explain 95% of variance.
→ Reduce from 4 features to 3 features.
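The cumulative explained-variance rule above is a simple running sum; a sketch in plain Python using the eigenvalues from the example:

```python
# Cumulative explained variance for eigenvalues 5.0, 3.0, 1.5, 0.5.
eigenvalues = [5.0, 3.0, 1.5, 0.5]
total = sum(eigenvalues)       # 10.0

ratios = []
cumulative = 0.0
for k, lam in enumerate(eigenvalues, start=1):
    cumulative += lam
    ratios.append(100 * cumulative / total)
    print(f"keep {k} PC(s): {ratios[-1]:.0f}% of variance explained")

# keep 1 PC(s): 50% of variance explained
# keep 2 PC(s): 80% of variance explained
# keep 3 PC(s): 95% of variance explained
# keep 4 PC(s): 100% of variance explained
```

With a 95% threshold, the loop confirms k = 3 is the smallest number of components that qualifies.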

Step 6: Project Data onto New Feature Space

  • Create a projection matrix W from the top k eigenvectors.
  • Multiply the original data matrix by W:

Z  =  X  ×  W

Where:
  X = original data matrix (m × n)
  W = top k eigenvectors   (n × k)
  Z = projected data       (m × k)

  • Z is the reduced dataset with only k dimensions.

4.6 PCA Example (Worked)

Dataset (4 samples, 2 features):

   F1   F2
    2    4
    4    6
    6    8
    8   10

Step 1: Standardize

μ₁ = 5,  σ₁ ≈ 2.24
μ₂ = 7,  σ₂ ≈ 2.24

Standardized:
   z₁     z₂
  -1.34  -1.34
  -0.45  -0.45
   0.45   0.45
   1.34   1.34

Step 2: Covariance Matrix

       ┌ 1.0  1.0 ┐
  Σ =  └ 1.0  1.0 ┘

Step 3: Eigenvalues

det(Σ - λI) = 0
(1 - λ)² - 1 = 0
λ² - 2λ = 0
λ(λ - 2) = 0
λ₁ = 2,  λ₂ = 0

Step 4: Eigenvectors

For λ₁ = 2:  v₁ = [1/√2,  1/√2] = [ 0.71,  0.71]
For λ₂ = 0:  v₂ = [1/√2, -1/√2] = [ 0.71, -0.71]

Step 5: Select PC1 only (explains 100% of variance)

W = [0.71, 0.71]ᵀ

Step 6: Project

Z = Standardized data × W

Z₁ = (-1.34)(0.71) + (-1.34)(0.71) = -1.90
Z₂ = (-0.45)(0.71) + (-0.45)(0.71) = -0.64
Z₃ = ( 0.45)(0.71) + ( 0.45)(0.71) =  0.64
Z₄ = ( 1.34)(0.71) + ( 1.34)(0.71) =  1.90

Reduced dataset: [-1.90, -0.64, 0.64, 1.90]
(2 features → 1 feature; nothing is lost here because F2 is an exact linear function of F1, so PC1 captures 100% of the variance)
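All six steps of the worked example can be run end-to-end. A sketch assuming NumPy; keep in mind the sign of an eigenvector is arbitrary, so the projected values may come out negated relative to the text, and the text's 0.64 becomes 0.63 here because it rounded the eigenvector components to 0.71 first:

```python
import numpy as np

# The 4 x 2 dataset from the worked example.
X = np.array([[2, 4], [4, 6], [6, 8], [8, 10]], dtype=float)

# Step 1: standardize (population std, as in the text).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized data (divide by m).
m = Z.shape[0]
C = (Z.T @ Z) / m                        # ~ [[1, 1], [1, 1]]

# Steps 3-4: eigen-decomposition; for a symmetric matrix np.linalg.eigh
# returns eigenvalues in ascending order, so reverse for descending.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
# eigvals ~ [2, 0]: PC1 alone carries all the variance.

# Steps 5-6: keep the top k = 1 eigenvector and project.
W = eigvecs[:, :1]                       # shape (2, 1)
Zp = Z @ W                               # shape (4, 1): the reduced dataset
print(np.round(np.abs(Zp.ravel()), 2))  # magnitudes ~ [1.9, 0.63, 0.63, 1.9]
```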

4.7 Advantages and Disadvantages of PCA

Advantages:

✅ Reduces training time and memory.
✅ Removes correlated (redundant) features.
✅ Reduces overfitting.
✅ Enables 2D/3D visualization of high-dimensional data.
✅ Reduces noise (low-variance components are often noise).

Disadvantages:

❌ Principal components are hard to interpret (they are combinations of the original features, not the features themselves).
❌ Sensitive to outliers (which skew the covariance matrix).
❌ Assumes linear relationships between features.
❌ Requires standardized data (preprocessing needed).
❌ Information is lost (how much depends on the chosen k).


4.8 When to Use PCA

  • When you have many features (high-dimensional data).
  • When many features are correlated with each other.
  • When you want to visualize data in 2D or 3D.
  • When training is too slow due to many features.
  • When model is overfitting due to too many features.

Quick Revision Points

Core Definitions:

  • Dimensionality = number of features in dataset.
  • Dimensionality Reduction = reduce features while keeping information.
  • Row Vector = 1 × n — represents one sample.
  • Column Vector = n × 1 — represents one feature.
  • Dataset Matrix = m × n — m samples, n features.

Preprocessing Formulas:

Technique        Formula                           Output
Mean             μⱼ = Σ xᵢⱼ / m                    Average per feature
Normalization    (x - min) / (max - min)           Values in [0, 1]
Standardization  (x - μ) / σ                       Mean = 0, σ = 1
Covariance       Cov(j,k) = Σ(xᵢⱼ-μⱼ)(xᵢₖ-μₖ)/m    Relationship strength

PCA β€” 6 Steps:

  1. Standardize data
  2. Compute covariance matrix
  3. Compute eigenvalues & eigenvectors
  4. Sort by eigenvalue (descending)
  5. Select top k components
  6. Project data onto new space

Key PCA Facts:

  • Eigenvector = direction of PC.
  • Eigenvalue = variance explained by PC.
  • PC1 = most variance, PC2 = second most, etc.
  • PCs are always perpendicular (orthogonal) to each other.
  • Explained variance ratio = λᵢ / Σλ × 100%.

Expected Exam Questions

PYQs will be added after analysis — check back soon.



Found an error or want to contribute?

This content is open-source and maintained by the community. Help us improve it!