ML · Semester 8

Unit 2: Dimensionality Reduction

Dataset representation (vectors, matrices), data preprocessing (normalization, standardization, covariance), and PCA for dimensionality reduction.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Dimensionality Reduction: Definition, Row vector and Column vector, how to represent a dataset, how to represent a dataset as a Matrix, Data preprocessing in Machine Learning: Feature Normalization, Mean of a data matrix, Column Standardization, Co-variance of a Data Matrix, Principal Component Analysis for Dimensionality reduction.


🎯 PYQ Analysis for Unit 2

High Priority Topics (15 marks questions)

  1. PCA - Definition, Steps, Advantages & Disadvantages - (2023: 15 marks, 2022: 15 marks)
  2. Data Preprocessing (Normalization, Mean, Standardization, Covariance) - (2024: 15 marks)
  3. Dataset as a Matrix + Advantages of Matrix Representation - (2024: 15 marks)
  4. Feature Extraction vs Feature Selection (Subset Selection) - (2022: 15 marks)
  5. Dimensionality Reduction (Advantages & Disadvantages) - (2023: 15 marks)

Medium Priority Topics (Short answers)

  1. Dimensionality Reduction (Definition + Importance) - 2024 (2.5 marks)
  2. Column Vector - 2024 (2.5 marks)
  3. Dataset as a Matrix - 2024 (2.5 marks)
  4. PCA - Dimensionality Reduction - 2024 (2.5 marks)
  5. Two Data Preprocessing Techniques - 2024 (2.5 marks)

Section 1: Dimensionality Reduction

1.1 What is Dimensionality?

In Machine Learning, dimensionality refers to the number of features (columns / attributes) in a dataset.

Example:

A dataset of houses may have features:

  • Size (sq ft), No. of Rooms, Location, Price, Age, Floor, Parking, Garden, ...

If there are 100 features, we say the dataset has 100 dimensions.


1.2 What is Dimensionality Reduction?

PYQ: What is Dimensionality reduction? (2023, 7.5 marks)
PYQ: Define Dimensionality Reduction and why it is important in Machine Learning? (2024, 2.5 marks)

Definition:

Dimensionality Reduction is the process of reducing the number of features (dimensions) in a dataset while retaining as much important information as possible.

Instead of working with 100 features, we may reduce to 10 features that capture most of the information.

Simple Analogy:

Imagine you have a 3D sculpture. To describe it to someone, you could draw a 2D shadow (projection) of it. The shadow is simpler but still represents the key shape of the object. Dimensionality reduction works the same way: it projects high-dimensional data into a lower dimension while keeping as much of the main structure as possible.


1.3 Why is Dimensionality Reduction Needed?

PYQ: Write advantages and Disadvantages of Dimensionality Reduction. (2023, 7.5 marks)
PYQ: Compare Feature Extraction and Feature Selection techniques. Explain how dimensionality can be reduced using subset selection procedure. (2022, 15 marks)

The Curse of Dimensionality:

  • As features increase, the data becomes sparse in the high-dimensional space.
  • ML models need exponentially more data to learn as dimensions grow.
  • Beyond a certain point, adding more features hurts model performance.

Problems with Too Many Dimensions:

  1. Overfitting: model memorizes training data, fails on new data.
  2. High Computation Cost: more features = more memory and time.
  3. Irrelevant Features: some features add noise, not signal.
  4. Visualization: we can only visualize 2D or 3D data.
  5. Multicollinearity: many features may be correlated (redundant).

Benefits of Dimensionality Reduction:

✅ Reduces training time.
✅ Removes noise and redundant features.
✅ Helps reduce overfitting.
✅ Makes data easier to visualize.
✅ Improves model accuracy in many cases.

Two Main Approaches:

Approach           | Technique                 | Description
Feature Selection  | Filter, Wrapper, Embedded | Select a subset of original features
Feature Extraction | PCA, LDA, Autoencoders    | Create new (fewer) features from originals

Advantages of Dimensionality Reduction

PYQ: Write advantages and Disadvantages of Dimensionality Reduction. (2023, 7.5 marks)

  1. Data Compression: reduces storage space needed for the dataset.
  2. Less Computation Time: fewer features = faster training.
  3. Removes Redundant Features: drops correlated / duplicated columns.
  4. Improved Visualization: high-dimensional data can be plotted in 2D / 3D for better understanding.
  5. Overfitting Prevention: fewer features reduce model complexity and improve generalization.
  6. Feature Extraction: pulls out the most informative features automatically (useful for downstream ML models).
  7. Data Preprocessing: acts as a preprocessing step that improves the performance of the actual ML model.
  8. Improved Performance: by reducing noise and irrelevant information, the model trains on cleaner signal.

Disadvantages of Dimensionality Reduction

  1. Information Loss: some data is always lost when dimensions are reduced.
  2. Linear Correlations Only: PCA finds only linear correlations between variables, which is sometimes undesirable.
  3. Mean/Covariance Insufficient: PCA fails when mean and covariance are not enough to define the dataset.
  4. Choosing k is Hard: we may not know how many principal components to retain; thumb rules are applied in practice.
  5. Interpretability: reduced dimensions are not easily interpretable; the relationship between original features and new features is hard to explain.
  6. Overfitting Risk: in some cases dimensionality reduction itself can lead to overfitting, especially when the number of components is chosen based on the training data.
  7. Sensitivity to Outliers: many techniques (like PCA) are sensitive to outliers, which can bias the representation.
  8. Computational Complexity: some techniques (e.g., manifold learning) are computationally expensive on large datasets.

1.4 Feature Selection vs Feature Extraction

PYQ: Compare Feature Extraction and Feature Selection techniques. (2022, part of 15 marks)

Both techniques reduce dimensionality, but in very different ways.

# | Feature Selection | Feature Extraction
1 | Selects a subset of relevant features from the original set | Extracts a new set of features that are more informative and compact
2 | Reduces the dimensionality of the feature space and simplifies the model | Captures essential information and represents it in a lower-dimensional feature space
3 | Categorized into Filter, Wrapper, Embedded methods | Categorized into Linear and Non-linear methods
4 | Requires domain knowledge and feature engineering | Can be applied to raw data without feature engineering
5 | Improves interpretability and reduces overfitting | Improves model performance and handles non-linear relationships
6 | May lose information if wrong features are selected | May introduce noise/redundancy if extracted features are not informative

Key takeaway:

  • Feature Selection = keeps original features (just throws some away).
  • Feature Extraction = creates new features (e.g., principal components in PCA).

1.5 Subset Selection Procedure

PYQ: Explain how dimensionality can be reduced using subset selection procedure. (2022, part of 15 marks)

What is Subset Selection?

Subset selection (also called feature selection, variable selection, or attribute selection) is the process of selecting a subset of relevant features (variables, predictors) from the original set for use in model construction.

Why use Subset Selection?

  1. Simplification of models: easier to interpret by researchers/users.
  2. Shorter training times: fewer features = faster computation.
  3. Avoid the curse of dimensionality.
  4. Enhanced generalization: reduces overfitting by removing redundant/irrelevant features.

Central premise: The data contains many features that are either redundant or irrelevant, and can be removed without much loss of information.

Two Main Approaches:

  1. Forward Selection: start empty, keep adding features.
  2. Backward Selection: start with all, keep removing features.

A. Forward Selection

Idea: Start with no features, and add them one at a time. At each step, add the feature that decreases the validation error the most. Stop when no further addition improves performance.

Notation:

  • n = number of input variables.
  • x₁, x₂, ..., xₙ = input variables.
  • Fᵢ = a subset of the set of input variables.
  • E(Fᵢ) = error on the validation sample when only the inputs in Fᵢ are used.

Algorithm:

1. Set F₀ = ∅  and  E(F₀) = ∞

2. For i = 0, 1, 2, ..., repeat until E(Fᵢ₊₁) ≥ E(Fᵢ):

   (a) For each input variable xⱼ not in Fᵢ:
         Train the model with input variables Fᵢ ∪ {xⱼ}
         Calculate E(Fᵢ ∪ {xⱼ}) on the validation set.

   (b) Choose xₘ that causes the least error:
         m = arg min  E(Fᵢ ∪ {xⱼ})
                 j

   (c) Set Fᵢ₊₁ = Fᵢ ∪ {xₘ}.

3. The set Fᵢ is output as the best subset.

Remarks:

  1. Stopping criterion: stop if adding any feature does not decrease the error E. You may stop earlier if the decrease is too small, using a user-defined threshold based on application constraints.

  2. Complexity: To reduce from n features to k features, we train and test the model:

n + (n-1) + (n-2) + ... + (n-k) times  =  O(n²)

(The last term, n-k, is the final round, which is evaluated but fails to improve the error.)

So forward selection can be expensive for very high n.
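
The forward selection loop above can be sketched in a few lines of Python. This is only an illustrative sketch: `error_fn` is a hypothetical callback standing in for "train the model and measure validation error", and `toy_error` with its feature names and numbers is invented purely so the example runs.

```python
def forward_selection(features, error_fn):
    """Greedy forward selection over a list of feature names.

    error_fn is a hypothetical callback: it takes a list of feature
    names and returns the validation error E(F) for that subset.
    """
    selected = []                 # F_0 = empty set
    best_error = float("inf")     # E(F_0) = infinity
    while len(selected) < len(features):
        candidates = [f for f in features if f not in selected]
        # (a) evaluate E(F_i U {x_j}) for every unused feature x_j
        scored = [(error_fn(selected + [f]), f) for f in candidates]
        # (b) pick the feature giving the smallest error
        new_error, best_feature = min(scored)
        # stop when E(F_{i+1}) >= E(F_i)
        if new_error >= best_error:
            break
        # (c) F_{i+1} = F_i U {x_m}
        selected.append(best_feature)
        best_error = new_error
    return selected, best_error


def toy_error(subset):
    # Made-up error: "size" and "rooms" are useful, and every
    # feature in the subset adds a small penalty of 0.05.
    useful = {"size": 0.5, "rooms": 0.3}
    return 1.0 - sum(useful.get(f, 0.0) for f in subset) + 0.05 * len(subset)


chosen, err = forward_selection(["age", "size", "parking", "rooms"], toy_error)
print(chosen, round(err, 2))
```

With this toy error the loop adds "size", then "rooms", and stops in the third round because no remaining feature lowers the error.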


B. Backward Selection

Idea: Start with all features, and remove them one at a time. At each step, remove the feature whose removal causes the least increase in error. Stop when removing any feature significantly increases the error.

Algorithm:

1. Set F₀ = {x₁, x₂, ..., xₙ}  and  E(F₀) = validation error with all n features

2. For i = 0, 1, 2, ..., repeat until E(Fᵢ₊₁) > E(Fᵢ):

   (a) For each input variable xⱼ in Fᵢ:
         Train the model with input variables Fᵢ − {xⱼ}
         Calculate E(Fᵢ − {xⱼ}) on the validation set.

   (b) Choose xₘ that causes the least error after removal:
         m = arg min  E(Fᵢ − {xⱼ})
                 j

   (c) Set Fᵢ₊₁ = Fᵢ − {xₘ}.

3. The set Fᵢ is output as the best subset.
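
A matching sketch for backward selection is given below. It assumes the starting error is measured with all features in use, so that the first removal is compared against a real baseline. As before, `error_fn` is a hypothetical callback, and `toy_error` with its feature names is made up for illustration.

```python
def backward_selection(features, error_fn):
    """Greedy backward selection over a list of feature names.

    error_fn is a hypothetical callback returning the validation
    error E(F) for a list of feature names.
    """
    selected = list(features)          # F_0 = all features
    best_error = error_fn(selected)    # assumption: baseline error with every feature
    while len(selected) > 1:
        # (a) evaluate E(F_i - {x_j}) for every feature still in use
        scored = [(error_fn([g for g in selected if g != f]), f) for f in selected]
        # (b) the feature whose removal leaves the smallest error
        new_error, dropped = min(scored)
        # stop when E(F_{i+1}) > E(F_i)
        if new_error > best_error:
            break
        # (c) F_{i+1} = F_i - {x_m}
        selected.remove(dropped)
        best_error = new_error
    return selected, best_error


def toy_error(subset):
    # Made-up error: "size" and "rooms" are useful, and every
    # feature in the subset adds a small penalty of 0.05.
    useful = {"size": 0.5, "rooms": 0.3}
    return 1.0 - sum(useful.get(f, 0.0) for f in subset) + 0.05 * len(subset)


kept, err = backward_selection(["age", "size", "parking", "rooms"], toy_error)
print(kept, round(err, 2))
```

On this toy problem the loop drops "age" and "parking" and then stops, because removing either remaining feature would raise the error.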

Forward vs Backward Selection: Comparison

Aspect     | Forward Selection                               | Backward Selection
Start      | Empty set ∅                                     | Full set {x₁, ..., xₙ}
Operation  | Add features one by one                         | Remove features one by one
Stops when | Adding a feature gives no improvement           | Removing a feature increases the error
Speed      | Faster when the target subset is small          | Faster when the target subset is large
Risk       | May miss good features that work in combination | More accurate, but slower for large n
Best for   | When you suspect only a few features matter     | When you suspect most features matter

Both are O(n²) in the worst case, but forward selection is usually preferred when the optimal subset size k is small (k ≪ n).


Section 2: Vectors and Dataset Representation

2.1 Row Vector

Definition:

A Row Vector is a matrix with only one row and multiple columns. It represents a single data point (one example/sample) with all its features.

Notation:

Row vector:  x = [x₁  x₂  x₃  ...  xₙ]

Shape: (1 × n)  → 1 row, n columns

Example:

One student's exam record:

x = [85  72  90  65  78]
     M1  M2  M3  M4  M5

Here, each value is a score in a subject: one row = one student.


2.2 Column Vector

PYQ: Explain column vector. (2024, 2.5 marks)

Definition:

A Column Vector is a matrix with only one column and multiple rows. It represents a single feature across all data points (all values of one attribute).

Notation:

Column vector:
        ┌ x₁ ┐
        │ x₂ │
   x =  │ x₃ │
        │ .. │
        └ xₙ ┘

Shape: (n × 1)  → n rows, 1 column

Example:

All students' scores in subject Math (M1):

     ┌ 85 ┐
     │ 72 │
x =  │ 90 │
     │ 65 │
     └ 78 ┘

Here, each value is one student's Math score: one column = one feature.

Comparison:

           | Row Vector                                      | Column Vector
Shape      | 1 × n                                           | n × 1
Represents | One data point (sample)                         | One feature (attribute)
Example    | [85, 72, 90, 65, 78]: one student's five scores | [85, 72, 90, 65, 78]ᵀ: all students' Math scores

2.3 How to Represent a Dataset

PYQ: What is a dataset? Explain with a suitable example. (2024, part of 15 marks)

Definition:

A dataset is a structured collection of data used for analysis, training, and testing in machine learning. It typically consists of rows and columns, where:

  • Each row represents a single instance or data point (one example / observation / sample).
  • Each column represents a feature (also called an attribute or variable).
  • Optionally, one column is the label (also called the target variable): the value the model must predict.

Example Dataset: Purchase Prediction

Person | Age | Income | Purchased
A      | 25  | 50k    | No
B      | 35  | 60k    | Yes
C      | 45  | 80k    | Yes
D      | 20  | 30k    | No

Here:

  • Features: Age, Income
  • Label (target): Purchased
  • Rows (m=4): number of samples
  • Columns: 2 features + 1 label

Example Dataset (5 students, 3 subjects):

Student | Math | Physics | Chemistry
S1      | 85   | 78      | 90
S2      | 72   | 65      | 80
S3      | 90   | 88      | 92
S4      | 60   | 70      | 65
S5      | 78   | 82      | 75

  • Rows (m = 5): number of samples
  • Columns (n = 3): number of features

2.4 How to Represent a Dataset as a Matrix

PYQ: How do you represent a dataset as a Matrix? (2024, 2.5 marks)
PYQ: What is a dataset? Explain with a suitable example. Demonstrate how to represent a dataset as a Matrix. Discuss the advantages of matrix representation in Machine Learning. (2024, 15 marks)

The dataset is stored as a 2D matrix X of shape (m × n):

  • m = number of rows (data points / samples)
  • n = number of columns (features / dimensions)

Matrix Notation:

       Feature 1   Feature 2   Feature 3
           ↓           ↓           ↓
       ┌                               ┐
 S1 →  │  x₁₁       x₁₂       x₁₃      │
 S2 →  │  x₂₁       x₂₂       x₂₃      │
 S3 →  │  x₃₁       x₃₂       x₃₃      │
 S4 →  │  x₄₁       x₄₂       x₄₃      │
 S5 →  │  x₅₁       x₅₂       x₅₃      │
       └                               ┘

  X  =  m × n matrix
       (5 × 3 in this example)

For the student example:

        ┌ 85   78   90 ┐
        │ 72   65   80 │
   X =  │ 90   88   92 │
        │ 60   70   65 │
        └ 78   82   75 ┘

Shape: 5 × 3   (5 students, 3 features)

How to access elements:

  • xᵢⱼ = element in row i, column j
  • x₁₂ = 78 (Student 1's Physics score)
  • x₃₁ = 90 (Student 3's Math score)
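
As a small illustration, the student matrix can be stored as a list of rows in plain Python. The helper `element` is a hypothetical name, introduced only to map the 1-based indices used in these notes onto Python's 0-based lists; in practice a library such as NumPy would normally hold the matrix.

```python
# Student dataset from the example:
# rows = students S1..S5, columns = Math, Physics, Chemistry
X = [
    [85, 78, 90],
    [72, 65, 80],
    [90, 88, 92],
    [60, 70, 65],
    [78, 82, 75],
]

m = len(X)       # number of samples (rows)
n = len(X[0])    # number of features (columns)


def element(matrix, i, j):
    """Return x_ij using the 1-based row/column indices of the notes."""
    return matrix[i - 1][j - 1]


row_s1 = X[0]                          # row vector: all features of Student 1
math_column = [row[0] for row in X]    # column vector: Math score of every student

print(m, n)                 # shape of the matrix
print(element(X, 1, 2))     # Student 1's Physics score
print(element(X, 3, 1))     # Student 3's Math score
```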

Special Cases:

Case | Shape | Name
m=1  | 1 × n | Row vector (one sample)
n=1  | m × 1 | Column vector (one feature)
m=n  | n × n | Square matrix

2.5 Advantages of Matrix Representation in Machine Learning

PYQ: Discuss the advantages of matrix representation in Machine Learning. (2024, part of 15 marks)

Why do almost all ML algorithms internally represent the dataset as a matrix? Because it unlocks a set of powerful, fast operations from linear algebra.

1. Efficient Computation

Most ML algorithms (e.g. Linear Regression, Logistic Regression, Neural Networks) are built on top of linear algebra. Matrix representation enables fast calculations using highly optimized libraries (NumPy, BLAS, LAPACK).

2. Scalability

Matrix form can handle very large datasets efficiently using operations like matrix multiplication and dot products, which are far faster than per-element Python loops.

3. Parallelization

Matrix operations can be parallelized easily on multi-core CPUs and especially on GPUs. This is why deep learning frameworks (PyTorch, TensorFlow) push everything into matrix form: it makes training orders of magnitude faster.

4. Vectorization

Vectorized matrix code avoids slow Python loops by replacing them with single matrix operations. The result is cleaner code and much faster execution, a core principle behind every modern ML library.

5. Ease of Mathematical Transformation

Common preprocessing steps (feature scaling, normalization, standardization, PCA) are all defined as matrix operations. Once data is in matrix form, applying any of these is just one line of math.

6. Uniform Representation

A matrix gives a single, uniform structure for any dataset: images, text embeddings, sensor logs, and tabular records can all be arranged (flattened where necessary) into an m × n matrix that any algorithm can consume.

7. Compatibility with Optimization Algorithms

Optimization techniques like gradient descent rely on matrix calculus (gradients and Jacobians, plus Hessians for second-order methods). Writing the model in matrix form lets these derivatives be computed efficiently for all parameters at once.

Summary Table:

Advantage              | Why It Matters in ML
Efficient computation  | Fast linear-algebra-backed training
Scalability            | Handles millions of rows / features
Parallelization        | Runs on GPUs in parallel
Vectorization          | Replaces loops with single ops
Easy transformations   | Scaling, PCA, normalization in one line
Uniform representation | Works for any data type
Optimization-friendly  | Gradients for gradient descent computed in matrix form

Section 3: Data Preprocessing in Machine Learning

PYQ: Discuss two techniques used in Data preprocessing Machine Learning. (2024, 2.5 marks)
PYQ: Describe the process of Data preprocessing in Machine Learning, focusing on Feature Normalization, Mean calculation, column standardization, and Covariance estimation. (2024, 15 marks)

3.1 Why Preprocess Data?

Real-world data is raw, messy, and inconsistent. Before feeding it to an ML algorithm, it must be cleaned and transformed. This is called Data Preprocessing.

Problems in Raw Data:

  • Missing values
  • Inconsistent formats
  • Features on very different scales (e.g., age: 25, salary: 50000)
  • Redundant features

Preprocessing Steps Covered Here:

  1. Computing the Mean of a data matrix
  2. Feature Normalization
  3. Column Standardization
  4. Computing the Covariance matrix

3.2 Mean of a Data Matrix

Definition:

The mean (average) of a data matrix is calculated column-wise (per feature). It gives the average value of each feature across all data points.

Formula:

For feature j:

          m
          Σ  xᵢⱼ
         i=1
μⱼ  =  ─────────
           m

Where:
  xᵢⱼ = value in row i, column j
  m   = number of data points (rows)
  μⱼ  = mean of feature j

Example:

Dataset (3 students, 2 features):

        ┌ 80   60 ┐
   X =  │ 90   70 │
        └ 70   80 ┘

Calculate mean of each column:

μ₁ (Math mean)    = (80 + 90 + 70) / 3 = 240 / 3 = 80

μ₂ (Physics mean) = (60 + 70 + 80) / 3 = 210 / 3 = 70

Mean vector:

μ = [80,  70]

Use of Mean:

  • Used for centering data (subtracting mean from each value).
  • Used in normalization, standardization, and PCA.
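
The column-wise mean and the centering step can be checked with a short sketch. The function name `column_means` is just an illustrative choice; the data is the 3 × 2 matrix from the example above.

```python
def column_means(matrix):
    """Mean of each column (feature) of a matrix stored as a list of rows."""
    m = len(matrix)
    # zip(*matrix) iterates over the columns of the matrix
    return [sum(column) / m for column in zip(*matrix)]


X = [
    [80, 60],
    [90, 70],
    [70, 80],
]

mu = column_means(X)

# Centering: subtract each feature's mean from every value in that column
centered = [[x - mu_j for x, mu_j in zip(row, mu)] for row in X]

print(mu)
print(centered)
```

After centering, every column has mean 0, which is the starting point for standardization, covariance, and PCA.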

3.3 Feature Normalization

Definition:

Feature Normalization (also called Min-Max Scaling) rescales each feature so that its values fall in the range [0, 1] (or sometimes [-1, 1]).

Why needed?

  • If Math scores range from 0 to 100 and Age ranges from 18 to 25, the algorithm may give more weight to Math just because the values are larger.
  • Normalization puts all features on the same scale so no feature dominates.

Formula (Min-Max Normalization):

           xᵢⱼ - min(xⱼ)
x'ᵢⱼ =  ───────────────────
         max(xⱼ) - min(xⱼ)

Where:
  xᵢⱼ     = original value
  min(xⱼ) = minimum value of feature j
  max(xⱼ) = maximum value of feature j
  x'ᵢⱼ    = normalized value (between 0 and 1)

Example:

Feature values (Math): [60, 70, 80, 90, 100]

min = 60,  max = 100

Normalize 60:  (60 - 60) / (100 - 60) = 0/40  = 0.00
Normalize 70:  (70 - 60) / (100 - 60) = 10/40 = 0.25
Normalize 80:  (80 - 60) / (100 - 60) = 20/40 = 0.50
Normalize 90:  (90 - 60) / (100 - 60) = 30/40 = 0.75
Normalize 100: (100 - 60)/ (100 - 60) = 40/40 = 1.00

Result: [0.00, 0.25, 0.50, 0.75, 1.00]

Properties:

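  • Output is always in [0, 1] for the values used to compute min and max; a later value outside that range maps outside [0, 1].
  • Sensitive to outliers (extreme values distort the range).
  • Undefined when max = min (a constant feature), because the denominator becomes zero.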
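
A minimal sketch of min-max normalization for one feature follows. Mapping a constant feature to 0.0 is an assumption made here to avoid dividing by zero; it is not part of the formula itself.

```python
def min_max_normalize(values):
    """Rescale one feature (a list of numbers) into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature is mapped to 0.0,
        # since (max - min) would be zero in the formula.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


math_scores = [60, 70, 80, 90, 100]
normalized = min_max_normalize(math_scores)
print(normalized)
```

This reproduces the hand calculation: the smallest score maps to 0, the largest to 1, and the rest are spaced evenly in between.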

3.4 Column Standardization (Z-Score Normalization)

Definition:

Column Standardization (Z-Score Standardization) transforms each feature so that it has a mean of 0 and a standard deviation of 1.

Why used instead of normalization?

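  • Min-max normalization is very sensitive to outliers, because a single extreme value sets the min or the max.
  • Standardization is less affected: an outlier shifts the mean and standard deviation but does not squeeze all other values into a narrow band. It is not fully robust, since μ and σ are themselves influenced by extreme values.
  • Commonly applied before scale-sensitive algorithms: PCA, SVM, KNN, Logistic Regression.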

Formulas:

Step 1: Calculate Mean (μⱼ):

         Σ xᵢⱼ
μⱼ =   ─────────
           m

Step 2: Calculate Standard Deviation (σⱼ):

σⱼ = √[ (1/m) · Σᵢ₌₁ᵐ (xᵢⱼ - μⱼ)² ]

Step 3: Standardize:

         xᵢⱼ - μⱼ
z'ᵢⱼ =  ──────────
            σⱼ

Example:

Feature (Math scores): [60, 70, 80, 90, 100]

Mean:  μ = (60+70+80+90+100) / 5 = 400 / 5 = 80

Variance:
  σ² = [(60-80)² + (70-80)² + (80-80)² + (90-80)² + (100-80)²] / 5
     = [400 + 100 + 0 + 100 + 400] / 5
     = 1000 / 5
     = 200

Std Dev:  σ = √200 ≈ 14.14

Standardize:
  z(60)  = (60 - 80)  / 14.14 = -20/14.14 ≈ -1.41
  z(70)  = (70 - 80)  / 14.14 = -10/14.14 ≈ -0.71
  z(80)  = (80 - 80)  / 14.14 =   0/14.14 =  0.00
  z(90)  = (90 - 80)  / 14.14 =  10/14.14 ≈ +0.71
  z(100) = (100 - 80) / 14.14 =  20/14.14 ≈ +1.41

Result: [-1.41, -0.71, 0.00, +0.71, +1.41]

Properties:

  • Mean of standardized column = 0.
  • Standard deviation of standardized column = 1.
  • Values can be negative (unlike normalization).
  • Less sensitive to outliers than min-max normalization, but not immune (μ and σ both depend on extreme values).
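
The three standardization steps can be sketched as below. The sketch divides by m (population standard deviation), matching the formula in these notes; raising an error for a constant feature is an assumption of this example.

```python
import math


def standardize(values):
    """Z-score one feature: subtract the mean, divide by the standard deviation."""
    m = len(values)
    mu = sum(values) / m                                   # Step 1: mean
    variance = sum((v - mu) ** 2 for v in values) / m      # divide by m, as in the notes
    sigma = math.sqrt(variance)                            # Step 2: standard deviation
    if sigma == 0:
        # Assumption: a constant feature cannot be standardized
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]              # Step 3: z-scores


math_scores = [60, 70, 80, 90, 100]
z = standardize(math_scores)
print([round(v, 2) for v in z])
```

The rounded output matches the hand-worked result, and the standardized values average to 0 with a standard deviation of 1.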

Normalization vs Standardization:

                    | Normalization               | Standardization
Range               | [0, 1]                      | (-∞, +∞)
Mean                | Not fixed                   | Always 0
Std Dev             | Not fixed                   | Always 1
Outlier Sensitivity | High                        | Lower (μ and σ are still affected)
Formula             | (x - min)/(max - min)       | (x - μ)/σ
Use When            | Features need bounded output | Features need mean=0, σ=1

3.5 Covariance of a Data Matrix

Definition:

Covariance measures how much two features change together. It tells us the relationship (direction) between two variables.

Interpretation:

Covariance | Meaning
Positive   | Both features increase together
Negative   | One increases, other decreases
Zero       | No linear relationship between them

Formula (Covariance between feature j and feature k):

            1   m
Cov(j,k) = ─── Σ  (xᵢⱼ - μⱼ)(xᵢₖ - μₖ)
            m  i=1

Where:
  xᵢⱼ = value of feature j in sample i
  μⱼ  = mean of feature j
  m   = number of samples

Note: Some formulas use (m-1) in the denominator for an unbiased estimate.


3.6 Covariance Matrix

When a dataset has n features, the Covariance Matrix is an n × n matrix where element (j, k) is the covariance between feature j and feature k.

Structure:

           F1        F2        F3
        ┌                          ┐
  F1 →  │Cov(1,1) Cov(1,2) Cov(1,3)│
  F2 →  │Cov(2,1) Cov(2,2) Cov(2,3)│
  F3 →  │Cov(3,1) Cov(3,2) Cov(3,3)│
        └                          ┘

Diagonal elements: Cov(j,j) = Variance of feature j
Off-diagonal elements: Covariance between two different features

Properties:

  • Covariance matrix is always symmetric: Cov(j,k) = Cov(k,j)
  • Diagonal = variance of each feature
  • Off-diagonal = covariance between pairs of features

3.7 Worked Example: Covariance Matrix

Dataset (3 samples, 2 features):

        F1   F2
        ┌        ┐
  S1 →  │ 2    4 │
  S2 →  │ 4    6 │
  S3 →  │ 6    8 │
        └        ┘

Step 1: Calculate means:

μ₁ = (2 + 4 + 6) / 3 = 12/3 = 4
μ₂ = (4 + 6 + 8) / 3 = 18/3 = 6

Step 2: Center the data (subtract mean):

        F1-μ₁   F2-μ₂
  S1:   2-4=-2  4-6=-2
  S2:   4-4= 0  6-6= 0
  S3:   6-4=+2  8-6=+2

Step 3: Calculate covariances:

Cov(F1,F1) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = [4 + 0 + 4] / 3 = 8/3 ≈ 2.67

Cov(F2,F2) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = 8/3 ≈ 2.67

Cov(F1,F2) = [(-2)(-2) + (0)(0) + (2)(2)] / 3
           = [4 + 0 + 4] / 3 = 8/3 ≈ 2.67

Covariance Matrix:

        ┌ 2.67   2.67 ┐
  Σ =   └ 2.67   2.67 ┘

Interpretation: Cov(F1,F2) = 2.67 > 0 → F1 and F2 increase together (positive relationship).
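
The same covariance matrix can be computed with a short sketch. It divides by m, as in the formula used in these notes (not m-1); the function name `covariance_matrix` is an illustrative choice.

```python
def covariance_matrix(matrix):
    """n x n covariance matrix of an m x n data matrix (divides by m)."""
    m = len(matrix)
    n = len(matrix[0])
    # Step 1: column means
    mu = [sum(column) / m for column in zip(*matrix)]
    # Step 2: center the data
    centered = [[x - mu_j for x, mu_j in zip(row, mu)] for row in matrix]
    # Step 3: average the products of centered values for every feature pair
    cov = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            cov[j][k] = sum(row[j] * row[k] for row in centered) / m
    return cov


X = [
    [2, 4],
    [4, 6],
    [6, 8],
]

S = covariance_matrix(X)
print([[round(value, 2) for value in row] for row in S])
```

Every entry comes out as 8/3, about 2.67, and the matrix is symmetric, as expected.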


Section 4: Principal Component Analysis (PCA)

PYQ: Explain PCA with its advantages and disadvantages. (2022, 15 marks)
PYQ: What is PCA? Write down steps of a PCA algorithm with example. (2023, 15 marks)
PYQ: What is PCA, and how does it help with dimensionality reduction? (2024, 2.5 marks)

4.1 What is PCA?

Definition:

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms the original features into a new set of features called Principal Components (PCs). These components:

  • Are uncorrelated with each other.
  • Are ordered so that the first PC captures the most variance, the second captures the next most, and so on.

The goal is to keep the most informative components and discard the rest, reducing dimensionality without losing much information.

Simple Analogy:

Imagine looking at a 3D object from different angles. From some angles, you see the object's full shape clearly. PCA finds the best angle to view the data so that as much information as possible is visible in fewer dimensions.


4.2 Key Concepts in PCA

Term                     | Meaning
Principal Components (PCs) | New axes in the transformed space
Eigenvectors             | Direction of each principal component
Eigenvalues              | Amount of variance captured by each PC
Variance                 | How spread out the data is along a direction
Explained Variance Ratio | % of total variance captured by each PC

4.3 Intuition Behind PCA

Why eigenvectors and eigenvalues?

  • Eigenvectors point in the directions of maximum variance in the data.
  • Eigenvalues tell how much variance each eigenvector captures.
  • The first eigenvector (PC1) captures the most variance.
  • The second eigenvector (PC2) is perpendicular to PC1 and captures the next most.

Diagram:

  PC2 (less variance)
   ↑
   │         ● ●
   │       ●   ●
   │         ●   ●
   │       ●   ●
   └───────────────────→ PC1 (most variance)

Data is elongated along PC1.
If we keep only PC1, we preserve most information.

4.4 Steps of PCA

Step-by-Step Process:

   ┌──────────────────────────────┐
   │ Step 1: Standardize the Data │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 2: Compute Covariance   │
   │         Matrix               │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 3: Compute Eigenvalues  │
   │         and Eigenvectors     │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 4: Sort by Eigenvalue   │
   │         (Descending)         │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 5: Select Top k         │
   │         Principal Components │
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │ Step 6: Project Data onto    │
   │         New Feature Space    │
   └──────────────────────────────┘

4.5 Detailed Explanation of Each Step

Step 1: Standardize the Data

  • Apply column standardization (mean=0, σ=1) to all features.
  • This ensures no feature dominates due to its scale.
         xᵢⱼ - μⱼ
z'ᵢⱼ =  ──────────
            σⱼ

Step 2: Compute the Covariance Matrix

  • Calculate the n × n covariance matrix of the standardized data.
        1
Σ  =   ───  Xᵀ X     (after mean-centering)
        m
  • Diagonal: variance of each feature.
  • Off-diagonal: how features are correlated.

Step 3: Compute Eigenvalues and Eigenvectors

  • Solve the characteristic equation:
det(Σ - λI) = 0

Where:
  λ = eigenvalue
  I = identity matrix
  • For each eigenvalue λ, solve (Σ - λI)v = 0 for eigenvector v.

  • Eigenvalue → amount of variance the component explains.

  • Eigenvector → direction of the component.
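
For a 2 × 2 symmetric covariance matrix the characteristic equation is a quadratic, so both eigenpairs have a closed form. The sketch below is only an illustration for that special case; the function name `eig_symmetric_2x2` and the sample matrix [[1, 1], [1, 1]] are choices made here. Larger matrices would normally go to a numerical library routine.

```python
import math


def eig_symmetric_2x2(a, b, d):
    """Eigenpairs of the symmetric matrix [[a, b], [b, d]], largest eigenvalue first."""
    if b == 0:
        # Already diagonal: eigenvalues are a and d, eigenvectors are the axes
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda pair: pair[0], reverse=True)
    # det(S - lambda*I) = lambda^2 - (a + d)*lambda + (a*d - b^2) = 0
    mean = (a + d) / 2
    radius = math.sqrt(((a - d) / 2) ** 2 + b ** 2)
    pairs = []
    for lam in (mean + radius, mean - radius):
        # (S - lambda*I) v = 0 is satisfied by v = (b, lambda - a)
        vx, vy = b, lam - a
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))   # unit-length eigenvector
    return pairs


(lam1, v1), (lam2, v2) = eig_symmetric_2x2(1.0, 1.0, 1.0)
print(lam1, [round(c, 2) for c in v1])
print(lam2, [round(c, 2) for c in v2])
```

For this matrix the eigenvalues are 2 and 0, with directions along [0.71, 0.71] and [0.71, -0.71]. An eigenvector's sign is arbitrary, so a library may return the negated vector.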


Step 4: Sort by Eigenvalue (Descending)

  • Sort eigenvalues from largest to smallest.
  • The corresponding eigenvectors are sorted in the same order.
λ₁ ≥ λ₂ ≥ λ₃ ≥ ... ≥ λₙ
↓    ↓    ↓
PC1  PC2  PC3  ...  (ordered by importance)
  • PC1 explains the most variance.
  • PC2 explains the second most, and so on.

Step 5: Select Top k Principal Components

  • Decide how many components k to keep.
  • Common rule: keep enough PCs to explain ≥ 95% of total variance.
                        Σᵢ₌₁ᵏ λᵢ
Explained Variance  =  ──────────  × 100%
                        Σᵢ₌₁ⁿ λᵢ

Example:

Eigenvalues: λ₁ = 5.0,  λ₂ = 3.0,  λ₃ = 1.5,  λ₄ = 0.5
Total = 10.0

PC1 explains: 5.0/10.0 = 50%
PC2 explains: 3.0/10.0 = 30%
PC1+PC2 combined: 80%
PC1+PC2+PC3: 95%

→ Keep 3 components to explain 95% of variance.
→ Reduce from 4 features to 3 features.
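
Choosing k from the eigenvalues can be sketched as follows. The function name `components_needed` and the tiny tolerance used to guard against floating-point round-off are assumptions of this example; the 95% threshold is the rule of thumb quoted above.

```python
def components_needed(eigenvalues, threshold=0.95):
    """Smallest k whose top-k eigenvalues explain at least `threshold` of the variance."""
    ordered = sorted(eigenvalues, reverse=True)     # largest eigenvalue first
    total = sum(ordered)
    ratios = [lam / total for lam in ordered]       # explained variance ratio per PC
    cumulative = 0.0
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        # small tolerance so that a sum of exactly 95% is not lost to round-off
        if cumulative >= threshold - 1e-12:
            return k, ratios
    return len(ordered), ratios


k, ratios = components_needed([5.0, 3.0, 1.5, 0.5])
print(k, [round(r, 2) for r in ratios])
```

For the eigenvalues in the example the function returns k = 3, with the individual ratios 50%, 30%, 15% and 5%.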

Step 6: Project Data onto New Feature Space

  • Create a projection matrix W from the top k eigenvectors (one eigenvector per column).
  • Multiply the standardized data matrix by W:
Z  =  X  ×  W

Where:
  X = standardized data matrix (m × n)
  W = top k eigenvectors       (n × k)
  Z = projected data           (m × k)
  • Z is the reduced dataset with only k dimensions.
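
The projection is an ordinary matrix product of the standardized data with W. The sketch below borrows the rounded values from the worked example in 4.6 (m = 4, n = 2, k = 1); the function name `project` is an illustrative choice.

```python
def project(data, W):
    """Multiply an (m x n) data matrix by an (n x k) projection matrix, giving (m x k)."""
    n = len(W)
    k = len(W[0])
    return [[sum(row[j] * W[j][c] for j in range(n)) for c in range(k)] for row in data]


# Standardized data, rounded to two decimals as in the worked example of 4.6
X_std = [
    [-1.34, -1.34],
    [-0.45, -0.45],
    [0.45, 0.45],
    [1.34, 1.34],
]

# One column per kept eigenvector; here only PC1 = [0.71, 0.71]
W = [
    [0.71],
    [0.71],
]

Z = project(X_std, W)
print([[round(value, 2) for value in row] for row in Z])
```

Each sample is now described by a single number, its coordinate along PC1.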

4.6 PCA Example (Worked)

Dataset (4 samples, 2 features):

   F1   F2
    2    4
    4    6
    6    8
    8   10

Step 1: Standardize

μ₁ = 5,  σ₁ ≈ 2.24
μ₂ = 7,  σ₂ ≈ 2.24

Standardized:
   z₁    z₂
  -1.34  -1.34
  -0.45  -0.45
   0.45   0.45
   1.34   1.34

Step 2: Covariance Matrix

       ┌ 1.0  1.0 ┐
  Σ =  └ 1.0  1.0 ┘

Step 3: Eigenvalues

det(Σ - λI) = 0
(1-λ)² - 1 = 0
λ² - 2λ = 0
λ(λ - 2) = 0
λ₁ = 2,  λ₂ = 0

Step 4: Eigenvectors

For λ₁ = 2:  v₁ = [1/√2, 1/√2] = [0.71, 0.71]
For λ₂ = 0:  v₂ = [1/√2,-1/√2] = [0.71,-0.71]

Step 5: Select PC1 only (explains 100% of variance)

W = [0.71, 0.71]ᵀ

Step 6: Project

Z = Standardized data × W

Z₁ = (-1.34)(0.71) + (-1.34)(0.71) = -1.90
Z₂ = (-0.45)(0.71) + (-0.45)(0.71) = -0.64
Z₃ = ( 0.45)(0.71) + ( 0.45)(0.71) =  0.64
Z₄ = ( 1.34)(0.71) + ( 1.34)(0.71) =  1.90

Reduced dataset: [-1.90, -0.64, 0.64, 1.90]
(2 features → 1 feature, no information lost here)
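
The whole worked example can be reproduced end to end with the sketch below. One substitution is made: instead of solving det(Σ - λI) = 0 by hand, the top eigenvector is found by power iteration (repeatedly multiplying a vector by Σ and renormalising). The function names, the starting vector and the iteration count are assumptions of this example, and it assumes no feature is constant.

```python
import math


def standardize_columns(X):
    """Column-wise z-scores using the population standard deviation (divide by m)."""
    m = len(X)
    columns = list(zip(*X))
    mus = [sum(c) / m for c in columns]
    sigmas = [math.sqrt(sum((v - mu) ** 2 for v in c) / m) for c, mu in zip(columns, mus)]
    return [[(x - mu) / s for x, mu, s in zip(row, mus, sigmas)] for row in X]


def covariance(Z):
    """Covariance matrix (1/m) * Z^T Z of data that is already centered."""
    m, n = len(Z), len(Z[0])
    return [[sum(r[j] * r[k] for r in Z) / m for k in range(n)] for j in range(n)]


def mat_vec(S, v):
    """Product of a square matrix S with a vector v."""
    return [sum(s * x for s, x in zip(row, v)) for row in S]


def top_eigenpair(S, iterations=100):
    """Power iteration: approximates the largest eigenvalue and its unit eigenvector."""
    v = [1.0] + [0.0] * (len(S) - 1)   # assumption: start vector is not orthogonal to PC1
    for _ in range(iterations):
        w = mat_vec(S, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v^T S v (v has unit length)
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(S, v)))
    return eigenvalue, v


X = [[2, 4], [4, 6], [6, 8], [8, 10]]
Z_std = standardize_columns(X)      # Step 1
S = covariance(Z_std)               # Step 2
lam1, pc1 = top_eigenpair(S)        # Steps 3-5: keep only PC1
scores = [sum(z * w for z, w in zip(row, pc1)) for row in Z_std]   # Step 6

print(round(lam1, 2), [round(c, 2) for c in pc1])
print([round(s, 2) for s in scores])
```

Without intermediate rounding the projected values are about -1.90, -0.63, 0.63 and 1.90. The ±0.64 in the hand calculation comes from rounding the standardized values and the eigenvector to two decimals before multiplying.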

4.7 Advantages and Disadvantages of PCA

Advantages:

✅ Reduces training time and memory.
✅ Removes correlated (redundant) features.
✅ Reduces overfitting.
✅ Enables 2D/3D visualization of high-dimensional data.
✅ Noise reduction (low-variance components often = noise).

Disadvantages:

❌ Principal components are hard to interpret (not original features).
❌ Sensitive to outliers (which skew the covariance matrix).
❌ Assumes linear relationships between features.
❌ Requires standardized data (preprocessing needed).
❌ Information is lost (how much depends on k chosen).


4.8 When to Use PCA

  • When you have many features (high-dimensional data).
  • When many features are correlated with each other.
  • When you want to visualize data in 2D or 3D.
  • When training is too slow due to many features.
  • When model is overfitting due to too many features.

Quick Revision Points

Core Definitions:

  • Dimensionality = number of features in dataset.
  • Dimensionality Reduction = reduce features while keeping information.
  • Row Vector = 1 × n, represents one sample.
  • Column Vector = n × 1, represents one feature.
  • Dataset Matrix = m × n, with m samples and n features.

Preprocessing Formulas:

Technique       | Formula                           | Output
Mean            | μⱼ = Σxᵢⱼ / m                     | Average per feature
Normalization   | (x - min) / (max - min)           | [0, 1]
Standardization | (x - μ) / σ                       | Mean=0, σ=1
Covariance      | Cov(j,k) = Σ(xᵢⱼ-μⱼ)(xᵢₖ-μₖ)/m    | Direction of linear relationship

PCA: 6 Steps

  1. Standardize data
  2. Compute covariance matrix
  3. Compute eigenvalues & eigenvectors
  4. Sort by eigenvalue (descending)
  5. Select top k components
  6. Project data onto new space

Key PCA Facts:

  • Eigenvector = direction of PC.
  • Eigenvalue = variance explained by PC.
  • PC1 = most variance, PC2 = second most, etc.
  • PCs are always perpendicular (orthogonal) to each other.
  • Explained variance ratio = λᵢ / Σλ × 100%.

Expected Exam Questions

15-Mark Questions:

  1. Explain PCA with its advantages and disadvantages. (2022)
  2. Compare Feature Extraction and Feature Selection techniques. Explain how dimensionality can be reduced using subset selection procedure. (2022)
  3. What is PCA? Write down steps of a PCA algorithm with example. (2023)
  4. (a) What is Dimensionality reduction? (b) Write advantages and disadvantages of Dimensionality Reduction. (2023)
  5. What is a dataset? Explain with a suitable example. Demonstrate how to represent a dataset as a Matrix. Discuss the advantages of matrix representation in Machine Learning. (2024)
  6. Describe the process of Data preprocessing in Machine Learning, focusing on Feature Normalization, Mean calculation, column standardization, and Covariance estimation. (2024)

Short Answer Questions (2.5 marks):

  1. Define Dimensionality Reduction and why it is important in Machine Learning. (2024)
  2. Explain column vector. (2024)
  3. How do you represent a dataset as a Matrix? (2024)
  4. Discuss two techniques used in Data preprocessing Machine Learning. (2024)
  5. What is PCA, and how does it help with dimensionality reduction? (2024)

