WEBM Semester 7

Unit 2: Web Content Mining and Information Retrieval

Information retrieval models, text mining, web search, clustering, and classification

Author: Deepak Modi
Last Updated: 2025-06-15

Syllabus:

Information Retrieval Models, Web Search and IR, Text Mining, Latent Semantic Indexing, Web Spamming, Clustering and Classification of Web Pages, Information Extraction, Web Content Mining.


🎯 PYQ Analysis for Unit 2

High Priority Topics ⭐⭐⭐ (15 marks questions)

  1. Information Retrieval Models (2022-Feb, 2022-Jul, 2022-Dec, 2023, 2024-May, 2024-Dec)
  2. Latent Semantic Indexing (LSI) (2022-Feb - 15 marks)
  3. Clustering and Classification of Web Pages (2022-Feb, 2023, 2024-May)

Medium Priority Topics ⭐⭐ (7-8 marks)

  1. Web Content Mining (2022-Dec, 2023, 2024-Dec)
  2. Web Spamming (2022-Dec, 2023, 2024-May, 2024-Dec)
  3. IR Components & Issues (2022-Jul, 2022-Feb)

Short Answer Topics ⭐ (2.5-3 marks)

  1. Text Mining (2022-Jul, 2023, 2024-Dec)
  2. Information Retrieval (2023, 2024-May, 2024-Dec)
  3. Information Extraction (2022-Feb)
  4. Web Search (2024-Dec)

1. Information Retrieval (IR)

PYQ: What is information retrieval in web mining? (2022-Feb, 8 marks)
PYQ: Define information retrieval with the help of its architecture. (2022-Dec, 15 marks)
PYQ: Explain the components of information retrieval in detail. (2022-Jul, 15 marks)

1.1 What is Information Retrieval?

Information Retrieval (IR) is the process of obtaining relevant information from a large collection of documents based on user queries. It deals with the representation, storage, organization, and access to information items in a systematic way.

The goal of IR is to retrieve documents that are relevant to the user's information need while filtering out irrelevant ones. Unlike database systems that retrieve exact matches, IR systems work with unstructured or semi-structured data and provide ranked results based on relevance.

Key Characteristics:

  • Query-based search (users express needs as queries)
  • Relevance-focused (aims to return most relevant documents)
  • Ranking (documents ordered by relevance score)
  • Approximate matching (finds similar, not just exact matches)

1.2 IR System Architecture

┌─────────────────────────────────────────────────────────────┐
│                 INFORMATION RETRIEVAL SYSTEM                │
└─────────────────────────────────────────────────────────────┘
                              │
    ┌─────────────────────────┼─────────────────────────┐
    │                         │                         │
    ▼                         ▼                         ▼
┌─────────────┐         ┌───────────┐           ┌───────────┐
│  Document   │         │  Indexing │           │   Query   │
│ Collection  │         │  Module   │           │ Processor │
└──────┬──────┘         └─────┬─────┘           └─────┬─────┘
       │                      │                       │
       ▼                      ▼                       ▼
┌─────────────┐         ┌───────────┐           ┌───────────┐
│    Text     │         │  Inverted │           │   Query   │
│ Operations  │         │   Index   │           │ Expansion │
└──────┬──────┘         └─────┬─────┘           └─────┬─────┘
       │                      │                       │
       └──────────────────────┼───────────────────────┘
                              ▼
                    ┌─────────────────┐
                    │    Matching &   │
                    │     Ranking     │
                    └────────┬────────┘
                             ▼
                    ┌─────────────────┐
                    │  Ranked Results │
                    └─────────────────┘

1.3 Components of IR System

1. Document Collection

  • The set of documents to be searched
  • Can include web pages, PDFs, emails, articles
  • Stored in document repository

2. Text Operations (Preprocessing)

Before indexing, documents undergo preprocessing:

Operation          Description                      Example
──────────────────────────────────────────────────────────────────────────
Tokenization       Breaking text into words         "Hello World" → ["Hello", "World"]
Stop Word Removal  Removing common words            Remove "the", "is", "and"
Stemming           Reducing to root form            "running", "runs" → "run"
Lemmatization      Dictionary-based normalization   "better" → "good"
Case Folding       Converting to lowercase          "HELLO" → "hello"
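These operations can be sketched in a few lines of Python. This is a toy illustration: real systems use libraries such as NLTK or spaCy, and the suffix stripping below is far cruder than a real Porter stemmer. The stop-word list and suffix rules are made up for the example.

```python
# Minimal preprocessing pipeline (illustrative only).
STOP_WORDS = {"the", "is", "and", "a", "of"}   # tiny example list

def crude_stem(word):
    # Naive suffix stripping -- a real stemmer (e.g. Porter) is far more careful.
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = text.lower().split()                        # tokenization + case folding
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [crude_stem(t) for t in tokens]               # stemming

print(preprocess("The running dogs and the cats"))  # ['run', 'dog', 'cat']
```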

3. Indexing Module

Creates searchable data structure from documents. Most common is Inverted Index:

Inverted Index Structure:
Term        → Document List (with positions)
─────────────────────────────────────────────
"algorithm" → [doc1:pos3, doc5:pos7, doc12:pos2]
"data"      → [doc1:pos1, doc2:pos4, doc7:pos1]
"mining"    → [doc2:pos5, doc5:pos8, doc7:pos2]

Advantages of Inverted Index:

  • Fast lookup for query terms
  • Efficient storage
  • Supports phrase queries with positions
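Building such an index is straightforward; here is a minimal Python sketch (the document IDs and texts are made up for illustration):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, position) postings."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

docs = {
    "doc1": "data mining algorithm",
    "doc2": "web data search",
}
index = build_inverted_index(docs)
print(index["data"])  # [('doc1', 0), ('doc2', 1)]
```

A query term lookup is then a single dictionary access, which is why the inverted index supports fast retrieval.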

4. Query Processor

  • Parses user queries
  • Applies same text operations as documents
  • Expands query with synonyms (optional)
  • Converts to internal representation

5. Matching and Ranking

  • Compares query representation with document index
  • Calculates relevance score for each document
  • Ranks documents by score
  • Returns top-k results

6. User Interface

  • Accepts user queries
  • Displays ranked results with snippets
  • Provides relevance feedback mechanism

1.4 IR Process Steps

  1. Query Formulation: User enters search query expressing information need
  2. Query Processing: Tokenize, remove stop words, stem, expand with synonyms
  3. Document Matching: Compare processed query with indexed documents
  4. Relevance Scoring: Calculate similarity score for each matching document
  5. Ranking: Sort documents by relevance score (highest first)
  6. Result Presentation: Display ranked results with titles and snippets
  7. Relevance Feedback: User marks relevant/irrelevant results to refine search

1.5 Issues in Information Retrieval

PYQ: Explain the issues in the process of information retrieval. (2022-Jul, 7 marks)

Issue                   Description                             Example
─────────────────────────────────────────────────────────────────────────────────────────
Synonymy                Different words, same meaning           "car" vs "automobile" vs "vehicle"
Polysemy                Same word, different meanings           "bank" (financial) vs "bank" (river)
Vocabulary Mismatch     Query terms differ from document terms  User: "laptop", Doc: "notebook computer"
Scalability             Handling billions of documents          Web-scale search engines
Query Ambiguity         Unclear user intent                     "apple" - fruit or company?
Relevance Subjectivity  Different users, different needs        Same query, different expectations
Ranking Quality         Balancing precision and recall          Too many or too few results

Solutions:

  • Thesaurus and query expansion for synonymy
  • Word sense disambiguation for polysemy
  • Distributed indexing for scalability
  • Query suggestion and clarification for ambiguity

2. Information Retrieval Models

PYQ: Discuss various information retrieval models in detail. (2022-Dec, 2023, 2024-May, 2024-Dec - 15 marks)
PYQ: Discuss various IR models and their role in web search. (2024-Dec, 15 marks)

2.1 Why IR Models?

IR models provide a theoretical framework for:

  • Representing documents and queries mathematically
  • Defining what "relevance" means between query and document
  • Computing relevance scores for ranking
  • Comparing different retrieval approaches

2.2 Classification of IR Models

Information Retrieval Models
│
├── Classical Models
│   ├── Boolean Model
│   ├── Vector Space Model (VSM)
│   └── Probabilistic Model
│
├── Alternative Models
│   ├── Language Models
│   └── Neural IR Models
│
└── Structured Models
    ├── XML Retrieval
    └── Semantic Models

2.3 Boolean Model

Concept: The simplest IR model based on set theory and Boolean algebra. Documents either match or don't match - no ranking.

Representation:

  • Documents represented as set of index terms
  • Queries are Boolean expressions using AND, OR, NOT
  • Result is binary: relevant (1) or not relevant (0)

Example:

Query: "data AND mining AND NOT web"

Document 1: {data, mining, algorithm, analysis}
  → Contains "data" ✓, Contains "mining" ✓, No "web" ✓
  → RELEVANT [1]

Document 2: {data, mining, web, search, crawling}
  → Contains "data" ✓, Contains "mining" ✓, Contains "web" ✗
  → NOT RELEVANT [0]

Document 3: {web, crawling, indexing, spider}
  → No "data" ✗
  → NOT RELEVANT [0]

Easy Example:

Suppose you have:

  • Doc1: "apple banana apple"
  • Doc2: "banana orange banana"
  • Query: "apple AND banana"

Boolean Model checks if both words are present:

  • Doc1: contains both "apple" and "banana" → Relevant
  • Doc2: only "banana" (no "apple") → Not Relevant
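The Boolean check itself is just a set-membership test; a minimal sketch:

```python
def matches(doc_terms, required, forbidden=()):
    """Boolean AND over required terms, AND NOT over forbidden terms."""
    return all(t in doc_terms for t in required) and \
           not any(t in doc_terms for t in forbidden)

doc1 = set("apple banana apple".split())
doc2 = set("banana orange banana".split())

print(matches(doc1, {"apple", "banana"}))  # True  -> Relevant
print(matches(doc2, {"apple", "banana"}))  # False -> Not Relevant
```

Note the output is strictly binary, which is exactly the model's limitation: there is no score to rank the matching documents.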

Advantages:

Advantage  Explanation
──────────────────────────────────────────────────────
Simple     Easy to understand and implement
Precise    Exact matching, no ambiguity in results
Efficient  Fast query processing using set operations
Formal     Based on well-defined Boolean logic

Disadvantages:

Disadvantage          Explanation
────────────────────────────────────────────────────────────
No Ranking            All matching documents treated equally
No Partial Match      Document must match all conditions
No Term Weights       All terms considered equally important
Complex Queries       Users must know Boolean syntax
Too Few/Many Results  Hard to control result set size

2.4 Vector Space Model (VSM)

Concept: Represents documents and queries as vectors in multi-dimensional space. Each dimension corresponds to a term. Similarity measured by cosine of angle between vectors.

Representation:

  • Document d = (w₁, w₂, w₃, ..., wₙ) where wᵢ is the weight of term i
  • Query q = (w₁, w₂, w₃, ..., wₙ)
  • Similarity = cosine of angle between d and q

Vector Space Visualization (2D simplified):
│
│              ● doc1
│             /
│            / θ₁ (small angle = high similarity)
│           /
│    query ●────────────────────►
│           \
│            \ θ₂ (large angle = low similarity)
│             \
│              ● doc2
│
└─────────────────────────────────────►

TF-IDF Weighting

The most common weighting scheme in VSM:

Component                         Formula                         Meaning
────────────────────────────────────────────────────────────────────────────────────────────
TF (Term Frequency)               tf(t,d) = count of t in d       How often term appears in document
IDF (Inverse Document Frequency)  idf(t) = log(N/df(t))           How rare the term is across all documents
TF-IDF                            tf-idf(t,d) = tf(t,d) × idf(t)  Combined importance weight

Why TF-IDF?

  • TF: Terms appearing more often in a document are more important for that document
  • IDF: Terms appearing in fewer documents are more discriminative
  • Common words like "the", "is" have low IDF (appear everywhere)
  • Rare domain terms have high IDF (more informative)

Cosine Similarity Formula

$$ \cos\theta = \frac{\mathbf{d}\cdot\mathbf{q}}{\|\mathbf{d}\|\,\|\mathbf{q}\|} = \frac{\sum_{i} d_i q_i}{\sqrt{\sum_{i} d_i^2}\,\sqrt{\sum_{i} q_i^2}} $$

Example Calculation:

Document: "data mining data analysis algorithms"
Query: "data mining"

Step 1: Calculate TF
Terms:      data  mining  analysis  algorithms
Doc TF:      2      1        1          1
Query TF:    1      1        0          0

Step 2: Assume IDF values
IDF:        1.2    2.0      1.8        2.5

Step 3: Calculate TF-IDF
Doc:    [2×1.2, 1×2.0, 1×1.8, 1×2.5] = [2.4, 2.0, 1.8, 2.5]
Query:  [1×1.2, 1×2.0, 0, 0]         = [1.2, 2.0, 0, 0]

Step 4: Cosine Similarity
Dot product = 2.4×1.2 + 2.0×2.0 + 0 + 0 = 2.88 + 4.0 = 6.88
||Doc|| = √(2.4² + 2.0² + 1.8² + 2.5²) = √(5.76+4+3.24+6.25) = √19.25 = 4.39
||Query|| = √(1.2² + 2.0²) = √(1.44+4) = √5.44 = 2.33

Similarity = 6.88 / (4.39 × 2.33) = 6.88 / 10.23 = 0.67
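The same calculation in Python, using the assumed IDF values from Step 2:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# TF × IDF weights from Step 3 (terms: data, mining, analysis, algorithms)
doc   = [2 * 1.2, 1 * 2.0, 1 * 1.8, 1 * 2.5]
query = [1 * 1.2, 1 * 2.0, 0.0, 0.0]

print(round(cosine(doc, query), 2))  # 0.67
```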

Easy Example:

Suppose you have:

  • Doc1: "apple banana apple"
  • Doc2: "banana orange banana"
  • Query: "apple banana"

Count term frequency (TF):

  • Doc1: apple=2, banana=1, orange=0
  • Doc2: apple=0, banana=2, orange=1
  • Query: apple=1, banana=1, orange=0

Calculate cosine similarity (ignore IDF for simplicity):

  • Doc1 vector: [2,1,0], Query: [1,1,0]
  • Cosine similarity = (2×1 + 1×1 + 0×0) / (√(2²+1²) × √(1²+1²)) = (2+1) / (√5 × √2) ≈ 3/3.16 ≈ 0.95
  • Doc2 vector: [0,2,1], Query: [1,1,0]
  • Cosine similarity = (0×1 + 2×1 + 1×0) / (√(0²+2²+1²) × √(1²+1²)) = (0+2+0) / (√5 × √2) ≈ 2/3.16 ≈ 0.63
  • Doc1 is ranked higher.

Advantages:

Advantage         Explanation
───────────────────────────────────────────────────────────────
Partial Matching  Documents can partially match query
Ranking           Documents ranked by similarity score
Term Weighting    Important terms weighted higher (TF-IDF)
Intuitive         Geometric interpretation easy to understand

Disadvantages:

Disadvantage             Explanation
─────────────────────────────────────────────────────────
Independence Assumption  Assumes terms are independent
No Semantics             Ignores word meaning and context
Bag of Words             Ignores word order and structure
High Dimensionality      Vocabulary size = dimensions

2.5 Probabilistic Model

Concept: Based on probability theory. Estimates the probability that a document is relevant given a query.

Core Idea:

  • P(R|d,q) = Probability that document d is relevant to query q
  • Rank documents by probability of relevance
  • Uses Bayes' theorem for calculation

Binary Independence Model (BIM):

Assumes terms are independent and binary (present/absent).

                P(R|d,q)
Ranking by: ─────────────
                P(R̄|d,q)

Using Bayes:
         P(d|R) × P(R)
       = ─────────────────
         P(d|R̄) × P(R̄)

Easy Example:

Suppose you have:

  • Doc1: "apple banana apple"
  • Doc2: "banana orange banana"
  • Query: "apple banana"

Probabilistic Model tries to estimate: "How likely is each document relevant to the query?" If you know from past data that documents with both "apple" and "banana" are usually relevant, Doc1 will get a higher probability. If you have no feedback, it may use assumptions or initial guesses.

Advantages:

  • Strong theoretical foundation in probability
  • Natural incorporation of relevance feedback
  • Principled ranking approach

Disadvantages:

  • Term independence assumption (unrealistic)
  • Needs relevance judgments for training
  • More complex than VSM

2.6 Language Models

Concept: Models the probability of generating a query from a document's language model. Each document defines a probability distribution over terms.

Query Likelihood Model:

P(q|d) = ∏ P(t|d)  for all terms t in query
         t∈q

Rank by: P(q|d) - probability of generating query q from document d

Easy Example:

Suppose you have:

  • Doc1: "apple banana apple"
  • Doc2: "banana orange banana"
  • Query: "apple banana"

For each document, calculate the probability of generating the query:

  • Doc1: P(apple)=2/3, P(banana)=1/3 → P(query|Doc1) = (2/3) × (1/3) = 2/9
  • Doc2: P(apple)=0, P(banana)=2/3 → P(query|Doc2) = 0 × (2/3) = 0

So Doc1 is ranked higher because it is more likely to generate the query.

Smoothing: Handles zero probabilities for unseen terms using techniques like Jelinek-Mercer or Dirichlet smoothing.
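The query-likelihood idea with Jelinek-Mercer smoothing can be sketched as follows. This is a toy illustration on the running example; the mixing weight λ=0.5 is an arbitrary choice, and real systems work in log space to avoid underflow.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Jelinek-Mercer smoothed query likelihood: mixes the document's
    term distribution with the whole collection's distribution, so
    unseen query terms no longer zero out the score."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    p = 1.0
    for t in query:
        p_doc = doc_tf[t] / len(doc)
        p_coll = coll_tf[t] / len(collection)
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

doc1 = "apple banana apple".split()
doc2 = "banana orange banana".split()
collection = doc1 + doc2
query = "apple banana".split()

# Doc1 still wins, but Doc2 now gets a small nonzero score thanks to smoothing.
print(query_likelihood(query, doc1, collection) >
      query_likelihood(query, doc2, collection))  # True
```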

Advantages:

Advantage                  Explanation
───────────────────────────────────────────────────────────────
Probabilistic Foundation   Based on solid statistical principles
Handles Term Dependencies  Can model term co-occurrences
Incorporates Smoothing     Deals with unseen terms effectively

Disadvantages:

Disadvantage               Explanation
────────────────────────────────────────────────────────────────
Computationally Intensive  More complex calculations
Requires Large Data        Needs good estimates of term probabilities
Sensitive to Smoothing     Choice of smoothing affects performance

2.7 Comparison of IR Models

Aspect       Boolean          Vector Space    Probabilistic  Language
──────────────────────────────────────────────────────────────────────
Matching     Exact            Partial         Probabilistic  Probabilistic
Ranking      No               Yes             Yes            Yes
Weights      No               TF-IDF          Probabilistic  Frequency
Foundation   Set Theory       Linear Algebra  Probability    Probability
Complexity   Low              Medium          High           High
Performance  Low              Good            Good           Excellent
Best For     Precise queries  General search  Research       Modern search

2.8 Applications of IR Models

  • Boolean Model

    • Library/catalog search, legal document retrieval, structured DB queries, command-line/file-system search, precise filter rules (email filtering).
  • Vector Space Model (VSM / TF‑IDF + Cosine)

    • General web & enterprise search ranking, document similarity and recommendation, document clustering, plagiarism detection, relevance scoring for search UIs.
  • Probabilistic Models (BIM / BM25 family)

    • Ad-hoc retrieval with relevance feedback, classical search engines (BM25 ranking), personalized search ranking, focused retrieval in legal/medical domains.
  • Language Models (Query‑Likelihood, LMIR)

    • Modern ranking/scoring (query likelihood), query suggestion/autocomplete, spoken-query handling, passage retrieval and ranking in search engines.
  • Latent Semantic Indexing (LSI/LSA)

    • Semantic search / query expansion, topic-based recommendation, document clustering, cross-language retrieval, plagiarism/near-duplicate detection.
  • Neural IR Models (Embedding & Deep Models, e.g., BERT)

    • Semantic / contextual search, passage reranking, question answering, conversational search, dense retrieval for low-latency ranking, recommendation and personalization.
  • Structured / Semantic Models (XML, Ontologies)

    • XML/HTML element-aware retrieval, semantic search over knowledge graphs, enterprise data integration, QA over structured documents.

3. Web Search and IR

PYQ: Differentiate between information retrieval and web search. (2022-Jul, 8 marks)
PYQ: What are the searching techniques commonly used in web search? (2022-Feb, 2 marks)

3.1 Web Search vs Traditional IR

Web Search is a specialized form of Information Retrieval focused on searching the World Wide Web. It has unique challenges and characteristics compared to traditional IR systems.

Aspect     Traditional IR         Web Search
───────────────────────────────────────────────────────────────
Scale      Thousands to millions  Billions of pages
Data Type  Plain text documents   HTML with links, multimedia
Quality    Curated, reliable      Variable, includes spam
Dynamism   Relatively static      Constantly changing
Users      Expert users           General public
Queries    Detailed, specific     Short (2-3 words average)
Authority  All documents equal    Link-based authority (PageRank)
Spam       Minimal concern        Major challenge
Structure  Flat collection        Hyperlinked graph

3.2 Web Search Techniques

Technique            Description
─────────────────────────────────────────────────────────
Keyword Search       Match query terms with page content
Boolean Search       Use AND, OR, NOT operators
Phrase Search        Exact phrase matching ("web mining")
Link Analysis        PageRank, HITS for authority
Personalized Search  Results based on user history
Semantic Search      Understanding query intent

4. Text Mining

PYQ: Write short notes on Text mining. (2022-Jul, 2023, 2024-Dec - 2.5-3 marks)

4.1 Definition

Text Mining, also known as Text Analytics, is the process of extracting meaningful information and patterns from unstructured text data using natural language processing (NLP), machine learning, and statistical techniques. It involves transforming unstructured text into structured data that can be analyzed to discover insights, trends, and relationships that would be difficult to identify through manual reading.

Advantages of Text Mining:

Advantage              Explanation
───────────────────────────────────────────────────────────────────────
Scalability            Can process large volumes of text data
Automation             Automates information extraction
Insight Discovery      Uncovers hidden patterns and relationships
Efficiency             Saves time compared to manual analysis
Data-Driven Decisions  Supports data-driven decision making
Versatility            Applicable across domains (healthcare, finance, etc.)

Disadvantages of Text Mining:

Disadvantage         Explanation
───────────────────────────────────────────────────────────────
Complexity           Requires expertise in NLP and ML
Ambiguity            Language ambiguity can lead to errors
Context Sensitivity  Meaning can change based on context
Data Quality         Poor quality text data affects results
Resource Intensive   Requires significant computational resources

Applications of Text Mining:

Application             Description
──────────────────────────────────────────────────────────────────────────────
Sentiment Analysis      Analyzing opinions in reviews, social media
Topic Modeling          Discovering topics in large text corpora
Spam Detection          Identifying spam emails or messages
Information Extraction  Extracting structured data from text (e.g., entities, relationships)
Text Classification     Categorizing documents into predefined classes (e.g., news articles)
Recommendation Systems  Suggesting content based on user preferences
Customer Support        Automating responses to common queries
Healthcare              Analyzing clinical notes, research papers
Finance                 Analyzing news, reports for market trends

4.2 Text Mining vs Data Mining

Aspect         Text Mining                       Data Mining
──────────────────────────────────────────────────────────────────
Data Type      Unstructured text                 Structured data
Source         Documents, web pages, emails      Databases, spreadsheets
Preprocessing  NLP required (tokenization, POS)  Standard data cleaning
Challenges     Ambiguity, context, language      Missing values, noise

4.3 Text Mining Process

Text Collection → Preprocessing → Feature Extraction → Analysis → Knowledge
     │                │                  │                │
     ▼                ▼                  ▼                ▼
 Gather docs    Tokenize, stem     TF-IDF, BOW      Classify,
 Web pages      Remove stopwords   Word embeddings  Cluster

4.4 Text Mining Techniques

Technique                       Description                Application
─────────────────────────────────────────────────────────────────────────
Classification                  Categorize documents       Spam detection
Clustering                      Group similar documents    Topic discovery
Sentiment Analysis              Detect opinions/emotions   Product reviews
Named Entity Recognition (NER)  Extract named entities     People, places, orgs
Summarization                   Create document summaries  News summarization

5. Latent Semantic Indexing (LSI)

PYQ: What is latent semantic indexing and where can it be applied? How does LSA work? (2022-Feb, 15 marks)
PYQ: Latent semantic indexing. (2022-Jul, 2023, 2024-May, 2024-Dec - 2.5-7.5 marks)

5.1 What is LSI?

Latent Semantic Indexing (LSI), also called Latent Semantic Analysis (LSA), is an IR technique that uses Singular Value Decomposition (SVD) to discover hidden (latent) semantic relationships between terms and documents.

It addresses the fundamental problem of vocabulary mismatch in traditional IR systems by mapping documents and queries into a reduced semantic space where similar concepts are close together, even if they use different words.

5.2 Problems LSI Solves

Problem              Description                    Example
────────────────────────────────────────────────────────────────────────
Synonymy             Different words, same meaning  "car" vs "automobile"
Polysemy             Same word, different meanings  "bank" (financial vs river)
Vocabulary Mismatch  Query differs from document    "laptop" vs "notebook"

LSI Solution: Maps documents and terms to a reduced semantic space where similar concepts are close together.

5.3 How LSI Works

Step 1: Create Term-Document Matrix

           Doc1  Doc2  Doc3  Doc4
data        2     1     0     3
mining      1     2     0     2
web         0     1     3     1
search      1     0     2     0

Here, each row represents a term, and each column represents a document. The values are term frequencies (TF).

Step 2: Apply Singular Value Decomposition (SVD)

A = U × Σ × Vᵀ

Where:
A = Original term-document matrix (m × n)
U = Term-concept matrix (m × r) - how terms relate to concepts
Σ = Diagonal matrix of singular values (r × r) - concept importance
V = Document-concept matrix (n × r) - how documents relate to concepts

┌─────┐     ┌───┐     ┌───┐     ┌─────┐ᵀ
│     │     │   │     │   │     │     │
│  A  │  =  │ U │  ×  │ Σ │  ×  │  V  │
│     │     │   │     │   │     │     │
└─────┘     └───┘     └───┘     └─────┘
(m × n)    (m × r)   (r × r)   (r × n)

Here, r = rank of matrix A (number of concepts) and ᵀ denotes transpose.

Step 3: Dimensionality Reduction

  • Keep only the top k singular values, where k is the number of concepts retained (k << r, typically k = 100-300)
  • Creates approximate matrix Aₖ = Uₖ × Σₖ × Vₖᵀ
  • Removes noise, captures main semantic patterns

Step 4: Query in Reduced Space

  • Transform query vector to concept space
  • Compare with document vectors using cosine similarity
  • Return ranked results
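Steps 1-3 can be reproduced with NumPy's SVD routine. This is a minimal sketch using the Step 1 matrix; the choice k=2 is arbitrary and only for illustration.

```python
import numpy as np

# Term-document matrix from Step 1 (rows: data, mining, web, search)
A = np.array([[2, 1, 0, 3],
              [1, 2, 0, 2],
              [0, 1, 3, 1],
              [1, 0, 2, 0]], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, singular values sorted largest first
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # keep the top-2 "concepts"
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # rank-k approximation of A

print(A_k.shape)   # same shape as A: (4, 4), but only 2 latent concepts
```

Queries would then be projected into the same k-dimensional concept space and compared to the columns of Vt[:k, :] by cosine similarity.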

5.4 LSI Example

Documents:
D1: "car automobile vehicle transport"
D2: "car motor drive engine"
D3: "truck vehicle engine motor"

After LSI with k=2 concepts:
Concept 1: "Motorized vehicles" (car, motor, engine, drive)
Concept 2: "Transportation" (vehicle, transport, automobile)

Query: "automobile"
- Maps to Concept 2
- Returns D1 (contains "automobile")
- Also returns D2, D3 (related to same concept!)
- Handles synonymy - finds related documents without exact match

5.5 LSI for SEO

StrategyDescription
Use related termsInclude synonyms and related concepts naturally
Comprehensive coverageCover topic thoroughly, not just main keyword
Natural writingAvoid keyword stuffing, write for humans
Semantic relevanceFocus on meaning, not just exact keywords

5.6 Applications of LSI

  • Search Engines: Improved relevance through semantic matching
  • Document Clustering: Grouping semantically similar documents
  • Plagiarism Detection: Finding semantically similar content
  • Cross-Language IR: Matching documents across languages
  • Recommendation Systems: Content-based recommendations

5.7 Advantages and Disadvantages

Advantages            Disadvantages
───────────────────────────────────────────────────
Handles synonymy      Computationally expensive (SVD)
Reduces noise         Concepts hard to interpret
Improves recall       Difficult to update incrementally
Language independent  Doesn't fully solve polysemy

6. Web Spamming

PYQ: Define web spamming. (2022-Feb, 1.875 marks)
PYQ: Write short notes on Web spamming. (2022-Dec, 2023, 2024-May, 2024-Dec - 2.5-5 marks)

6.1 Definition

Web Spamming (Spamdexing) is the deliberate manipulation of search engine indexes to achieve higher rankings for web pages that don't deserve them based on actual content or authority.

It involves using deceptive techniques to trick search engines into ranking pages higher than they should be, often for commercial gain or to drive traffic to websites.

6.2 Types of Web Spam

Web Spam
├── Content Spam
│   ├── Keyword stuffing : Overusing keywords unnaturally
│   ├── Hidden text/links : White text on white background
│   └── Doorway pages : Pages created solely for ranking, redirecting users
│
├── Link Spam
│   ├── Link farms : Networks of sites linking to each other
│   ├── Paid links : Buying/selling links for ranking
│   └── Comment/Forum spam : Posting links in comments/forums
│
└── Cloaking
    └── Different content for crawlers vs users

6.3 Spam Techniques

Technique         Description
──────────────────────────────────────────────────────────
Keyword Stuffing  Overusing keywords unnaturally
Hidden Text       White text on white background
Link Farms        Networks of sites linking to each other
Cloaking          Show different content to crawlers
Doorway Pages     Pages only for ranking, redirect users
Scraped Content   Copying content from other sites

6.4 Spam Detection Methods

Method            Approach
───────────────────────────────────────────────────
Content Analysis  Detect keyword density anomalies
Link Analysis     Identify link farm patterns
Machine Learning  Train classifiers on spam features
TrustRank         Propagate trust from seed sites
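As a taste of content analysis, keyword density is one crude stuffing signal; real detectors combine many content and link features, and the threshold for "anomalous" density varies. The example texts below are invented.

```python
def keyword_density(text, keyword):
    """Fraction of tokens equal to the keyword -- a crude stuffing signal."""
    tokens = text.lower().split()
    return tokens.count(keyword.lower()) / len(tokens)

spammy = "cheap phones cheap phones buy cheap phones cheap"
normal = "our store sells phones at competitive prices"

print(round(keyword_density(spammy, "cheap"), 2))  # 0.5
print(round(keyword_density(normal, "cheap"), 2))  # 0.0
```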

7. Clustering and Classification of Web Pages

PYQ: What is clustering? How is it different from classification? (2022-Feb, 7 marks)
PYQ: Classification of web pages. (2023, 7.5 marks)
PYQ: Clustering of web pages. (2024-May, 7.5 marks)

7.1 Classification vs Clustering

Aspect    Classification               Clustering
─────────────────────────────────────────────────────────────
Type      Supervised learning          Unsupervised learning
Labels    Predefined categories        No predefined labels
Training  Needs labeled training data  No training data needed
Goal      Assign to known category     Discover natural groups
Example   Spam vs Not Spam             Group similar news articles

7.2 Web Page Classification

Definition: Assigning web pages to predefined categories using supervised learning.

Process:

Labeled Training Data → Feature Extraction → Train Classifier → Classify New Pages

Features Used:

Feature Type  Examples
──────────────────────────────────────────
Content       Words, TF-IDF scores, topics
Structure     HTML tags, headings, lists
Link          Inlinks, outlinks, anchor text
URL           Domain name, path keywords

Algorithms: Naive Bayes, SVM (Support Vector Machines), Decision Trees, Neural Networks

Applications:

  • Topic categorization (Sports, News, Technology)
  • Spam detection
  • Sentiment classification
  • Content filtering
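A tiny multinomial Naive Bayes classifier over toy pages illustrates the supervised setup. The labels, training texts, and query are all invented for the example; real systems train on thousands of labeled pages with TF-IDF features.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_pages):
    """Multinomial Naive Bayes training with per-class term counts."""
    tf = defaultdict(Counter)   # class -> term counts
    prior = Counter()           # class -> number of training pages
    for text, label in labeled_pages:
        prior[label] += 1
        tf[label].update(text.split())
    vocab = {t for counts in tf.values() for t in counts}
    return tf, prior, vocab

def classify(text, tf, prior, vocab):
    """Pick the class with the highest log-probability (Laplace smoothing)."""
    total = sum(prior.values())
    best, best_lp = None, -math.inf
    for label in prior:
        n = sum(tf[label].values())
        lp = math.log(prior[label] / total)
        for t in text.split():
            lp += math.log((tf[label][t] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [("football match score goal", "sports"),
         ("election vote government policy", "news")]
tf, prior, vocab = train_nb(train)
print(classify("goal score today", tf, prior, vocab))  # sports
```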

7.3 Web Page Clustering

Definition: Grouping similar web pages without predefined categories using unsupervised learning.

Algorithms:

Algorithm     Description
───────────────────────────────────────────────────────────────────
K-Means       Partition into k clusters based on centroid distance
Hierarchical  Build tree of clusters (agglomerative/divisive)
DBSCAN        Density-based clustering
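A bare-bones k-means over toy term-frequency vectors shows the clustering loop. The data, k=2, and iteration count are illustrative; real web-page clustering works on sparse TF-IDF matrices with far more dimensions.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means on row vectors (toy version of web-page clustering)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest centroid (squared Euclidean distance)
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Toy term-frequency vectors: two "sports"-like pages, two "cooking"-like pages
X = np.array([[5, 1, 0], [4, 0, 1], [0, 5, 4], [1, 4, 5]], dtype=float)
labels = kmeans(X, k=2)
print(labels[0] == labels[1], labels[2] == labels[3])  # True True
```

No labels were given, yet the two natural groups are recovered, which is the defining property of clustering.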

Challenges:

Challenge            Description
──────────────────────────────────────────────────────────
High dimensionality  Many features (vocabulary size)
Sparse data          Most terms don't appear in most docs
Noise                Ads, navigation, boilerplate content
Scale                Billions of web pages

Applications:

  • Search result grouping
  • Duplicate detection
  • Topic discovery
  • Website organization

8. Information Extraction

PYQ: Discuss the term information extraction. (2022-Feb, 1.875 marks)

8.1 Definition

Information Extraction (IE) is the process of automatically extracting structured information from unstructured or semi-structured text. It involves identifying and extracting specific pieces of information such as names, dates, locations, relationships, and events from natural language text and organizing them into a structured format that can be stored in databases or used for further analysis.

8.2 IE Tasks

Task                            Description          Example
─────────────────────────────────────────────────────────────────────────
NER (Named Entity Recognition)  Find named entities  "Apple Inc." → Organization
Relation Extraction             Find relationships   "Jobs founded Apple"
Event Extraction                Find events          "Conference on Dec 5"

8.3 IE vs IR

IE (Information Extraction)             IR (Information Retrieval)
──────────────────────────────────────────────────────────────────────
Extracts structured data                Returns documents
Deep NLP processing                     Keyword matching
Specific information (entities, facts)  Relevant documents/content
Output: database, facts, triples        Output: ranked list of documents
Example: extract names, dates           Example: find articles on a topic
Used for knowledge base creation        Used for search and discovery

9. Web Content Mining

PYQ: Web content mining. (2022-Dec, 2023, 2024-Dec - 3-7.5 marks)

9.1 Definition

Web Content Mining is the process of extracting useful information from the content of web pages, including text, images, audio, and video. It focuses on discovering and extracting knowledge from the actual content displayed on web pages rather than the structure or usage patterns. This includes analyzing textual content, multimedia elements, and structured data embedded within web pages.

9.2 Types of Web Content

Type        Examples
─────────────────────────────────────────
Text        Articles, product descriptions
Multimedia  Images, videos, audio
Structured  Tables, lists, forms
Metadata    Title, description, keywords

9.3 Techniques

Technique                   Application
──────────────────────────────────────────────────────────────────
Text Mining                 Topic extraction, sentiment analysis
NLP                         Entity recognition, summarization
Image Mining                Object detection, classification
Structured Data Extraction  Table extraction, wrapper induction

9.4 Applications

  • Search Engines: Content indexing and retrieval
  • News Aggregation: Automatic news collection
  • Product Comparison: Extract product features
  • Knowledge Base Construction: Build structured knowledge
  • Opinion Mining: Extract user opinions from reviews

Summary Table: Unit 2 Key Concepts

Topic                  Key Points
───────────────────────────────────────────────────────────────────────────
Information Retrieval  Finding relevant documents from large collections
IR Models              Boolean (exact), VSM (TF-IDF), Probabilistic, Language
LSI                    SVD for semantic matching, handles synonymy
Web Spamming           Content spam, link spam, cloaking
Classification         Supervised learning, predefined categories
Clustering             Unsupervised learning, discover groups
Text Mining            Extract patterns from unstructured text
Web Content Mining     Mining page content (text, images, multimedia)

Expected Questions for Exam

15 Marks Questions

  1. Information Retrieval Models (all 4 models with comparison)
  2. Latent Semantic Indexing with example and applications
  3. Clustering and Classification of web pages

7-8 Marks Questions

  1. IR components and architecture
  2. Web Search vs Information Retrieval
  3. Web Spamming types and detection
  4. Web Content Mining

2.5-3 Marks Questions

  1. Define Information Retrieval / Text Mining / LSI
  2. Web Spamming
  3. Classification vs Clustering
  4. Information Extraction

These notes were compiled by Deepak Modi
Last updated: December 2025

Found an error or want to contribute?

This content is open-source and maintained by the community. Help us improve it!