WEBM | Semester 7

Unit 1: Data Mining Foundations & Web Mining Overview

Data mining concepts, web mining taxonomy, hypertext data, and web mining challenges

Author: Deepak Modi
Last Updated: 2025-06-15

Syllabus:

Data Mining Foundations: Basic concepts in Data Mining, Web mining versus Data mining, Discovering knowledge from Hypertext data.
An Overview of Web Mining: What is Web mining, Web mining taxonomy, Web mining subtasks, issues, challenges.


🎯 PYQ Analysis for Unit 1

High Priority Topics ⭐⭐⭐

  1. Web Mining vs Data Mining (2022-Feb, 2022-Dec, 2023, 2024-May, 2024-Dec)
  2. Discovering Knowledge from Hypertext Data (2022-Feb, 2023, 2024-May, 2024-Dec)
  3. Web Mining Issues & Challenges (2022-Feb, 2022-Jul, 2023, 2024-May, 2024-Dec)
  4. Web Mining Subtasks (2022-Feb, 2023)

Medium Priority Topics ⭐⭐

  1. Web Mining Taxonomy (2024-May, 2024-Dec)
  2. Data Mining Applications (2022-Jul, 2022-Dec)
  3. Social Impacts of Data Mining (2022-Jul)

Short Answer Topics ⭐

  1. Define Mining (2022-Feb)
  2. Hypertext Data (2023, 2024-May)

Section 1: Data Mining Foundations

1.1 Introduction to Data Mining

PYQ: Define mining. (2022-Feb, 1.875 marks)
PYQ: Define data mining. (2022-Jul, 3 marks)
PYQ: What is data mining? (2024-May, 15 marks)

What is Data Mining?

Data Mining is the process of discovering meaningful patterns, correlations, anomalies, and trends in large datasets using statistical, mathematical, and computational techniques. The term is often used interchangeably with Knowledge Discovery in Databases (KDD), although strictly speaking data mining is one step within the broader KDD process.

It involves analyzing data from different perspectives and summarizing it into useful information that can help organizations make better decisions, predict future trends, and gain competitive advantages.

Key Characteristics:

  • Automatic Discovery - Finds patterns without the patterns being specified in advance
  • Predictive - Predicts future trends based on historical data
  • Scalable - Works with large datasets (terabytes/petabytes)
  • Actionable - Produces insights that can be acted upon
  • Non-trivial - Discovers hidden, previously unknown patterns

Data Mining Process (KDD Process)

┌──────────────────────────────────────────┐
│  KNOWLEDGE DISCOVERY IN DATABASES (KDD)  │
└──────────────────────────────────────────┘
                  │
    ┌─────────────▼───────────────┐
    │        Data Selection       │
    └─────────────┬───────────────┘
    ┌─────────────▼───────────────┐
    │      Data Preprocessing     │
    └─────────────┬───────────────┘
    ┌─────────────▼───────────────┐
    │     Data Transformation     │
    └─────────────┬───────────────┘
    ┌─────────────▼───────────────┐
    │         Data Mining         │
    └─────────────┬───────────────┘
    ┌─────────────▼───────────────┐
    │     Pattern Evaluation      │
    └─────────────┬───────────────┘
          ┌───────▼───────┐
          │   KNOWLEDGE   │
          └───────────────┘

Steps in KDD Process:

  1. Data Selection: Identifying relevant data from databases
  2. Data Preprocessing: Cleaning, handling missing values, removing noise
  3. Data Transformation: Converting data into suitable format (normalization, aggregation)
  4. Data Mining: Applying algorithms to extract patterns
  5. Pattern Evaluation: Interpreting and validating discovered patterns
  6. Knowledge Presentation: Visualizing and presenting results
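
The six steps can be traced on a toy dataset. The sketch below is only an illustration: the `records` list, its field names, and the thresholds (0.5 for "high spender", 10 years for "interesting") are invented for the example, and the "mining" step is deliberately trivial.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
records = [
    {"customer": "A", "age": 25, "spend": 120.0, "region": "north"},
    {"customer": "B", "age": None, "spend": 80.0, "region": "south"},
    {"customer": "C", "age": 41, "spend": 300.0, "region": "north"},
    {"customer": "D", "age": 37, "spend": None, "region": "south"},
    {"customer": "E", "age": 52, "spend": 260.0, "region": "north"},
]

# 1. Data selection: keep only the attributes relevant to the question.
selected = [{"age": r["age"], "spend": r["spend"]} for r in records]

# 2. Data preprocessing: drop rows with missing values.
clean = [r for r in selected if r["age"] is not None and r["spend"] is not None]

# 3. Data transformation: min-max normalise spend into [0, 1].
lo = min(r["spend"] for r in clean)
hi = max(r["spend"] for r in clean)
for r in clean:
    r["spend_norm"] = (r["spend"] - lo) / (hi - lo)

# 4. Data mining: compare the mean age of high and low spenders.
high = [r["age"] for r in clean if r["spend_norm"] >= 0.5]
low = [r["age"] for r in clean if r["spend_norm"] < 0.5]
pattern = {"mean_age_high": mean(high), "mean_age_low": mean(low)}

# 5. Pattern evaluation: keep the pattern only if the gap is large enough.
is_interesting = abs(pattern["mean_age_high"] - pattern["mean_age_low"]) >= 10

# 6. Knowledge presentation: report the result.
print(pattern, is_interesting)
```

In practice each step is far more involved, and the process is iterative: a weak result at step 5 usually sends the analyst back to selection or preprocessing.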

Data Mining Techniques

Technique | Description | Example
Classification | Assigns items to predefined categories | Email spam detection
Clustering | Groups similar items together | Customer segmentation
Association Rules | Finds relationships between items | Market basket analysis
Regression | Predicts numerical values | Stock price prediction
Anomaly Detection | Identifies unusual patterns | Fraud detection
Sequential Pattern Mining | Finds patterns in sequences | Web clickstream analysis

Applications of Data Mining

PYQ: Explain how data mining is used in healthcare analysis. (2022-Jul, 8 marks)
PYQ: Application of data mining in healthcare. (2022-Dec, 8 marks)

Domain | Application | Example
Healthcare | Disease prediction, drug discovery | Predicting diabetes risk, identifying drug interactions
Banking & Finance | Fraud detection, credit scoring | Credit card fraud detection, loan approval
Retail | Market basket analysis, customer profiling | Product recommendations, inventory management
Telecommunications | Churn prediction, network optimization | Identifying customers likely to switch providers
Education | Student performance prediction | Identifying at-risk students
Manufacturing | Quality control, predictive maintenance | Detecting defective products

Data Mining in Healthcare (Detailed)

Applications:

  1. Disease Diagnosis & Prediction

    • Using patient symptoms and history to predict diseases
    • Early detection of cancer, diabetes, heart disease
    • Example: Predicting COVID-19 severity based on patient data
  2. Drug Discovery

    • Identifying potential drug candidates
    • Analyzing drug interactions and side effects
    • Reducing time and cost of clinical trials
  3. Treatment Effectiveness

    • Analyzing which treatments work best for specific conditions
    • Personalized medicine based on patient genetics
  4. Hospital Resource Management

    • Predicting patient admission rates
    • Optimizing bed allocation and staff scheduling
  5. Medical Image Analysis

    • Detecting tumors in X-rays, MRIs, CT scans
    • Using deep learning for image classification

Healthcare Data Mining Applications
│
├──► Diagnosis
│    ├──► Disease prediction
│    ├──► Risk assessment
│    └──► Symptom analysis
│
├──► Treatment
│    ├──► Drug effectiveness
│    ├──► Personalized medicine
│    └──► Treatment planning
│
├──► Operations
│    ├──► Resource allocation
│    ├──► Cost reduction
│    └──► Quality improvement
│
└──► Research
     ├──► Drug discovery
     ├──► Clinical trials
     └──► Epidemiology

Social Impacts of Data Mining

PYQ: What are the social impacts of data mining? (2022-Jul, 7 marks)

Positive Impacts:

Impact | Description
Improved Services | Better healthcare, personalized education
Crime Prevention | Predictive policing, fraud detection
Scientific Discovery | New drug discoveries, climate analysis
Economic Growth | Business optimization, market insights
Public Health | Disease outbreak prediction, health trends

Negative Impacts:

Impact | Description
Privacy Concerns | Personal data collection without consent
Discrimination | Biased algorithms affecting certain groups
Job Displacement | Automation replacing human workers
Surveillance | Mass monitoring of citizens
Data Breaches | Sensitive information exposure

Ethical Considerations:

  • Transparency: Users should know how their data is used
  • Consent: Explicit permission before data collection
  • Anonymization: Protecting individual identities
  • Fairness: Avoiding discriminatory outcomes
  • Accountability: Clear responsibility for data misuse

Spatial Data Mining

PYQ: Define data mining. Discuss spatial data mining. (2022-Dec, 15 marks)

Spatial Data Mining is the process of discovering interesting and useful patterns from large spatial databases (geographic data). It involves analyzing spatial relationships, distributions, and trends in data that has a geographic or spatial component.

Characteristics of Spatial Data:

  • Contains location information (coordinates, addresses)
  • Has spatial relationships (distance, adjacency, containment)
  • Often combined with non-spatial attributes

Spatial Data Mining Techniques:

Technique | Description | Example
Spatial Clustering | Grouping nearby objects | Identifying crime hotspots
Spatial Classification | Categorizing based on location | Land use classification
Spatial Association | Finding co-location patterns | Stores often located near gas stations
Spatial Outlier Detection | Finding unusual spatial patterns | Detecting unusual traffic patterns

Applications:

  • Urban Planning: Analyzing traffic patterns, zoning decisions
  • Environmental Science: Climate change analysis, pollution monitoring
  • Epidemiology: Disease spread analysis, healthcare accessibility
  • Marketing: Location-based advertising, store placement
  • Agriculture: Crop yield prediction, soil analysis

1.2 Basic Concepts in Data Mining

Types of Data

Data Types in Data Mining
│
├──► Structured Data
│    ├──► Relational databases
│    ├──► Spreadsheets
│    └──► CSV files
│
├──► Semi-Structured Data
│    ├──► XML documents
│    ├──► JSON files
│    └──► Email
│
├──► Unstructured Data
│    ├──► Text documents
│    ├──► Images/Videos
│    ├──► Audio files
│    └──► Web pages
│
└──► Time-Series Data
     ├──► Stock prices
     ├──► Sensor data
     └──► Weather data

Data Mining Tasks

Task Type | Description | Algorithms
Predictive | Predict unknown values | Regression, Classification, Neural Networks
Descriptive | Find patterns in data | Clustering, Association Rules, Summarization

Common Data Mining Algorithms

  1. Decision Trees (C4.5, CART, ID3)

    • Tree-like model for classification
    • Easy to interpret and visualize
  2. K-Means Clustering

    • Partitions data into K clusters
    • Based on distance from centroids
  3. Apriori Algorithm

    • Finds frequent itemsets
    • Used for association rule mining
  4. Naive Bayes

    • Probabilistic classifier
    • Based on Bayes' theorem
  5. Support Vector Machines (SVM)

    • Finds optimal hyperplane for classification
    • Effective for high-dimensional data
  6. Neural Networks

    • Layered model loosely inspired by biological neurons
    • Good for complex pattern recognition

1.3 Web Mining vs Data Mining

PYQ: Write down the difference between data mining and web mining. (2022-Feb, 7 marks)
PYQ: Web mining versus data mining. (2022-Dec, 7 marks; 2023, 7.5 marks)
PYQ: Discuss the differences and similarities between web mining and data mining. Also, list various applications of web mining. (2024-Dec, 8 marks)

Web Mining and Data Mining are closely related fields, but they focus on different types of data and have distinct challenges.

Comparison Table

Aspect | Data Mining | Web Mining
Definition | Extracting patterns from structured databases | Extracting patterns from web data
Data Source | Databases, data warehouses | Web pages, web logs, hyperlinks
Data Type | Mostly structured | Structured, semi-structured, unstructured
Data Volume | Large but manageable | Extremely large and growing
Data Nature | Static or slowly changing | Highly dynamic and volatile
Complexity | Relatively homogeneous | Highly heterogeneous
Techniques | Classification, clustering, association | Web content, structure, usage mining
Challenges | Data quality, scalability | Noise, redundancy, heterogeneity
Applications | Business intelligence, healthcare | Search engines, recommendation systems

Key Differences

┌─────────────────────────────────────────────────────────────────┐
│                    DATA MINING vs WEB MINING                    │
├─────────────────────────────┬───────────────────────────────────┤
│         DATA MINING         │           WEB MINING              │
├─────────────────────────────┼───────────────────────────────────┤
│ • Structured data           │ • Unstructured/semi-structured    │
│ • Databases/Warehouses      │ • Web pages/Logs/Links            │
│ • Controlled environment    │ • Open, distributed environment   │
│ • Known schema              │ • No fixed schema                 │
│ • Quality data              │ • Noisy, redundant data           │
│ • Slower updates            │ • Rapidly changing                │
│ • Domain-specific           │ • Cross-domain                    │
│ • SQL-based queries         │ • Crawlers and parsers            │
└─────────────────────────────┴───────────────────────────────────┘

Similarities

  1. Pattern Discovery: Both aim to find hidden patterns
  2. Algorithms: Many algorithms are shared (clustering, classification)
  3. Knowledge Extraction: Both convert data into actionable knowledge
  4. Preprocessing Required: Both need data cleaning and transformation
  5. Iterative Process: Both follow iterative refinement
  6. Business Value: Both provide competitive advantages

Why Web Mining is More Challenging

Challenge | Explanation
Heterogeneity | Web data comes in multiple formats (HTML, XML, JSON, multimedia)
Scale | Billions of web pages, constantly growing
Dynamism | Web content changes frequently
Noise | Advertisements, navigation elements, irrelevant content
Redundancy | Same information appears on multiple pages
Quality | No quality control; anyone can publish
Privacy | User tracking raises ethical concerns

1.4 Discovering Knowledge from Hypertext Data

PYQ: How to discover knowledge from hypertext data? Discuss in detail with a suitable example. (2022-Feb, 8 marks)
PYQ: Discovering knowledge from hypertext data. (2023, 7.5 marks)
PYQ: Explain the process to discover knowledge from hypertext data. (2024-May, 15 marks)
PYQ: How is knowledge discovered from hypertext data, and what are the key challenges involved in the process? (2024-Dec, 7 marks)

What is Hypertext Data?

PYQ: Write short notes on Hypertext data. (2023, 2.5 marks; 2024-May, 2.5 marks)

Hypertext is text displayed on a computer or electronic device with references (hyperlinks) to other text that the reader can immediately access. Hypertext data includes:

  • HTML Documents: Web pages with text, images, and links
  • Hyperlinks: Connections between documents
  • Anchor Text: Clickable text in hyperlinks
  • Metadata: Title, keywords, descriptions
  • Structure: DOM (Document Object Model) hierarchy

Hypertext Document Structure
│
├──► Content Elements
│    ├──► Text content
│    ├──► Images
│    ├──► Tables
│    └──► Forms
│
├──► Structural Elements
│    ├──► Headers (H1-H6)
│    ├──► Paragraphs
│    ├──► Lists
│    └──► Divisions
│
├──► Link Elements
│    ├──► Internal links
│    ├──► External links
│    ├──► Anchor text
│    └──► Navigation menus
│
└──► Metadata
     ├──► Title
     ├──► Description
     ├──► Keywords
     └──► Author

Knowledge Discovery Process from Hypertext

The process of discovering knowledge from hypertext data involves several steps:

┌────────────────────────────────────────────────────┐
│       KNOWLEDGE DISCOVERY FROM HYPERTEXT DATA      │
└────────────────────────────────────────────────────┘
                         │
         ┌───────────────▼───────────────┐
         │  Data Collection (Crawling)   │
         └───────────────────────────────┘
                         │
         ┌───────────────▼───────────────┐
         │ Data Preprocessing & Cleaning │
         └───────────────────────────────┘
                         │
         ┌───────────────▼───────────────┐
         │      Feature Extraction       │
         └───────────────────────────────┘
                         │
         ┌───────────────▼───────────────┐
         │   Data Analysis & Mining      │
         └───────────────────────────────┘
                         │
         ┌───────────────▼───────────────┐
         │      Pattern Discovery        │
         └───────────────────────────────┘
                         │
         ┌───────────────▼───────────────┐
         │   Knowledge Representation    │
         └───────────────────────────────┘

Step-by-Step Process:

Step 1: Data Collection (Web Crawling)
  • Use web crawlers/spiders to fetch web pages
  • Follow hyperlinks to discover new pages
  • Store HTML content for processing

Web Crawler Architecture
│
├──► Seed URLs
│    └──► Initial list of URLs to crawl
│
├──► URL Frontier
│    └──► Queue of URLs to be visited
│
├──► Fetcher
│    └──► Downloads web pages
│
├──► Parser
│    └──► Extracts links and content
│
└──► Repository
     └──► Stores crawled pages

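
The crawler components above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: `FAKE_WEB` is a hypothetical in-memory stand-in for the network so the code runs offline, and `fetch`, `LinkParser`, and `crawl` are names chosen for this example. A real crawler would also need politeness delays, robots.txt checks, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.com/b": '<a href="/c">C</a>',
    "http://example.com/c": "<p>No links here</p>",
}


class LinkParser(HTMLParser):
    """Parser component: collects the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Fetcher component (stub): returns HTML, or None for an unknown page."""
    return FAKE_WEB.get(url)


def crawl(seed_urls, max_pages=10):
    frontier = deque(seed_urls)  # URL frontier: queue of URLs to visit
    seen = set(seed_urls)        # avoids revisiting pages (and simple loops)
    repository = {}              # repository: stores crawled pages
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        repository[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return repository


pages = crawl(["http://example.com/"])
print(list(pages))
```

Using a queue gives breadth-first order; swapping it for a priority queue keyed on topical relevance would turn the same loop into a focused crawler.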
Step 2: Data Preprocessing
  • HTML Parsing: Extract text from HTML tags
  • Noise Removal: Remove ads, navigation, boilerplate
  • Tokenization: Break text into words/tokens
  • Stop Word Removal: Remove common words (the, is, at)
  • Stemming/Lemmatization: Reduce words to root form
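
The text-cleaning steps can be sketched as follows, assuming the HTML has already been reduced to plain text. The stop-word list is a tiny illustrative sample, and `crude_stem` is a deliberately simple suffix stripper written for this example; it is not the Porter stemmer or any library's algorithm.

```python
import re

# Tiny illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"the", "is", "at", "a", "an", "of", "and", "in", "on", "to", "are"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop very common words that carry little topical meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


tokens = preprocess("The crawlers are fetching documents and parsing the links")
print(tokens)
```

Stems such as "pars" are not dictionary words; that is normal for stemming, whereas lemmatization would map "parsing" to "parse" at the cost of needing a vocabulary.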
Step 3: Feature Extraction
Feature Type | Description | Example
Content Features | Text, keywords, topics | TF-IDF scores
Structural Features | HTML tags, headings | H1 tags, bold text
Link Features | Inlinks, outlinks, anchor text | PageRank score
Metadata Features | Title, description, keywords | Meta tags
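
As an example of a content feature, TF-IDF can be computed in a few lines. The sketch below uses one common variant (term frequency as count divided by document length, inverse document frequency as the natural log of N over document frequency); other weightings and smoothing schemes exist. The three token lists are made up for the illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["web", "mining", "web", "data"],
    ["data", "mining", "patterns"],
    ["web", "crawler", "links"],
]
scores = tf_idf(docs)
for i, doc_scores in enumerate(scores):
    top_term = max(doc_scores, key=doc_scores.get)
    print(i, top_term, round(doc_scores[top_term], 3))
```

A term scores highly when it is frequent in one document but rare across the collection, which is why "patterns" outranks "mining" in the second document.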
Step 4: Pattern Discovery

Apply mining techniques to discover:

  1. Content Patterns

    • Topic extraction
    • Keyword clustering
    • Document classification
  2. Link Patterns

    • Hub and authority pages
    • Community detection
    • Link prediction
  3. Usage Patterns

    • Navigation paths
    • User sessions
    • Click patterns
Step 5: Knowledge Representation
  • Ontologies: Formal knowledge representation
  • Knowledge Graphs: Entity-relationship networks
  • Taxonomies: Hierarchical classification
  • Rules: If-then patterns

Example: Knowledge Discovery from News Websites

Scenario: Discovering trending topics from news websites

Step 1: Crawl News Sites
        ├──► CNN, BBC, Reuters, etc.
        └──► Collect articles, headlines, links

Step 2: Preprocess Data
        ├──► Extract article text
        ├──► Remove ads and navigation
        └──► Tokenize and clean text

Step 3: Extract Features
        ├──► Keywords (TF-IDF)
        ├──► Named entities (people, places)
        └──► Categories (politics, sports)

Step 4: Discover Patterns
        ├──► Topic modeling (LDA)
        ├──► Trend detection
        └──► Sentiment analysis

Step 5: Generate Knowledge
        ├──► "Technology" trending this week
        ├──► "Climate change" frequently linked to "politics"
        └──► Positive sentiment around "sports events"

Challenges in Knowledge Discovery from Hypertext

PYQ: What are the key challenges involved in the process? (2024-Dec, 7 marks)

Challenge | Description | Solution
Heterogeneity | Different formats, languages, structures | Use NLP, multilingual processing
Noise | Ads, navigation, irrelevant content | Content extraction algorithms
Scale | Billions of pages | Distributed computing (MapReduce)
Dynamism | Frequent content changes | Incremental crawling
Ambiguity | Same word, different meanings | Word sense disambiguation
Link Spam | Fake links to manipulate rankings | Spam detection algorithms
Deep Web | Content behind forms/logins | Specialized crawlers
Multimedia | Images, videos, audio | Multimodal analysis

Algorithms for Hypertext Knowledge Discovery

  1. PageRank

    • Measures page importance based on links
    • Used by Google for search ranking
  2. HITS (Hyperlink-Induced Topic Search)

    • Identifies hubs (pages that link to many good authorities) and authorities (pages linked to by many good hubs)
    • Useful for topic-specific searches
  3. TF-IDF (Term Frequency-Inverse Document Frequency)

    • Measures word importance in documents
    • Used for keyword extraction
  4. Latent Dirichlet Allocation (LDA)

    • Topic modeling algorithm
    • Discovers hidden topics in documents
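
To make link-based importance concrete, here is a minimal power-iteration sketch of PageRank. It follows one common textbook formulation (damping factor 0.85, dangling pages spreading their rank evenly); the four-page `graph` is invented for the example, and every link target is assumed to appear as a key.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = list(graph)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in graph.items():
            if outlinks:
                # A page splits its damped rank equally among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C ranks highest because three pages link to it, and A comes second because it receives all of C's outgoing rank; D, with no inlinks, keeps only the baseline share.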

Section 2: Overview of Web Mining

2.1 What is Web Mining?

PYQ: What do you mean by web mining? What are its types? (2022-Feb, 8 marks)
PYQ: Define web mining. Explain its various issues and challenges. (2022-Jul, 15 marks)
PYQ: Write short notes on Web mining. (2022-Dec, 3 marks; 2024-May, 2.5 marks)

Definition of Web Mining

Web Mining is the application of data mining techniques to extract knowledge from web data, including web content, web structure, and web usage data.

"Web Mining is the use of data mining techniques to automatically discover and extract information from web documents and services." (Etzioni, 1996)

Why Web Mining?

Reason | Explanation
Information Overload | Billions of web pages; need automated analysis
Business Intelligence | Understanding customer behavior online
Personalization | Customizing content for users
Search Improvement | Better search engine results
E-commerce | Product recommendations, market analysis
Security | Detecting web spam, phishing, fraud

Components of Web Mining

Web Mining Components
│
├──► Data Sources
│    ├──► Web pages (HTML, XML)
│    ├──► Server logs
│    ├──► User profiles
│    ├──► Hyperlinks
│    └──► Metadata
│
├──► Techniques
│    ├──► Information retrieval
│    ├──► Natural language processing
│    ├──► Machine learning
│    ├──► Database querying
│    └──► Statistical analysis
│
└──► Applications
     ├──► Search engines
     ├──► Recommendation systems
     ├──► Personalization
     ├──► Web analytics
     └──► Sentiment analysis

2.2 Web Mining Taxonomy

PYQ: Explain web mining taxonomy, its issues, and challenges. (2024-May, 15 marks)

What is Web Mining Taxonomy?

Web Mining Taxonomy classifies web mining into different categories based on the type of data being mined and the techniques used. It helps in understanding the various aspects of web mining and their specific applications.

Classification of Web Mining

Web Mining is categorized into three main types based on the type of data being mined:

┌───────────────────────────────────────────────────┐
│                 WEB MINING TAXONOMY               │
└─────────────────────────┬─────────────────────────┘
      ┌───────────────────┼───────────────────┐
┌─────▼─────┐       ┌─────▼─────┐       ┌─────▼─────┐
│   WEB     │       │   WEB     │       │   WEB     │
│ CONTENT   │       │ STRUCTURE │       │  USAGE    │
│  MINING   │       │  MINING   │       │  MINING   │
└─────┬─────┘       └─────┬─────┘       └─────┬─────┘
┌─────┴─────┐       ┌─────┴──────┐      ┌─────┴─────┐
│• Text     │       │• Hyperlinks│      │• Server   │
│• Images   │       │• Document  │      │  Logs     │
│• Audio    │       │  Structure │      │• Cookies  │
│• Video    │       │• Web Graph │      │• Sessions │
└───────────┘       └────────────┘      └───────────┘

1. Web Content Mining (WCM)

Definition: Mining the content of web pages (text, images, audio, video).

Data Sources:

  • HTML/XML documents
  • Text content
  • Multimedia (images, videos, audio)
  • Structured data (tables, lists)

Techniques:

Technique | Description
Text Mining | Extracting information from text
NLP | Understanding natural language
Information Extraction | Extracting structured data
Topic Modeling | Discovering hidden topics
Sentiment Analysis | Determining opinions/emotions

Applications:

  • Search engines (Google, Bing)
  • Question answering systems
  • Document summarization
  • Content categorization

2. Web Structure Mining (WSM)

Definition: Mining the hyperlink structure of the web to discover useful patterns.

Data Sources:

  • Hyperlinks (inlinks, outlinks)
  • Anchor text
  • Document structure (DOM)
  • Site structure

Techniques:

Technique | Description
Link Analysis | Analyzing hyperlink patterns
PageRank | Measuring page importance
HITS Algorithm | Finding hubs and authorities
Community Detection | Finding related page groups

Applications:

  • Search engine ranking
  • Web page classification
  • Finding authoritative sources
  • Detecting web spam
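
As an illustration of structure mining, the HITS hub and authority scores can be computed iteratively. This is a simplified sketch: the link `graph` and its page names are invented, and the full algorithm first builds a query-specific subgraph, which is skipped here.

```python
import math


def hits(graph, iterations=20):
    """graph: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in pages}
        for page, targets in graph.items():
            for t in targets:
                auth[t] += hub[page]
        # Hub update: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalise both vectors so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: three linking pages pointing at three papers.
graph = {
    "portal1": ["paperA", "paperB"],
    "portal2": ["paperA", "paperB", "paperC"],
    "blog": ["paperA"],
}
hub, auth = hits(graph)
print("best hub:", max(hub, key=hub.get))
print("best authority:", max(auth, key=auth.get))
```

Unlike PageRank, which assigns one score per page, HITS keeps two mutually reinforcing scores, so a page can be a strong hub while being a weak authority.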

3. Web Usage Mining (WUM)

Definition: Mining user access patterns from web server logs and user data.

Data Sources:

  • Web server logs
  • Proxy server logs
  • Browser cookies
  • User profiles
  • Click-through data

Techniques:

Technique | Description
Session Analysis | Identifying user sessions
Path Analysis | Finding navigation patterns
Clickstream Mining | Analyzing click sequences
Collaborative Filtering | Finding similar users

Applications:

  • Personalization
  • Recommendation systems
  • Website optimization
  • User behavior prediction
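
Session identification, the usual first step of usage mining, can be sketched with a simple inactivity timeout. The 30-minute cutoff is a widely used heuristic rather than a fixed rule, and the `log` entries are hypothetical, already-parsed records; real server logs need parsing, and IP addresses only approximate users.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed heuristic cutoff

# Hypothetical, already-parsed log entries: (user or IP, timestamp, URL).
log = [
    ("10.0.0.1", "2024-05-01 10:00:00", "/home"),
    ("10.0.0.1", "2024-05-01 10:05:00", "/products"),
    ("10.0.0.2", "2024-05-01 10:06:00", "/home"),
    ("10.0.0.1", "2024-05-01 10:07:00", "/cart"),
    ("10.0.0.1", "2024-05-01 11:30:00", "/home"),
]


def sessionize(entries, timeout=SESSION_TIMEOUT):
    """Group each user's requests into sessions split by inactivity gaps."""
    parsed = sorted(
        (user, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), url)
        for user, ts, url in entries
    )
    sessions = []   # list of (user, [urls in visit order])
    last_seen = {}  # user -> (time of last request, index of open session)
    for user, ts, url in parsed:
        if user in last_seen and ts - last_seen[user][0] <= timeout:
            index = last_seen[user][1]
            sessions[index][1].append(url)
        else:
            sessions.append((user, [url]))
            index = len(sessions) - 1
        last_seen[user] = (ts, index)
    return sessions


sessions = sessionize(log)
for user, path in sessions:
    print(user, " -> ".join(path))
```

The resulting paths are the input for path analysis and sequential pattern mining, for example counting how often "/products" is followed by "/cart".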

Comparison of Web Mining Types

Aspect | Content Mining | Structure Mining | Usage Mining
Data | Page content | Hyperlinks | Server logs
Focus | What is said | How pages connect | How users behave
View | Intra-page | Inter-page | User interaction
Algorithms | NLP, ML | Graph algorithms | Sequential patterns
Output | Topics, entities | Rankings, communities | Patterns, profiles

2.3 Web Mining Subtasks

PYQ: What are web mining subtasks? Discuss in detail with a suitable example. (2022-Feb, 7 marks)
PYQ: Explain web mining subtasks, issues, and challenges. (2023, 15 marks)

Overview of Web Mining Subtasks

Web Mining consists of several interconnected subtasks that work together to extract knowledge from web data:

Web Mining Subtasks
│
├──► Resource Discovery
│    └──► Finding relevant web resources
│
├──► Information Selection & Preprocessing
│    └──► Selecting and preparing data
│
├──► Generalization
│    └──► Discovering patterns
│
└──► Analysis
     └──► Interpreting results

Subtask 1: Resource Discovery

Purpose: Locating and retrieving relevant web documents and resources.

Activities:

  • Web Crawling: Automated traversal of the web to collect pages
  • Focused Crawling: Targeting specific topics or domains
  • Deep Web Access: Retrieving content behind forms and databases
  • API Integration: Collecting data from web services

Tools:

  • Web crawlers (Scrapy, Apache Nutch)
  • Search engine APIs (Google, Bing)
  • Web scraping tools (BeautifulSoup, Selenium)

Subtask 2: Information Selection and Preprocessing

Purpose: Extracting and cleaning relevant information from collected resources.

Activities:

Activity | Description
HTML Parsing | Extracting content from HTML tags
Noise Removal | Removing ads, navigation, scripts
Text Extraction | Getting clean text content
Feature Selection | Identifying important attributes
Data Transformation | Converting to suitable format

Example Process:

    Raw HTML Page
         │
         ▼
┌──────────────────┐
│ Remove HTML tags │
│ Remove scripts   │
│ Remove CSS       │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Extract text     │
│ Tokenize         │
│ Remove stopwords │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Stemming         │
│ Feature vectors  │
│ TF-IDF scores    │
└──────────────────┘

Subtask 3: Generalization (Pattern Discovery)

Purpose: Applying data mining algorithms to discover patterns and knowledge.

Techniques Used:

Technique | Application in Web Mining
Clustering | Grouping similar web pages
Classification | Categorizing web content
Association Rules | Finding co-occurring elements
Sequential Patterns | Discovering navigation paths
Link Analysis | Finding important pages

Example: Clustering news articles by topic

Input: 1000 news articles
    │
    ▼
K-Means Clustering (k=10)
    │
    ▼
Output: 10 topic clusters
├── Cluster 1: Politics (150 articles)
├── Cluster 2: Sports (200 articles)
├── Cluster 3: Technology (120 articles)
├── Cluster 4: Entertainment (180 articles)
└── ... and so on
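
The clustering step itself can be sketched in plain Python. For readability this example clusters six made-up 2-D points with k=2 instead of 1000 article vectors with k=10; for news articles the points would be high-dimensional feature vectors such as TF-IDF scores. Seeding the centroids with the first k points is a simplification that keeps the run deterministic; real implementations usually randomise the start.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of floats; initial centroids are the first k points."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(dim) / len(members) for dim in zip(*members)
                )
    return assignment, centroids


# Hypothetical feature vectors forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

K-means only returns numbered clusters; labels such as "Politics" or "Sports" come from inspecting each cluster afterwards, for instance by looking at its highest-weighted terms.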

Subtask 4: Analysis and Validation

Purpose: Interpreting discovered patterns and validating their usefulness.

Activities:

  • Pattern Evaluation: Assessing pattern quality and significance
  • Visualization: Presenting results graphically
  • Validation: Verifying patterns against domain knowledge
  • Interpretation: Understanding what patterns mean
  • Application: Using knowledge for decision-making

Metrics for Evaluation:

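  • Support: Fraction of records in which a pattern occurs
  • Confidence: How often the consequent of a rule holds when its antecedent holds
  • Lift: Strength of an association relative to what independence would predict (lift > 1 indicates a positive association)
  • Precision/Recall: Fraction of predicted positives that are correct / fraction of actual positives that are found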
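
These metrics are easy to compute by hand on a small example. In the sketch below the `transactions` (sets of pages visited in one session) and the predicted/actual page sets are invented for illustration; the formulas are the standard definitions.

```python
# Hypothetical sessions, each a set of visited pages.
transactions = [
    {"home", "products", "cart"},
    {"home", "products"},
    {"home", "blog"},
    {"products", "cart"},
    {"home", "products", "cart", "checkout"},
]


def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Confidence relative to the consequent's baseline support."""
    return confidence(antecedent, consequent) / support(consequent)


def precision_recall(predicted, actual):
    """Precision and recall of a predicted set against the true set."""
    true_positives = len(predicted & actual)
    return true_positives / len(predicted), true_positives / len(actual)


rule = ({"products"}, {"cart"})
print("support   ", round(support(rule[0] | rule[1]), 2))
print("confidence", round(confidence(*rule), 2))
print("lift      ", round(lift(*rule), 2))

precision, recall = precision_recall({"p1", "p2", "p3", "p4"}, {"p1", "p2", "p5"})
print("precision ", round(precision, 2), "recall", round(recall, 2))
```

For the rule products → cart the lift is above 1, meaning visitors of the products page reach the cart more often than visitors in general.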

2.4 Web Mining Issues and Challenges

PYQ: Define web mining. Explain its various issues and challenges. (2022-Jul, 15 marks)
PYQ: What are the key issues in web mining? (2022-Feb, 8 marks)
PYQ: Discuss the key issues and challenges faced in web mining. (2024-Dec, 8 marks)

Major Issues in Web Mining

While web mining offers significant opportunities, it also presents several challenges:

Web Mining Issues
│
├──► Data-Related Issues
│    ├──► Volume (scale)
│    ├──► Variety (heterogeneity)
│    ├──► Velocity (dynamism)
│    └──► Veracity (quality)
│
├──► Technical Issues
│    ├──► Crawling efficiency
│    ├──► Storage requirements
│    ├──► Processing complexity
│    └──► Algorithm scalability
│
├──► Semantic Issues
│    ├──► Ambiguity
│    ├──► Context understanding
│    ├──► Multilingual content
│    └──► Synonymy/Polysemy
│
└──► Ethical Issues
     ├──► Privacy concerns
     ├──► Copyright violations
     └──► Data misuse

Detailed Challenges

| Challenge | Description | Impact | Solution Approach |
|---|---|---|---|
| Scalability | Billions of web pages to process | High computational cost | Distributed computing, sampling |
| Heterogeneity | Different formats (HTML, PDF, images) | Complex preprocessing | Multi-format parsers, NLP |
| Dynamism | Web content changes frequently | Outdated information | Incremental crawling, change detection |
| Noise | Ads, navigation, irrelevant content | Poor mining quality | Content extraction algorithms |
| Redundancy | Same information on multiple pages | Wasted resources | Duplicate detection, deduplication |
| Sparsity | Useful information is sparse | Low signal-to-noise ratio | Feature selection, filtering |
| Privacy | User data collection concerns | Legal and ethical issues | Anonymization, consent mechanisms |
| Spam | Fake content to manipulate rankings | Misleading results | Spam detection algorithms |
| Deep Web | Content behind logins/forms | Incomplete coverage | Specialized crawlers, APIs |
| Multilingual | Content in many languages | Complex analysis | Machine translation, multilingual NLP |
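
As one illustration of the duplicate detection listed for the Redundancy challenge, the sketch below compares pages by word shingles and Jaccard similarity. The page texts, the shingle size `n=3`, and the `threshold=0.5` cut-off are assumptions chosen for the example, not recommended settings.

```python
def shingles(text, n=3):
    """Set of overlapping n-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicates(pages, threshold=0.5, n=3):
    """Return pairs of page ids whose shingle sets are at least
    `threshold` similar. Compares every pair, so it is quadratic."""
    ids = sorted(pages)
    sets = {pid: shingles(pages[pid], n) for pid in ids}
    pairs = []
    for i, p in enumerate(ids):
        for q in ids[i + 1:]:
            if jaccard(sets[p], sets[q]) >= threshold:
                pairs.append((p, q))
    return pairs


# Hypothetical page texts keyed by a page id.
pages = {
    "a": "web mining extracts knowledge from web data",
    "b": "web mining extracts knowledge from web logs",
    "c": "football scores and league tables for today",
}
print(near_duplicates(pages))
```

Pages "a" and "b" share 4 of their 6 distinct shingles (similarity about 0.67), so they are flagged; page "c" shares none. Comparing all pairs does not scale to billions of pages, which is why large systems usually rely on approximate techniques such as MinHash with locality-sensitive hashing.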

Technical Challenges in Detail

1. Web Crawling Challenges

| Issue | Description |
|---|---|
| Politeness | Respecting robots.txt and server limits |
| Freshness | Keeping crawled data up-to-date |
| Coverage | Reaching all relevant pages |
| Efficiency | Minimizing bandwidth and time |
| Traps | Avoiding infinite loops (spider traps) |
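
Two of these issues, politeness and traps, can be sketched with the standard library alone. The example below crawls an in-memory link graph instead of the real network; the `example.com` URLs, the `StudyBot` agent name, and the `max_depth` limit are hypothetical. A production crawler would also need rate limiting, which is omitted here.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def crawl(start, links, robots_txt, agent="StudyBot", max_depth=3):
    """Breadth-first crawl over `links` (url -> list of outgoing urls).

    Politeness: skips any URL that robots_txt disallows for `agent`.
    Trap avoidance: a visited set stops cycles, max_depth bounds chains.
    """
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    visited, order = set(), []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        if not robots.can_fetch(agent, url):
            continue  # disallowed by robots.txt
        visited.add(url)
        order.append(url)
        for nxt in links.get(url, []):
            queue.append((nxt, depth + 1))
    return order


# Hypothetical site: /a and /b link to each other, forming a cycle.
robots_txt = "User-agent: *\nDisallow: /private/"
links = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/private/x",
    ],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": ["https://example.com/a"],
}
order = crawl("https://example.com/", links, robots_txt)
print(order)
```

The crawl visits the home page, `/a`, and `/b` once each, never fetches `/private/x`, and terminates despite the cycle. The depth limit is a blunt guard: it also cuts off legitimate deep pages, so it trades coverage for safety.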
2. Data Quality Challenges
Data Quality Problems
│
├──► Incomplete Data
│    └──► Missing pages, broken links
│
├──► Inconsistent Data
│    └──► Same entity, different representations
│
├──► Inaccurate Data
│    └──► Outdated, incorrect information
│
├──► Noisy Data
│    └──► Ads, boilerplate, irrelevant content
│
└──► Biased Data
     └──► Over-representation of popular sites
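
One common case of "same entity, different representations" is a single page reachable under several URL spellings. The sketch below normalizes URLs so such variants compare equal. The normalization rules and the `TRACKING_PARAMS` list are assumptions for illustration; real crawlers apply site-specific rules.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of query parameters that do not change page content.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Return a normalized form of url so equivalent spellings match."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Drop tracking parameters and sort the rest for a stable order.
    query_items = [
        (k, v) for k, v in parse_qsl(parts.query) if k not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(query_items))
    # The fragment (#...) is client-side only, so it is discarded.
    return urlunsplit((scheme, netloc, path, query, ""))


print(canonicalize("HTTP://Example.com:80/news?utm_source=x&id=7#top"))
print(canonicalize("http://example.com/news?id=7"))
```

Both calls yield the same string, so the two spellings would be stored and counted as one page.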
3. Semantic Challenges

| Challenge | Example |
|---|---|
| Synonymy | "car" = "automobile" = "vehicle" |
| Polysemy | "bank" = financial institution OR river bank |
| Context | "Apple" = fruit OR company |
| Sarcasm | "Great service!" (could be negative) |
| Implicit meaning | Understanding unstated information |
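
The polysemy example can be made concrete with a toy disambiguator that picks the sense whose cue words overlap most with the surrounding sentence, in the spirit of the Lesk algorithm. The `SENSES` inventory and its cue words are invented for this sketch; a real system would draw on a dictionary or learned embeddings.

```python
# Hypothetical mini sense inventory: word -> sense -> cue words.
SENSES = {
    "bank": {
        "financial institution": {"money", "loan", "account", "deposit", "interest"},
        "river bank": {"river", "water", "shore", "fishing", "mud"},
    }
}


def disambiguate(word, sentence, senses=SENSES):
    """Pick the sense of `word` whose cue words overlap most with the
    other words in `sentence`. Ties go to the first sense listed."""
    context = set(sentence.lower().split()) - {word}
    scores = {
        sense: len(context & cues) for sense, cues in senses[word].items()
    }
    return max(scores, key=scores.get)


print(disambiguate("bank", "she opened an account at the bank to deposit money"))
print(disambiguate("bank", "we sat on the bank of the river fishing"))
```

The first sentence matches "account", "deposit", and "money"; the second matches "river" and "fishing". The sketch ignores punctuation and word forms, and it falls back to the first sense when no cue word appears, which shows why context understanding is listed as a challenge.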

Solutions to Web Mining Challenges

| Challenge | Solution |
|---|---|
| Scale | MapReduce, Spark, distributed systems |
| Heterogeneity | Universal parsers, format converters |
| Dynamism | Change detection, incremental updates |
| Noise | Machine learning classifiers, DOM analysis |
| Privacy | Differential privacy, anonymization |
| Spam | Link analysis, content-based detection |
| Multilingual | Neural machine translation, multilingual embeddings |
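
Change detection for the Dynamism row can be sketched by storing a hash of each page and comparing it on the next crawl. The page paths and texts below are hypothetical, and hashing the full text is only one possible choice: it flags any edit, including trivial ones such as a changed timestamp.

```python
import hashlib


def fingerprint(content):
    """SHA-256 hex digest of the page text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def detect_changes(previous, current):
    """Compare two crawls.

    previous: url -> fingerprint stored from the last crawl
    current:  url -> page text fetched in this crawl
    """
    report = {"new": [], "changed": [], "unchanged": [], "removed": []}
    for url, text in current.items():
        if url not in previous:
            report["new"].append(url)
        elif previous[url] != fingerprint(text):
            report["changed"].append(url)
        else:
            report["unchanged"].append(url)
    report["removed"] = [url for url in previous if url not in current]
    for urls in report.values():
        urls.sort()
    return report


# Hypothetical state from the last crawl and the pages seen now.
previous = {
    "/home": fingerprint("welcome"),
    "/news": fingerprint("old headline"),
    "/about": fingerprint("about us"),
}
current = {
    "/home": "welcome",
    "/news": "new headline",
    "/contact": "email us",
}
report = detect_changes(previous, current)
print(report)
```

Only pages in the "new" and "changed" lists need to be re-processed, which is the idea behind incremental updates.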

2.5 Applications of Web Mining

PYQ: What are some common applications of web mining? How do they benefit from web mining techniques to improve decision-making processes? (2024-Dec, 7 marks)
PYQ: List various applications of web mining. (2024-Dec, 8 marks)

Major Applications

| Domain | Application | Benefit |
|---|---|---|
| Search Engines | Google, Bing, DuckDuckGo | Relevant search results |
| E-commerce | Amazon, Flipkart recommendations | Increased sales |
| Social Media | Trend detection, sentiment analysis | User engagement |
| Marketing | Customer segmentation, targeting | Better ROI |
| Security | Fraud detection, phishing prevention | Risk reduction |
| Healthcare | Medical information retrieval | Better patient care |
| Education | E-learning personalization | Improved outcomes |
| News | Topic tracking, fake news detection | Informed readers |

Application Examples

1. Search Engine Optimization (SEO)
  • Analyzing web structure for ranking
  • Keyword analysis and optimization
  • Competitor analysis
2. Recommendation Systems
  • Product recommendations (Amazon)
  • Content recommendations (Netflix, YouTube)
  • Friend suggestions (Facebook, LinkedIn)
3. Web Analytics
  • User behavior analysis
  • Conversion optimization
  • A/B testing insights
4. Competitive Intelligence
  • Market trend analysis
  • Competitor monitoring
  • Price comparison
5. Customer Relationship Management (CRM)
  • Customer profiling
  • Churn prediction
  • Personalized marketing

Summary Table: Unit 1 Key Concepts

| Topic | Key Points |
|---|---|
| Data Mining | KDD process, techniques (classification, clustering), applications |
| Web Mining vs Data Mining | Data source, structure, dynamism, challenges |
| Hypertext Knowledge Discovery | Crawling, preprocessing, feature extraction, pattern discovery |
| Web Mining Taxonomy | Content mining, structure mining, usage mining |
| Web Mining Subtasks | Resource discovery, preprocessing, generalization, analysis |
| Issues & Challenges | Scale, heterogeneity, dynamism, noise, privacy, spam |
| Applications | Search engines, e-commerce, social media, security |

Quick Revision: Important Definitions

| Term | Definition |
|---|---|
| Data Mining | Extracting patterns from large databases using statistical and ML techniques |
| Web Mining | Applying data mining to extract knowledge from web data |
| Hypertext | Text with hyperlinks to other documents |
| Web Content Mining | Mining the content of web pages |
| Web Structure Mining | Mining the hyperlink structure of the web |
| Web Usage Mining | Mining user access patterns from logs |
| Knowledge Discovery | Process of extracting useful knowledge from data |
| Web Crawler | Program that automatically traverses the web |

Expected Questions for Exam

15 Marks Questions

  1. Web Mining vs Data Mining (with comparison table)
  2. Knowledge Discovery from Hypertext Data (with process diagram)
  3. Web Mining Taxonomy (all three types in detail)
  4. Web Mining Issues and Challenges (comprehensive list)

7-8 Marks Questions

  1. Applications of Data Mining in Healthcare
  2. Web Mining Subtasks
  3. Social Impacts of Data Mining
  4. Spatial Data Mining

2.5-3 Marks Questions

  1. Define Mining / Data Mining / Web Mining
  2. What is Hypertext Data?
  3. Types of Web Mining
  4. Any two challenges in Web Mining

These notes were compiled by Deepak Modi
Last updated: December 2025
