BDA, Semester 8

Unit 1: Introduction to Big Data

Why and Where Big Data, Applications and Challenges, Characteristics (6 V's), Dimensions of Scalability, Data Science process, Foundations of Big Data Systems, and Distributed File Systems.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Introduction to Big Data: Big Data: Why and Where, Application and Challenges, Characteristics of Big Data and Dimensions of Scalability, The Six V, Data Science: Getting Value out of Big Data, Steps in the Data science process, Foundations for Big Data Systems and Programming, Distributed file systems.


🎯 PYQ Analysis for Unit 1

High Priority Topics (15 marks questions)

  1. 6 V's of Big Data (Characteristics) - (2024: 15 marks, 2023: 15 marks, 2022: 15 marks)
  2. HDFS: NameNode, DataNode, Blocks, Operations & Commands - (2023: 15 marks, 2022: 15 marks) → Section 7 + 7.8
  3. Big Data Definition + Characteristics + Applications + Challenges - (2024: 15 marks, 2023: 15 marks, 2022: 15 marks)
  4. Data Science Process (steps + real-world example) - (2024: 15 marks) → Section 5 + 5.4
  5. Big Data Platform: Main Features (5 characteristics) - (2022: 15 marks) → Section 9
  6. Foundation for Big Data System - (2023: 15 marks) → Section 6
  7. Big Data Analytics Techniques (A/B testing, Data Mining, ML, NLP, Statistics, Data Fusion) - (2022: 15 marks) → Section 8
  8. Types of Data: Measurement Scales (Nominal / Ordinal / Interval / Ratio) - (2022: 15 marks) → Section 10
  9. Challenges in Big Data - (2022: 15 marks) → Section 2.2

Medium Priority Topics (Short answers)

  1. Six V in Big Data - 2024 (2.5), 2022 (2.5)
  2. Big Data definition - 2022 (2.5)
  3. YARN: Components & Features - 2022 (2.5) → Section 6.4
  4. Data Sciences - 2023 (2.5)
  5. DFS (Distributed File System) - 2023 (2.5)
  6. 5 challenges of Big Data - 2024 (2.5)
  7. Hadoop (short note: HDFS + MapReduce + YARN) - 2024 (7.5 marks) → Section 11

Section 1: Big Data - Why and Where

1.1 What is Big Data?

PYQ: Explain Big Data. (2022, 2.5 marks)
PYQ: What is Big Data? Explain various characteristics, challenges and applications of Big Data. (2023, 15 marks)
PYQ: Define different techniques in Big Data analytics. (2022, 15 marks)

Definition:

Big Data refers to extremely large and complex datasets that cannot be processed, stored, or analyzed using traditional data management tools (like regular databases or spreadsheets) within a reasonable time.

The term "Big Data" describes data that is:

  • Too large to fit on one machine.
  • Generated too fast to process in real time with traditional systems.
  • Too varied in format to fit in a single table.

Simple Definition: Big Data is data that is so big, fast, or complex that traditional methods can't handle it.

Real-Life Scale (approximate, commonly cited figures):

  Google processes   → 8.5 billion searches per day
  YouTube receives   → 500 hours of video uploaded per minute
  Facebook generates → 4 petabytes of data per day
  Twitter sees       → 500 million tweets per day
  WhatsApp sends     → 100 billion messages per day

1.2 Why Big Data? (The Need)

Data has always existed, but several forces have exploded the volume beyond what old tools can handle:

1. Digitization of Everything:

  • Everything is now recorded digitally: transactions, clicks, GPS location, health sensors.
  • Every smartphone is a data generator.

2. Internet and Social Media:

  • Billions of people use the internet every day.
  • Each like, comment, post, search, and purchase generates data.

3. IoT (Internet of Things):

  • Smart devices (sensors, wearables, smart appliances) continuously stream data.
  • Example: A smart factory has thousands of sensors generating data every second.

4. E-commerce and Digital Transactions:

  • Every purchase, product view, and abandoned cart is recorded.
  • Amazon processes millions of transactions per day.

5. Healthcare and Science:

  • Genomics, medical imaging, clinical trials generate enormous datasets.
  • A single human genome = ~3 GB of data.

6. Cheaper Storage and Computing:

  • Cost of storing 1 GB dropped from hundreds of thousands of dollars (around 1980) to fractions of a cent (now).
  • Cloud computing made large-scale processing affordable.

1.3 Where Does Big Data Come From? (Sources)

Data Sources:

                    ┌──────────────────────────┐
                    │     Big Data Sources     │
                    └────────────┬─────────────┘
         ┌──────────────┬────────┴────────┬──────────────┐
         ▼              ▼                 ▼              ▼
   Social Media    Machine / IoT    Transactional    Web / Logs
   ───────────    ─────────────    ─────────────    ──────────
   Facebook       Sensors          Bank records     Clickstreams
   Twitter        GPS devices      E-commerce       Server logs
   Instagram      Smart meters     Healthcare       Search queries
   LinkedIn       RFID tags        Insurance        App usage

Categories of Data Sources:

Source Type        Examples                        Data Type
───────────        ────────                        ─────────
Social Media       Facebook, Twitter, YouTube      Text, images, video
Machine / Sensor   IoT devices, GPS, RFID          Numeric streams
Transaction        Banks, e-commerce, hospitals    Structured records
Web                Logs, click-streams, search     Semi-structured
Scientific         Genomics, astronomy, climate    Numeric, images
Enterprise         ERP, CRM, supply chain          Structured

Section 2: Applications and Challenges

PYQ: Discuss the following in detail: (i) Challenges in big data (ii) Types of Data. (2022, 15 marks)
PYQ: Enlist 5 challenges associated with managing and analyzing large volumes of data. (2024, 2.5 marks)
PYQ: Describe any five real life applications of Big Data. (2022, 15 marks)
PYQ: Write short note on real life applications of Big Data. (2023, 15 marks)

2.1 Applications of Big Data

1. Healthcare:

  • Disease prediction: analyzing patient history to predict illness.
  • Drug discovery: finding drug interactions using genetic data.
  • Epidemic tracking: monitoring disease spread (COVID-19 dashboards).
  • Personalized medicine: treatment based on individual genome.

2. Finance and Banking:

  • Fraud detection: real-time anomaly detection on credit card transactions.
  • Risk assessment: evaluating loan default probability.
  • Algorithmic trading: high-frequency trading decisions from market data.
  • Customer 360: full customer profile from all touchpoints.

3. Retail and E-commerce:

  • Recommendation engines: Amazon/Netflix "you may also like".
  • Demand forecasting: predicting inventory needs.
  • Price optimization: dynamic pricing based on demand.
  • Customer sentiment: analyzing reviews and social media.

4. Transportation:

  • Traffic management: real-time rerouting (Google Maps).
  • Predictive maintenance: airlines predicting engine failures before they happen.
  • Ride-sharing optimization: Uber/Ola surge pricing and driver allocation.
  • Self-driving cars: processing sensor data in real time.

5. Government and Smart Cities:

  • Public safety: crime hotspot prediction.
  • Energy management: smart grid optimization.
  • Citizen services: tax fraud detection, welfare eligibility.
  • Disaster response: resource allocation using real-time data.

6. Telecommunications:

  • Churn prediction: identifying customers likely to switch providers.
  • Network optimization: identifying bottlenecks using call data records.
  • Personalized plans: usage-based plan recommendations.

7. Manufacturing:

  • Predictive maintenance: preventing machine breakdowns using sensor data.
  • Quality control: defect detection using computer vision.
  • Supply chain optimization: reducing delays and wastage.

8. Agriculture:

  • Crop yield prediction: using satellite + weather data.
  • Precision farming: targeted irrigation and fertilization.
  • Disease detection: identifying crop disease from drone imagery.

2.2 Challenges of Big Data

Despite its potential, Big Data comes with significant challenges:

1. Storage:

  • Petabytes and exabytes of data need massive, scalable storage.
  • Traditional RDBMS cannot store semi-structured or unstructured data.

2. Processing Speed:

  • Data arrives faster than it can be processed (streaming data).
  • Batch processing too slow for real-time use cases.

3. Data Variety:

  • Data comes in structured, semi-structured, and unstructured formats.
  • Integrating text, images, video, JSON, XML, CSV is complex.

4. Data Quality:

  • Raw big data is often dirty: missing values, duplicates, inconsistencies.
  • "Garbage In, Garbage Out": poor quality input = poor insights.

5. Privacy and Security:

  • Handling sensitive data (medical, financial) requires strict compliance.
  • Risk of data breaches at scale.
  • GDPR, HIPAA, and other regulations must be followed.

6. Talent Gap:

  • Shortage of skilled data engineers, data scientists, and architects.

7. Cost:

  • Infrastructure (servers, cloud, bandwidth) is expensive.
  • ROI must justify the investment.

8. Data Governance:

  • Deciding who owns data, who can access it, and how long to keep it.

Section 3: Characteristics of Big Data - The 6 V's

PYQ: Six V in big data. (2022, 2.5 marks)
PYQ: Briefly elaborate the six V of big data. (2024, 2.5 marks)
PYQ: Explain six V's of Big Data in detail. (2023, 15 marks)
PYQ: Explain the characteristics of Big Data and discuss how they contribute to the challenges in managing large volumes of data. (2024, 15 marks)

3.1 What are the V's of Big Data?

The characteristics of Big Data are described using V's. The original 3 V's (Volume, Velocity, Variety) were proposed in 2001 by Doug Laney of META Group (later acquired by Gartner); the list has since expanded to 6 V's.

                          ┌─────────────┐
                          │  BIG DATA   │
                          └──────┬──────┘
     ┌──────────┬──────────┬─────┴────┬──────────┬──────────┐
     ▼          ▼          ▼          ▼          ▼          ▼
  Volume    Velocity    Variety   Veracity     Value   Variability

3.2 The 6 V's - Detailed

V1 - Volume

Definition: The amount / size of data being generated and stored.

  • We have moved from gigabytes → terabytes → petabytes → exabytes.
  • Traditional single-server systems cannot store or process data at this scale.

Scale Reference (decimal units; examples are rough orders of magnitude):

1 KB  = 1,000 bytes
1 MB  = 1,000 KB     (a song is a few MB)
1 GB  = 1,000 MB     (a movie)
1 TB  = 1,000 GB     (a library of books)
1 PB  = 1,000 TB     (Facebook stores ~100 PB)
1 EB  = 1,000 PB     (global internet traffic is measured in EB per month)
1 ZB  = 1,000 EB     (total global data in 2020 ≈ 40 ZB)

Challenge: Where to store this? → Distributed File Systems (HDFS)


V2 - Velocity

Definition: The speed at which data is generated and must be processed.

  • Twitter generates ~6,000 tweets per second.
  • Stock exchanges handle very high volumes of orders and trades every second.
  • IoT devices stream data continuously.

Types of Processing:

Batch Processing:   Collect data → Process later (hours/days)
                    Example: Monthly billing reports

Stream Processing:  Process data as it arrives (milliseconds)
                    Example: Fraud detection, live traffic

Challenge: How to process fast enough? → Apache Kafka, Apache Storm, Spark Streaming
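
The contrast between batch and stream processing can be sketched in a few lines of Python. This is only an illustration of the idea, not how Kafka, Storm, or Spark Streaming work internally; the function names `batch_total` and `stream_totals` and the sample amounts are made up for the example.

```python
from typing import Iterable, Iterator, List


def batch_total(transactions: List[float]) -> float:
    """Batch style: wait until all records are collected, then process once."""
    return sum(transactions)


def stream_totals(transactions: Iterable[float]) -> Iterator[float]:
    """Stream style: update the result as each record arrives."""
    running = 0.0
    for amount in transactions:
        running += amount
        yield running  # an up-to-date result exists after every event


amounts = [120.0, 35.5, 990.0, 12.5]   # hypothetical transaction amounts
print(batch_total(amounts))            # one answer, available only at the end
print(list(stream_totals(amounts)))    # one answer per incoming record
```

Both styles reach the same final total; the difference is when a usable answer becomes available.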


V3 - Variety

Definition: The diversity of data types and formats in Big Data.

Three Types of Data:

┌──────────────────┬───────────────────┬─────────────────────┐
│   Structured     │ Semi-Structured   │   Unstructured      │
├──────────────────┼───────────────────┼─────────────────────┤
│ Fixed schema     │ Flexible schema   │ No fixed schema     │
│ Rows & columns   │ Tags / key-value  │ Free-form           │
│ SQL databases    │ JSON, XML, CSV    │ Text, images, video │
│ Example:         │ Example:          │ Example:            │
│ Bank records     │ Twitter API data  │ Medical images      │
└──────────────────┴───────────────────┴─────────────────────┘

Challenge: How to store and query all formats? → NoSQL databases, Data Lakes


V4 - Veracity

Definition: The quality, accuracy, and trustworthiness of the data.

  • Big Data is often noisy, incomplete, biased, or inconsistent.
  • Wrong data leads to wrong conclusions → dangerous in healthcare or finance.

Sources of bad veracity:

  • Sensor malfunction → wrong readings.
  • User input errors → misspelled names, wrong dates.
  • Social media noise → sarcasm, fake news, bots.
  • Missing values → incomplete records.

Challenge: How to ensure data quality? → Data cleaning, validation pipelines, data governance
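
A validation pipeline can be as simple as a set of rules applied to every record before it is used. The sketch below is a minimal example under assumed conditions: the field names (`patient_id`, `temperature_c`), the plausible temperature range, and the sample readings are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def validate(record: Record) -> List[str]:
    """Return a list of quality problems found in one record."""
    problems = []
    if not record.get("patient_id"):
        problems.append("missing patient_id")
    temp = record.get("temperature_c")
    if temp is None or temp == "":
        problems.append("missing temperature")
    else:
        try:
            value = float(temp)
            # Assumed plausible range for a body-temperature sensor.
            if not 30.0 <= value <= 45.0:
                problems.append("temperature out of range")
        except ValueError:
            problems.append("temperature not numeric")
    return problems


def split_clean_dirty(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate records that pass every rule from those needing review."""
    clean, dirty = [], []
    for rec in records:
        (dirty if validate(rec) else clean).append(rec)
    return clean, dirty


readings = [
    {"patient_id": "P1", "temperature_c": "36.8"},
    {"patient_id": "P2", "temperature_c": "368"},   # sensor glitch
    {"patient_id": "",   "temperature_c": "37.1"},  # missing id
    {"patient_id": "P4", "temperature_c": None},    # missing value
]
clean, dirty = split_clean_dirty(readings)
print(len(clean), "clean;", len(dirty), "need review")
```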


V5 - Value

Definition: The usefulness or business benefit derived from analyzing the data.

  • Not all big data is valuable. Most raw data has low value density.
  • The goal is to extract high-value insights from low-value-density data.

Value Chain:

Raw Data → Processed Data → Information → Knowledge → Wisdom → Business Value

Example: A billion web clicks have low individual value, but analyzed together they reveal purchasing patterns worth millions in targeted ads.

Challenge: How to extract value efficiently? → ML models, analytics platforms, BI tools


V6 - Variability

Definition: The inconsistency in data, where the same data can mean different things at different times or in different contexts.

  • A word like "bank" can mean a financial institution or a river bank.
  • Sentiment of a tweet depends on context and time.
  • Data formats change over time.

Difference from Variety:

  • Variety = different types of data (text vs video).
  • Variability = the same type of data having inconsistent meaning or format.

Challenge: Context-aware processing, NLP, semantic analysis.


3.3 Summary Table - The 6 V's

V    Name          Question it Answers   Challenge
──   ────          ───────────────────   ─────────
V1   Volume        How much?             Storage, scalability
V2   Velocity      How fast?             Real-time processing
V3   Variety       What types?           Integration, NoSQL
V4   Veracity      How accurate?         Data quality, cleaning
V5   Value         How useful?           Insight extraction
V6   Variability   How consistent?       Context, semantics

Section 4: Dimensions of Scalability

4.1 What is Scalability?

Scalability is the ability of a system to handle growing amounts of work (more data, more users, more requests) by adding resources.

4.2 Two Types of Scaling

Vertical Scaling (Scale Up)

Before:           After:
┌─────────┐       ┌──────────────┐
│ Server  │  ──►  │ Bigger Server│
│ 8 GB RAM│       │ 64 GB RAM    │
│ 4 cores │       │ 32 cores     │
└─────────┘       └──────────────┘
  • Add more CPU, RAM, or disk to a single machine.
  • Simple but has physical limits.
  • Expensive at high end.
  • Single point of failure.

Horizontal Scaling (Scale Out)

Before:          After:
┌─────────┐      ┌─────┐ ┌─────┐ ┌─────┐
│ Server  │ ──►  │ S1  │ │ S2  │ │ S3  │
└─────────┘      └─────┘ └─────┘ └─────┘
  • Add more machines (nodes) to a cluster.
  • Big Data systems use horizontal scaling (commodity hardware).
  • Basis of Hadoop, Spark, NoSQL.

Aspect         Vertical                   Horizontal
──────         ────────                   ──────────
How            Bigger machine             More machines
Cost           High per unit              Low (commodity)
Limit          Physical hardware limit    Virtually unlimited
Failure        Single point of failure    Fault tolerant
Big Data fit   Poor                       Excellent
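
Why does adding machines help? With horizontal scaling, records are spread over the nodes so that each node holds only a share of the data. The sketch below assumes simple hash partitioning with CRC32; the helper names `node_for` and `distribute` and the `user-<n>` keys are hypothetical, and real systems (HDFS, Cassandra, Spark) use their own placement schemes.

```python
import zlib
from collections import Counter
from typing import List


def node_for(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def distribute(keys: List[str], num_nodes: int) -> Counter:
    """Count how many records each node would hold."""
    return Counter(node_for(k, num_nodes) for k in keys)


keys = [f"user-{i}" for i in range(10_000)]
for n in (1, 3, 6):
    load = distribute(keys, n)
    print(n, "nodes -> busiest node holds", max(load.values()), "records")
```

As nodes are added, the busiest node's share shrinks, which is the core idea behind scaling out.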

4.3 Dimensions of Scalability in Big Data

Dimension     What scales?                  Example
─────────     ────────────                  ───────
Data Volume   Amount of stored data         Adding more HDFS nodes
Throughput    Data processed per second     More Spark workers
Query Speed   Response time for queries     Partitioning, indexing
Concurrency   Simultaneous users/jobs       Load balancing
Geographic    Multiple data centers         Cloud regions (AWS)

Section 5: Data Science - Getting Value out of Big Data

PYQ: Write short note on Data Sciences. (2023, 2.5 marks)
PYQ: Describe the steps involved in the Data Science process. How does each step contribute to extracting value from Big Data? (2024, 8 marks)
PYQ: Illustrate with a real-world scenario for steps involving Data Science process. (2024, 7 marks)

5.1 What is Data Science?

Definition:

Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

It combines:

  • Statistics: mathematical analysis
  • Computer Science: programming and algorithms
  • Domain Knowledge: understanding of the specific field

Data Science vs Big Data:

Aspect   Big Data                                        Data Science
──────   ────────                                        ────────────
Focus    Infrastructure to store and process large data  Methods to extract insights from data
Tools    Hadoop, Spark, HDFS                             Python, R, ML, Statistics
Role     Data Engineer                                   Data Scientist
Output   Scalable pipelines                              Actionable insights

5.2 Steps in the Data Science Process

The data science process is a structured workflow for turning raw data into valuable insights.

Overview:

  ┌─────────────────┐
  │ 1. Problem      │
  │    Definition   │
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ 2. Data         │
  │    Collection   │
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ 3. Data         │
  │    Preparation  │
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ 4. Exploratory  │
  │    Analysis     │
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ 5. Model        │
  │    Building     │
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ 6. Evaluation   │
  └────────┬────────┘
           ▼
  ┌─────────────────┐
  │ 7. Deployment   │
  │  & Storytelling │
  └─────────────────┘

5.3 Each Step Explained

Step 1: Problem Definition

  • Clearly define the business question to answer.
  • Example: "Which customers are most likely to churn next month?"
  • Poorly defined problems lead to wasted effort.

Step 2: Data Collection

  • Identify and gather data from relevant sources.
  • Sources: databases, APIs, web scraping, sensors, surveys.
  • Key question: Is there enough data to answer the question?

Step 3: Data Preparation (Wrangling / Cleaning)

  • Real data is messy. This is usually the most time-consuming step (commonly estimated at 60–80% of the effort).
  • Tasks:
    • Handle missing values.
    • Remove duplicates.
    • Fix format inconsistencies.
    • Encode categorical variables.
    • Normalize / scale features.
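
The preparation tasks above can be sketched in plain Python. This is a minimal illustration on a made-up customer table; the column names (`id`, `age`, `plan`), the choice of mean imputation, and min-max scaling are assumptions for the example, and in practice a library such as pandas would normally be used.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]

raw_rows: List[Row] = [
    {"id": 1, "age": 34,   "plan": "basic"},
    {"id": 2, "age": None, "plan": "premium"},  # missing value
    {"id": 1, "age": 34,   "plan": "basic"},    # duplicate of id 1
    {"id": 3, "age": 52,   "plan": "Basic "},   # inconsistent format
]


def prepare(rows: List[Row]) -> List[Row]:
    # 1. Remove duplicates (keep the first row seen for each id).
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))  # copy so the raw data stays untouched

    # 2. Fix format inconsistencies in the categorical column.
    for row in unique:
        row["plan"] = str(row["plan"]).strip().lower()

    # 3. Handle missing values: fill missing ages with the mean known age.
    known = [row["age"] for row in unique if row["age"] is not None]
    mean_age = sum(known) / len(known)
    for row in unique:
        if row["age"] is None:
            row["age"] = mean_age

    # 4. Encode the categorical variable as integer codes.
    codes = {name: i for i, name in enumerate(sorted({r["plan"] for r in unique}))}
    for row in unique:
        row["plan_code"] = codes[row["plan"]]

    # 5. Min-max scale age into the range [0, 1].
    lo, hi = min(r["age"] for r in unique), max(r["age"] for r in unique)
    for row in unique:
        row["age_scaled"] = (row["age"] - lo) / (hi - lo) if hi > lo else 0.0
    return unique


cleaned = prepare(raw_rows)
for row in cleaned:
    print(row)
```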

Step 4: Exploratory Data Analysis (EDA)

  • Understand the data before building models.
  • Tasks:
    • Summary statistics (mean, median, std).
    • Data visualization (histograms, scatter plots, heatmaps).
    • Find correlations and outliers.
    • Choose appropriate ML techniques.
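
A first pass at EDA needs little more than the standard library. The sketch below computes summary statistics, a Pearson correlation, and IQR-based outliers; the usage and bill figures are invented, and the 1.5 × IQR fence is a common convention rather than a fixed rule.

```python
import statistics
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient computed from its definition."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def outliers(values: List[float], k: float = 1.5) -> List[float]:
    """Flag values outside the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]


# Hypothetical sample: monthly usage hours and monthly bill for 8 customers.
usage = [10, 12, 15, 18, 20, 22, 25, 90]
bill = [200, 230, 290, 350, 390, 430, 480, 1750]

print("mean:", statistics.mean(usage), "median:", statistics.median(usage))
print("std:", round(statistics.stdev(usage), 2))
print("correlation:", round(pearson(usage, bill), 3))
print("outliers:", outliers(usage))
```

Here the single very large value pulls the mean well above the median, which is exactly the kind of pattern EDA is meant to surface before modeling.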

Step 5: Model Building

  • Apply ML or statistical algorithms to the prepared data.
  • Split data: training set + test set.
  • Try multiple models, tune hyperparameters.
  • Example: Decision tree, regression, clustering.

Step 6: Evaluation

  • Measure how well the model performs on unseen data.
  • Metrics: Accuracy, F1-score, RMSE, AUC (covered in ML Unit 4).
  • If performance is poor → go back to Step 3 or 5.
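
To make the split and the metrics concrete, here is a small sketch written from the standard definitions. The helper names, the 25% test ratio, and the label vectors are assumptions for illustration; libraries such as scikit-learn provide equivalent, better-tested functions.

```python
import random
from typing import Dict, List, Sequence, Tuple


def train_test_split(rows: Sequence, test_ratio: float = 0.25,
                     seed: int = 42) -> Tuple[list, list]:
    """Shuffle reproducibly and hold out a fraction as unseen test data."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def classification_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Root mean squared error for regression predictions."""
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)) ** 0.5


train, test = train_test_split(range(8))
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical actual churn labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions
print(len(train), "training rows,", len(test), "test rows")
print(classification_metrics(y_true, y_pred))
```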

Step 7: Deployment and Communication

  • Deploy the model in a production system (web API, dashboard, app).
  • Communicate findings to stakeholders using visualizations and storytelling.
  • Monitor model over time for degradation.

5.4 Real-World Scenario - Data Science Process Across Industries

PYQ: Illustrate with a real-world scenario for steps involving Data Science process. (2024, 7 marks)

The same 7-step process applies across industries; only the domain context changes. Below is a parallel walk-through for Healthcare, Finance, and Retail.

1. Problem Definition
  • Healthcare: Predict patient readmissions within 30 days.
  • Finance: Detect fraudulent transactions / credit defaults.
  • Retail: Increase sales by understanding customer patterns.

2. Data Collection
  • Healthcare: Electronic Health Records (EHR), lab results, medical images (X-Ray, MRI).
  • Finance: Bank transaction logs, credit history, account activity.
  • Retail: Customer purchase history, clickstream, loyalty card data.

3. Data Cleaning
  • Healthcare: Clean patient records (fill missing vitals, standardize ICD codes).
  • Finance: Standardize transaction formats across branches; remove duplicates.
  • Retail: Deduplicate customer profiles, merge across channels.

4. EDA
  • Healthcare: Identify disease outbreak patterns, age-disease correlations.
  • Finance: Anomaly detection in spending behavior, peak fraud windows.
  • Retail: Discover customer preferences, basket affinities.

5. Modeling
  • Healthcare: Classifier to flag high-risk readmission patients.
  • Finance: ML model for real-time fraud detection (Random Forest, XGBoost).
  • Retail: Churn prediction & recommendation engine.

6. Evaluation
  • Healthcare: Validate diagnosis model accuracy (AUC, sensitivity) on test patients.
  • Finance: Backtest fraud-detection precision/recall on past cases.
  • Retail: Cross-validate churn predictions vs actual customer drop.

7. Deployment
  • Healthcare: Embed model into hospital admission dashboard for doctors.
  • Finance: Plug fraud-scoring API into live transaction stream.
  • Retail: Deploy churn model into CRM; trigger retention offers.

Key Insight: The process is domain-agnostic; the value comes from translating domain expertise into the right problem definition and feature engineering at Steps 1–3.


Section 6: Foundations for Big Data Systems and Programming

PYQ: Write short note on foundation for Big Data system. (2023, 15 marks)
PYQ: What is Big Data Platform? Describe the main features of a big data platform in detail. (2022, 15 marks)
PYQ: Explain YARN. (2022, 2.5 marks)
PYQ: Write short note on Hadoop. (2024, 7.5 marks)

6.1 What Makes a Big Data System?

A Big Data system must handle the 6 V's. The foundation consists of:

┌─────────────────────────────────────┐
│        Big Data System Stack        │
├─────────────────────────────────────┤
│  Analytics / ML Layer               │  (Spark MLlib, Hive, Pig)
├─────────────────────────────────────┤
│  Processing Layer                   │  (MapReduce, Spark, Flink)
├─────────────────────────────────────┤
│  Storage Layer                      │  (HDFS, S3, HBase, Cassandra)
├─────────────────────────────────────┤
│  Resource Management                │  (YARN, Kubernetes)
├─────────────────────────────────────┤
│  Hardware Layer                     │  (Commodity servers, cloud)
└─────────────────────────────────────┘

6.2 Key Concepts

Cluster Computing

A cluster is a group of connected computers (nodes) that work together as one system.

       ┌──────────────────────┐
       │     Master Node      │  ← manages and coordinates
       └──────────┬───────────┘
        ┌─────────┼─────────┐
        ▼         ▼         ▼
   ┌────────┐ ┌────────┐ ┌────────┐
   │Worker 1│ │Worker 2│ │Worker 3│  ← do the actual work
   └────────┘ └────────┘ └────────┘
  • Each worker stores and processes a portion of the data.
  • Master node coordinates task assignment.

MapReduce Programming Model

MapReduce is the fundamental programming model for processing Big Data in parallel across a cluster.

Two phases:

MAP Phase:
  Input Data → Split into chunks → Each chunk processed in parallel
  → Produces key-value pairs (intermediate output)

REDUCE Phase:
  Values are grouped by key (the shuffle and sort step) → Aggregate / combine values
  → Final output

Example - Word Count:

Input: "cat dog cat bird dog cat"

MAP output (key-value pairs):
  (cat,1) (dog,1) (cat,1) (bird,1) (dog,1) (cat,1)

REDUCE (group by key, sum values):
  cat  → 3
  dog  → 2
  bird → 1

Final Output:
  cat:3, dog:2, bird:1
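
The word count above can be reproduced in a single Python process to show the data flow. This is a sketch of the programming model only, not of Hadoop's implementation: the function names are illustrative, and a real job runs the map and reduce tasks on different machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input chunk."""
    return [(word, 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all intermediate values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


chunks = ["cat dog cat", "bird dog cat"]   # the input split into two chunks
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(intermediate))
print(counts)
```

Because each chunk is mapped independently, the map calls could run on separate workers; only the grouping step needs to bring values for the same key together.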

Data Replication

  • Each piece of data is stored on multiple nodes (default: 3 copies in HDFS).
  • If one node fails, data is still available from another.
  • Provides fault tolerance.

  Data Block A
  ┌─────────────────────────────────────┐
  │  Copy 1 → Node 1                    │
  │  Copy 2 → Node 3    (different rack)│
  │  Copy 3 → Node 5                    │
  └─────────────────────────────────────┘

6.3 Big Data Programming Tools

Tool           Type                Purpose
────           ────                ───────
Hadoop         Framework           Distributed storage + MapReduce processing
Apache Spark   Processing engine   Fast in-memory distributed processing
Apache Kafka   Messaging           Real-time data streaming
Apache Hive    Query tool          SQL-like queries on HDFS data
Apache HBase   NoSQL DB            Real-time read/write on HDFS
Apache Pig     Scripting           Data transformation using Pig Latin
Apache Flink   Stream processing   Low-latency stream analytics
YARN           Resource manager    Cluster resource allocation

6.4 YARN - Yet Another Resource Negotiator

PYQ: Explain YARN. (2022, 2.5 marks)

Definition:

YARN (Yet Another Resource Negotiator) is a cluster resource manager in Hadoop. It was created by separating the processing engine from the resource-management function of classic MapReduce, and was introduced in Hadoop 2.0 to remove the Job Tracker bottleneck of Hadoop 1.x.

In short: YARN lets multiple data-processing engines (MapReduce, Spark, Flink, Tez) share the same Hadoop cluster.

YARN Architecture - 4 Components:

   ┌──────────┐     submits job      ┌────────────────────┐
   │  Client  │ ───────────────────► │  Resource Manager  │
   └──────────┘                      │  (Cluster Master)  │
                                     └─────────┬──────────┘
                              allocates        │
                              containers       ▼
                              ┌────────────────────────────┐
                              │   Node Manager (per node)  │
                              │   ┌──────────────────────┐ │
                              │   │ MR App Master + Tasks│ │
                              │   └──────────────────────┘ │
                              └────────────────────────────┘

Component and Role:

  • Client: Submits MapReduce (or Spark/other) jobs to the cluster.
  • Resource Manager (RM): Central master; manages and allocates resources (CPU, memory) across the entire cluster.
  • Node Manager (NM): Runs on each worker machine; launches, monitors, and reports the status of containers.
  • MapReduce Application Master (AM): One per job; negotiates resources with the RM and supervises the tasks running the MapReduce job.

Features of YARN:

  • Resource Management: central allocation of CPU and memory.
  • Scalability: supports thousands of nodes; removes the old Job Tracker bottleneck.
  • Cluster Utilization: multiple frameworks (MR, Spark, Tez) share resources.
  • Flexibility: not tied to MapReduce only; any distributed app can plug in.

Section 7: Distributed File Systems

PYQ: Write short note on DFS (Distributed File System). (2023, 2.5 marks)
PYQ: Define HDFS. Describe NameNode, DataNode and Block. Explain HDFS operations in detail. (2022, 15 marks)
PYQ: What is HDFS? Explain its components. (2023, 15 marks)

7.1 What is a Distributed File System?

Definition:

A Distributed File System (DFS) is a file system that stores data across multiple machines in a network, but makes it look like a single unified file system to the user.

Why needed?

  • A single disk cannot hold petabytes of data.
  • A distributed file system spreads data across hundreds or thousands of nodes.
  • Provides scalability, fault tolerance, and high throughput.

7.2 HDFS β€” Hadoop Distributed File System

HDFS is one of the most widely used distributed file systems for Big Data, and the storage backbone of the Hadoop ecosystem.

Key Design Goals:

  1. Store very large files (GB to TB per file).
  2. Run on commodity hardware (cheap, standard servers).
  3. Detect and recover from hardware failures automatically.
  4. Optimized for batch processing (high throughput over low latency).

7.3 HDFS Architecture

          ┌─────────────────────────────┐
          │        NameNode             │  ← Master
          │  (metadata: file locations) │
          └─────────────┬───────────────┘
                        │
          ┌─────────────┼──────────────┐
          ▼             ▼              ▼
    ┌──────────┐  ┌──────────┐  ┌──────────┐
    │DataNode 1│  │DataNode 2│  │DataNode 3│  ← Workers
    │ Block A  │  │ Block B  │  │ Block A  │
    │ Block C  │  │ Block A  │  │ Block B  │
    └──────────┘  └──────────┘  └──────────┘

Two Types of Nodes:

  • NameNode (Master): Stores metadata, i.e. which blocks are in which DataNode. Does NOT store actual data.
  • DataNode (Worker): Stores actual data blocks. Reports heartbeat to NameNode.

7.4 How HDFS Works

Writing a File:

1. Client contacts NameNode → "I want to write file.txt"
2. NameNode assigns block locations across DataNodes.
3. Client writes data block-by-block to the first DataNode assigned to each block.
4. Each block is forwarded through a pipeline of DataNodes until 3 copies exist (default replication factor = 3).
5. DataNodes send confirmation (acknowledgements) back to the client.
6. NameNode updates metadata.

Reading a File:

1. Client contacts NameNode → "I want to read file.txt"
2. NameNode returns list of DataNodes holding each block.
3. Client reads blocks directly from DataNodes (parallel).
4. Blocks are assembled into the complete file.
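
The write and read paths can be mimicked with a toy in-memory model. Everything here is a simplified assumption: `ToyNameNode`, `write_file`, and `read_file` are hypothetical names, placement is plain round-robin rather than HDFS's rack-aware policy, and the tiny 8-byte block size is chosen only to keep the example readable.

```python
from typing import Dict, List, Tuple

Storage = Dict[str, Dict[Tuple[str, int], bytes]]


class ToyNameNode:
    """Keeps metadata only: which DataNodes hold each block of each file."""

    def __init__(self, datanode_ids: List[str], replication: int = 3):
        # Assumes at least `replication` DataNodes so copies land on distinct nodes.
        self.datanode_ids = datanode_ids
        self.replication = replication
        self.block_map: Dict[str, List[List[str]]] = {}

    def allocate(self, filename: str, num_blocks: int) -> List[List[str]]:
        n = len(self.datanode_ids)
        # Round-robin placement (a simplification of the real policy).
        plan = [[self.datanode_ids[(b + r) % n] for r in range(self.replication)]
                for b in range(num_blocks)]
        self.block_map[filename] = plan
        return plan

    def locate(self, filename: str) -> List[List[str]]:
        return self.block_map[filename]


def write_file(nn: ToyNameNode, storage: Storage, filename: str,
               data: bytes, block_size: int) -> None:
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    plan = nn.allocate(filename, len(blocks))
    for index, (block, targets) in enumerate(zip(blocks, plan)):
        for node in targets:                       # store every replica
            storage[node][(filename, index)] = block


def read_file(nn: ToyNameNode, storage: Storage, filename: str) -> bytes:
    parts = []
    for index, holders in enumerate(nn.locate(filename)):
        # Use the first DataNode that still has this block.
        node = next(n for n in holders if (filename, index) in storage[n])
        parts.append(storage[node][(filename, index)])
    return b"".join(parts)


nodes = ["dn1", "dn2", "dn3", "dn4"]
storage: Storage = {n: {} for n in nodes}
nn = ToyNameNode(nodes)
write_file(nn, storage, "file.txt", b"hello distributed world", block_size=8)
storage["dn1"].clear()                             # simulate one DataNode failing
print(read_file(nn, storage, "file.txt"))
```

Note that the NameNode object never touches the file contents; it only answers "where are the blocks?", which mirrors the division of roles described above.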

7.5 HDFS Key Concepts

Block Size

  • HDFS divides files into large fixed-size blocks (default: 128 MB in Hadoop 2 and later; 64 MB in Hadoop 1).
  • A 1 GB file → 8 blocks of 128 MB each. The last block of a file may be smaller than the block size.
  • Large blocks reduce NameNode metadata overhead.

  file.txt (512 MB)
  ┌──────┬──────┬──────┬──────┐
  │Block1│Block2│Block3│Block4│  each = 128 MB
  └──────┴──────┴──────┴──────┘
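
The block arithmetic can be checked with a few lines of Python. This is a minimal sketch; the helper name `block_layout` is made up for this note, and it assumes 1 MB = 1024 × 1024 bytes.

```python
MB = 1024 * 1024
BLOCK_SIZE = 128 * MB  # default block size quoted in these notes


def block_layout(file_size_bytes, block_size=BLOCK_SIZE):
    """Return (number_of_blocks, size_of_last_block_in_bytes)."""
    if file_size_bytes == 0:
        return 0, 0
    n_blocks = (file_size_bytes + block_size - 1) // block_size  # ceiling division
    last = file_size_bytes - (n_blocks - 1) * block_size
    return n_blocks, last


print(block_layout(1024 * MB))  # the 1 GB example
print(block_layout(512 * MB))   # the file.txt example
print(block_layout(300 * MB))   # last block is only partly filled
```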

Replication

  • Each block is copied to 3 DataNodes (default replication factor = 3).
  • Strategy (default rack-aware policy): 1st copy on the writer's node (local rack), 2nd and 3rd copies on two different nodes of one remote rack.
  • Protects against both node failure and rack failure.
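
A sketch of the default rack-aware placement as described in the Hadoop documentation: first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack. The function name `place_replicas` and the three-rack `topology` are hypothetical.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick 3 DataNodes: the writer's node, then two distinct nodes on one other rack.

    topology maps rack name -> list of DataNode names (hypothetical layout).
    """
    local_rack = next(r for r, nodes in topology.items() if writer in nodes)
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    remote = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote], 2)
    return [writer, second, third]


topology = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}
replicas = place_replicas("dn1", topology, random.Random(0))
print(replicas)
```

With this layout a block survives the loss of any single node and of any single rack, while only one copy has to cross racks during the write.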

Fault Tolerance

  • DataNodes send heartbeat signals every 3 seconds to NameNode.
  • If the NameNode receives no heartbeat from a DataNode for a timeout period (roughly 10 minutes by default), it marks that DataNode as dead.
  • NameNode automatically re-replicates the lost blocks from remaining copies.
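
The heartbeat check and re-replication decision can be sketched as below. The function names and sample data are invented for illustration; the timeout formula (2 × recheck interval + 10 × heartbeat interval) and the 300-second recheck interval reflect Hadoop's defaults as far as I know, so treat them as assumptions and check your cluster's configuration.

```python
HEARTBEAT_INTERVAL = 3                                 # seconds, as stated above
RECHECK_INTERVAL = 300                                 # seconds, assumed default
DEAD_TIMEOUT = 2 * RECHECK_INTERVAL + 10 * HEARTBEAT_INTERVAL  # 630 s


def find_dead_nodes(last_heartbeat, now, timeout=DEAD_TIMEOUT):
    """Return DataNodes whose last heartbeat is older than the timeout."""
    return sorted(dn for dn, t in last_heartbeat.items() if now - t > timeout)


def blocks_to_rereplicate(block_map, dead, replication=3):
    """Return {block: missing_copies} for blocks that lost replicas."""
    todo = {}
    for block, holders in block_map.items():
        alive = [dn for dn in holders if dn not in dead]
        if len(alive) < replication:
            todo[block] = replication - len(alive)
    return todo


# Timestamps (in seconds) of the last heartbeat seen from each DataNode.
last_heartbeat = {"dn1": 1000, "dn2": 998, "dn3": 200, "dn4": 999}
block_map = {"A": ["dn1", "dn2", "dn3"], "B": ["dn1", "dn2", "dn4"]}

dead = find_dead_nodes(last_heartbeat, now=1001)
print(dead, blocks_to_rereplicate(block_map, dead))
```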

7.6 HDFS Features and Limitations

Features:

✅ Handles files of size GB to PB.
✅ Runs on cheap commodity hardware.
✅ Automatic fault tolerance through replication.
✅ High throughput for batch reads.
✅ Scales horizontally: just add more DataNodes.

Limitations:

❌ High latency: not suited for real-time (online) applications.
❌ Not good for small files (wastes NameNode metadata space).
❌ Does not support random writes: files are write-once, read-many.
❌ NameNode is a single point of failure (addressed by HA NameNode in Hadoop 2+).


7.7 Other Distributed File Systems

DFS        | By            | Key Features
HDFS       | Apache/Hadoop | Open source, MapReduce native
Amazon S3  | AWS           | Object storage, highly available
Google GFS | Google        | Inspired HDFS design
Azure Blob | Microsoft     | Cloud object storage
Ceph       | Open source   | Block + object + file storage
GlusterFS  | Red Hat       | Network-attached storage

7.8 HDFS Operations & Commands

PYQ: Define HDFS. Describe NameNode, DataNode and Block. Explain HDFS operations in detail. (2022, 15 marks)

HDFS exposes a shell-like command-line interface (hadoop fs / hdfs dfs) for users to interact with the file system. The five main operations every student must know are: Starting HDFS, Listing files, Inserting data, Retrieving data, and Shutting down.

1. Starting HDFS

Before using HDFS, the file system must be formatted (only the first time) and the daemons (NameNode + DataNodes) must be started.

# First-time only: format the NameNode
# (on Hadoop 2 and later the preferred form is: hdfs namenode -format)
$ hadoop namenode -format

# Start all HDFS daemons (NameNode, DataNodes, Secondary NN)
$ start-dfs.sh

2. Listing Files in HDFS

Once HDFS is running, use the -ls command to list the contents of a directory.

$ $HADOOP_HOME/bin/hadoop fs -ls <args>

Example:

$ hadoop fs -ls /user/input

3. Inserting Data into HDFS

Loading a local file into HDFS is a 3-step process:

# Step 1: Create an input directory in HDFS (-p also creates missing parent directories)
$ hadoop fs -mkdir -p /user/input

# Step 2: Transfer the local file to HDFS
$ hadoop fs -put /home/file.txt /user/input

# Step 3: Verify that the file is uploaded
$ hadoop fs -ls /user/input

4. Retrieving Data from HDFS

Data can be viewed in place or copied back to the local file system.

# View file contents directly from HDFS
$ hadoop fs -cat /user/output/outfile

# Copy a file/directory from HDFS to local file system
$ hadoop fs -get /user/output/ /home/hadoop_tp/

5. Shutting Down HDFS

$ stop-dfs.sh

This stops all HDFS daemons cleanly.

Operation Summary:

Operation      | Key Command
Start HDFS     | start-dfs.sh
List files     | hadoop fs -ls
Make directory | hadoop fs -mkdir
Upload file    | hadoop fs -put
View file      | hadoop fs -cat
Download file  | hadoop fs -get
Stop HDFS      | stop-dfs.sh
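
When these commands have to be scripted, one option is to build the argument lists in Python and hand them to `subprocess`. The sketch below only builds and prints the commands, so it runs without a Hadoop installation; `hdfs_cmd` is a hypothetical helper, not part of Hadoop.

```python
import shlex


def hdfs_cmd(action, *args):
    """Build the argv list for `hadoop fs -<action> <args...>` without running it."""
    allowed = {"ls", "mkdir", "put", "cat", "get"}
    if action not in allowed:
        raise ValueError(f"unsupported action: {action}")
    return ["hadoop", "fs", f"-{action}", *args]


workflow = [
    hdfs_cmd("mkdir", "/user/input"),
    hdfs_cmd("put", "/home/file.txt", "/user/input"),
    hdfs_cmd("ls", "/user/input"),
    hdfs_cmd("cat", "/user/output/outfile"),
]
for argv in workflow:
    # On a real cluster: subprocess.run(argv, check=True)
    print(shlex.join(argv))
```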

Section 8: Big Data Analytics Techniques

PYQ: Define the different techniques in Big Data analytics. (2022, 15 marks)

Big Data analytics is not a single method; it is a toolbox of techniques drawn from statistics, machine learning, AI, and database research. The six most important techniques are described below.

8.1 Six Core Techniques

1. A/B Testing

  • Definition: A controlled experiment comparing a control group (existing version) against one or more test groups (new variants) to determine which one improves a defined objective (e.g., click-through rate, conversions).
  • Example: A retail website shows version A of a product page to 50% of visitors and version B to the other 50%. The variant with the higher purchase rate wins.
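
"Higher purchase rate wins" should be backed by a significance check, otherwise random noise can pick the winner. One common choice (an assumption here, the notes do not prescribe a test) is the two-proportion z-test, sketched below with made-up visitor counts.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p) for H0: both variants convert at the same rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Made-up numbers: 5,000 visitors per variant.
z, p = two_proportion_z_test(conv_a=400, n_a=5000, conv_b=460, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (conventionally below 0.05) suggests the difference between A and B is unlikely to be chance alone.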

2. Data Fusion and Data Integration

  • Definition: The process of combining and analyzing data from multiple sources so that insights are richer than any single source could deliver.
  • Example: Combining smartphone GPS data, weather feeds, and social-media check-ins to map real-time traffic congestion.
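
A minimal sketch of the traffic example: three sources are joined on a shared key. The road-segment ids, field names, and the 20 km/h threshold are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical records keyed by road segment.
gps_speed = [("seg1", 12.0), ("seg1", 8.0), ("seg2", 55.0)]  # km/h per phone ping
weather = {"seg1": "rain", "seg2": "clear"}
checkins = {"seg1": 40, "seg2": 3}                           # social check-ins nearby


def fuse(gps_speed, weather, checkins, slow_kmh=20.0):
    """Join the three sources on segment id and flag likely congestion."""
    speeds = defaultdict(list)
    for seg, kmh in gps_speed:
        speeds[seg].append(kmh)
    fused = {}
    for seg, values in speeds.items():
        avg = sum(values) / len(values)
        fused[seg] = {
            "avg_kmh": avg,
            "weather": weather.get(seg, "unknown"),
            "checkins": checkins.get(seg, 0),
            "congested": avg < slow_kmh,
        }
    return fused


traffic = fuse(gps_speed, weather, checkins)
print(traffic["seg1"])
```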

3. Data Mining

  • Definition: Extracting hidden patterns and useful information from large datasets using a combination of statistics, machine learning, and database management.
  • Example: Customer segmentation, i.e. clustering supermarket shoppers into groups (budget, premium, bulk-buyers) based on purchase histories.
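
Segmentation of this kind is often done with k-means clustering. The sketch below is a plain k-means on a single feature (monthly spend); the data and the starting centroids are made up, and real segmentation would use more features and a library implementation.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on one numeric feature; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
                  for v in values]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return centroids, labels


monthly_spend = [20, 25, 30, 200, 220, 210, 900, 950]  # made-up shopper data
centroids, labels = kmeans_1d(monthly_spend, centroids=[0, 300, 1000])
print(centroids, labels)
```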

4. Machine Learning

  • Definition: A family of algorithms that learn patterns from historical data and use those patterns to make predictions on new, unseen data. ML provides predictions that would be impossible (or too slow) for human analysts.
  • Example: Netflix recommending movies, Gmail filtering spam, banks scoring credit-card fraud risk.
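
The "learn from data, then predict on unseen data" loop can be shown with the simplest possible model, a least-squares line. This is a sketch with made-up numbers; the recommenders and fraud models mentioned above are far more complex.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up training data: ad spend vs. sales (both in thousands).
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]               # exactly y = 2x + 1, to keep the check simple
slope, intercept = fit_line(spend, sales)
prediction = slope * 6 + intercept     # prediction for an input not seen in training
print(slope, intercept, prediction)
```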

5. Natural Language Processing (NLP)

  • Definition: A sub-speciality of computer science, AI, and linguistics that analyzes, understands, and generates human language (text and speech).
  • Example: Sentiment analysis of tweets, chatbots, Google Translate, voice assistants (Alexa, Siri).
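
A very small lexicon-based sentiment scorer shows the idea behind the tweet example. The word lists are hand-made assumptions; production systems rely on much larger lexicons or trained models.

```python
import re

# Tiny hand-made lexicon, for illustration only.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "poor", "slow"}


def sentiment(text):
    """Return 'positive', 'negative' or 'neutral' from simple word counts."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude tokenisation
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tweets = [
    "I love this phone, great camera!",
    "Terrible battery and slow updates",
    "It arrived today",
]
print([sentiment(t) for t in tweets])
```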

6. Statistics

  • Definition: The classical science of collecting, organizing, analyzing, interpreting, and presenting data from surveys and experiments.
  • Example: Computing the mean customer lifetime value, hypothesis testing for marketing campaigns, regression analysis.

8.2 Other Techniques (briefly)

  • Spatial Analysis: analyzing geographic / location-based data (GIS, maps).
  • Predictive Modelling: building statistical / ML models to forecast future outcomes.
  • Association Rule Learning: finding "if-then" relationships ("people who buy bread also buy butter").
  • Network Analysis: studying nodes and edges in graphs (social networks, fraud rings).
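
For association rule learning, the two standard measures are support (how often the items occur together) and confidence (how often the "then" part holds when the "if" part does). A sketch for the bread-and-butter rule, using made-up baskets:

```python
baskets = [                      # made-up transactions
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


print(support({"bread", "butter"}, baskets))       # how common the pair is
print(confidence({"bread"}, {"butter"}, baskets))  # strength of bread -> butter
```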

8.3 Summary Table

# | Technique                 | Core Idea
1 | A/B Testing               | Control vs Test comparison
2 | Data Fusion / Integration | Combine multiple sources
3 | Data Mining               | Extract patterns from large data
4 | Machine Learning          | Algorithms that learn from data
5 | NLP                       | Understand human language
6 | Statistics                | Collect & interpret data

Section 9: Big Data Platform: Main Features

PYQ: What is Big Data Platform? Describe the main features of a big data platform in detail. (2022, 15 marks)

9.1 What is a Big Data Platform?

A Big Data Platform is an integrated computing solution that combines software, tools, and hardware to manage the full Big Data lifecycle (ingestion, storage, processing, analysis, and visualization) at petabyte scale.

Examples: Cloudera CDP, Hortonworks HDP, AWS EMR, Google BigQuery, Azure HDInsight.

9.2 Five Main Features (Characteristics)

   ┌─────────────────────────────────────────┐
   │       Big Data Platform: 5 Features     │
   ├─────────────────────────────────────────┤
   │  1. Comprehensive                       │
   │  2. Integrated                          │
   │  3. Scalability                         │
   │  4. Extensible                          │
   │  5. Minimum Maintenance                 │
   └─────────────────────────────────────────┘

1. Comprehensive

  • Handles all the Big Data challenges (Volume, Variety, and Velocity) within a single unified stack.
  • Supports structured, semi-structured, and unstructured data, plus batch and stream pipelines.

2. Integrated

  • Works alongside existing information architecture: traditional databases, data warehouses, and BI tools (Teradata, Tableau, Power BI, Informatica).
  • Avoids data silos by exposing Big Data results back to enterprise reporting layers.

3. Scalability

  • Designed to scale from gigabytes to petabytes without redesign; this is the role Hadoop plays.
  • Storage must be cost-efficient (commodity disks, object storage) so scaling does not break budgets.

4. Extensible

  • Easy to accommodate newer technologies as the company expands: new processing engines (Spark, Flink), new sources, new ML libraries.
  • Naturally integrates IoT and sensor data into the same platform.

5. Minimum Maintenance

  • Fault-tolerant design: failure of an individual node does not stop the platform.
  • Individual hardware can be upgraded or replaced without downtime.
  • Self-healing replication and automatic re-balancing.

9.3 Summary Table

Feature             | Why it matters
Comprehensive       | Solves V-V-V challenges in one stack
Integrated          | Plays well with existing DW / BI
Scalability         | Petabyte growth without redesign
Extensible          | Accepts new tech (IoT, ML, streaming)
Minimum Maintenance | Fault tolerant, hot-swappable hardware

Section 10: Types of Data (Measurement Scales)

PYQ: Discuss the following in detail: (i) Challenges in big data (ii) Types of Data. (2022, 15 marks)

10.1 How Data is Classified

Data is classified into measurement scales based on three characteristics:

CharacteristicQuestion it answers
OrderDoes the order of observations matter?
DistanceAre the distances (intervals) between observations meaningful?
True ZeroIs there a real, absolute zero point?

Based on these three properties, data falls into four measurement scales: Nominal, Ordinal, Interval, and Ratio.

   Nominal  →  Ordinal  →  Interval  →  Ratio
   (labels)   (rank)      (+ equal     (+ true zero
                            spacing)      → ratios)

10.2 The Four Measurement Scales

1. Nominal Scale

  • Definition: Categories or labels with no inherent order between them.
  • Examples: Gender (Male/Female/Other), ethnicity, product type, blood group.
  • Properties:
    • Order: Doesn't matter.
    • Distance: Not held (cannot subtract one category from another).
    • True Zero: None.
  • Allowed operations: Counting, mode, chi-square tests.

2. Ordinal Scale

  • Definition: Categories with a natural order or ranking, but the intervals between ranks are unequal / unknown.
  • Examples: Customer ratings (Poor / Fair / Good / Excellent), survey responses (Strongly Disagree → Strongly Agree), military ranks, education level.
  • Properties:
    • Order: Matters.
    • Distance: Not held (the gap between "Good" and "Excellent" is not necessarily equal to the gap between "Poor" and "Fair").
    • True Zero: None.
  • Allowed operations: Median, percentiles, rank correlation.

3. Interval Scale

  • Definition: Numerical data with consistent (equal) intervals between values, but no true zero point.
  • Examples: Temperature in Celsius or Fahrenheit, calendar years, IQ scores.
  • Properties:
    • Order: Matters.
    • Distance: Held. The gap between 20°C and 30°C equals the gap between 30°C and 40°C.
    • True Zero: None. 0°C does not mean "no temperature"; the zero point is an arbitrary reference (0°C and 0°F are different temperatures).
  • Allowed operations: Addition, subtraction, mean, standard deviation. Ratios are NOT meaningful (40°C is not "twice as hot" as 20°C).

4. Ratio Scale

  • Definition: Numerical data with equal intervals AND a true zero point, making ratios meaningful.
  • Examples: Weight, height, age, income, distance, time elapsed, temperature in Kelvin.
  • Properties:
    • Order: Matters.
    • Distance: Held.
    • True Zero: Present. 0 kg means no weight; ₹0 means no income.
  • Allowed operations: All arithmetic, including ratios ("60 kg is twice 30 kg").

10.3 Summary Comparison Table

Scale    | Order | Distance | True Zero | Example             | Key Stat
Nominal  | ✘     | ✘        | ✘         | Gender, blood group | Mode, count
Ordinal  | ✔     | ✘        | ✘         | Ratings, ranks      | Median
Interval | ✔     | ✔        | ✘         | Celsius, IQ         | Mean, SD
Ratio    | ✔     | ✔        | ✔         | Weight, income      | All operations

Rule of thumb: Each scale includes the properties of the previous one. Ratio is the most powerful, Nominal the most limited.
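
The "each scale includes the previous one" rule can be written down as nested sets of allowed statistics. The lookup table and function names below are illustrative only; the temperature lines show why ratios are not meaningful on an interval scale: the answer changes when the unit changes.

```python
ALLOWED_STATS = {
    "nominal":  {"count", "mode"},
    "ordinal":  {"count", "mode", "median"},
    "interval": {"count", "mode", "median", "mean", "difference"},
    "ratio":    {"count", "mode", "median", "mean", "difference", "ratio"},
}


def can_compute(scale, stat):
    """True if `stat` is meaningful for data measured on `scale`."""
    return stat in ALLOWED_STATS[scale]


def celsius_to_kelvin(c):
    return c + 273.15


print(40 / 20)                                        # "ratio" of Celsius readings
print(celsius_to_kelvin(40) / celsius_to_kelvin(20))  # same temperatures in Kelvin (true zero)
print(can_compute("ordinal", "mean"), can_compute("ratio", "ratio"))
```

The two temperature ratios differ, so "40°C is twice 20°C" is an artefact of the unit, while the Kelvin ratio has physical meaning.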


Section 11: Hadoop: Detailed Note

PYQ: Write short note on Hadoop. (2024, 7.5 marks)

11.1 Definition

Hadoop is an open-source Java framework developed by the Apache Software Foundation for storing massive amounts of data and processing it in parallel across clusters of commodity hardware. It is the de-facto standard for Big Data infrastructure.

Designed to be:

  • Distributed: runs on hundreds/thousands of nodes.
  • Fault-tolerant: survives hardware failures.
  • Scalable: add more nodes to handle more data.
  • Cost-effective: uses cheap commodity servers.

11.2 Three Core Components

        ┌────────────────────────────────────┐
        │               HADOOP               │
        ├───────────┬────────────┬───────────┤
        │   HDFS    │ MapReduce  │   YARN    │
        │ (Storage) │(Processing)│(Resource) │
        └───────────┴────────────┴───────────┘

1. HDFS (Hadoop Distributed File System): Storage

  • Distributed Storage: splits files into blocks and stores them across many DataNodes.
  • Scalability: add DataNodes to grow capacity linearly.
  • Data Replication: each block is replicated (default = 3 copies) for fault tolerance.
  • Data Compression: supports codecs (Snappy, Gzip, LZO) to reduce storage and I/O.

2. MapReduce: Processing

A parallel programming model that processes data in three phases:

  Map  →  Shuffle / Sort  →  Reduce
  ───────────────────────────────────
  Map:     Split data and emit (key, value) pairs
  Shuffle: Group all values by key across nodes
  Reduce:  Aggregate / combine values for final output
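
The three phases can be imitated on a single machine with the classic word-count example. This is a sketch of the programming model only; the function names are made up and nothing here uses Hadoop's Java API or runs in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle/Sort: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


lines = ["big data is big", "data is everywhere"]   # stand-in for input splits
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(word_counts)
```

On a cluster, each input split would be mapped on a different node and each key group reduced on a different node; the shuffle is the step that moves data between them.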

3. YARN (Yet Another Resource Negotiator): Resource Management

  • Resource Management: central CPU and memory allocation across the cluster.
  • Scalability: supports thousands of nodes (removes the Hadoop 1 JobTracker bottleneck).
  • Cluster Utilization: multiple engines (MR, Spark, Tez, Flink) share the same cluster.
  • Flexibility: supports varied workloads, not just MapReduce.

11.3 Hadoop in One Picture

  ┌──────────────────────────────────────────────────┐
  │                   HADOOP CLUSTER                 │
  │                                                  │
  │   Storage  ─►  HDFS  (blocks + 3x replication)   │
  │   Compute  ─►  MapReduce / Spark / Tez           │
  │   Manage   ─►  YARN (RM + NM + AM)               │
  └──────────────────────────────────────────────────┘

Quick Revision Points

Big Data: Why & Where

  • Data explosion from social media, IoT, digitization, e-commerce.
  • Sources: social, sensor, transactional, web, scientific.

6 V's:

V  | Name        | Key Point
V1 | Volume      | Petabytes of data
V2 | Velocity    | Real-time streaming
V3 | Variety     | Structured / semi / unstructured
V4 | Veracity    | Data quality and accuracy
V5 | Value       | Business insight extraction
V6 | Variability | Context-dependent meaning

Data Science Process (7 Steps):

  1. Problem Definition → 2. Data Collection → 3. Data Preparation → 4. EDA → 5. Model Building → 6. Evaluation → 7. Deployment

Scaling:

  • Vertical = bigger machine (limited).
  • Horizontal = more machines (Big Data approach).

HDFS:

  • NameNode = master (metadata only).
  • DataNode = worker (stores blocks).
  • Block size = 128 MB (default).
  • Replication factor = 3.
  • Fault tolerant via heartbeat + re-replication.

MapReduce:

Map → (key, value) pairs → Shuffle & Sort → Reduce → Final output

YARN (4 components):

  • Client → Resource Manager → Node Manager → MR Application Master.
  • Features: Resource Management, Scalability, Cluster Utilization, Flexibility.

HDFS Operations (commands):

  • Start: hadoop namenode -format → start-dfs.sh
  • List: hadoop fs -ls
  • Insert: -mkdir → -put /home/file.txt /user/input → -ls
  • Retrieve: -cat /user/output/outfile / -get /user/output/ /home/hadoop_tp/
  • Stop: stop-dfs.sh

Big Data Analytics Techniques (6):

A/B Testing · Data Fusion / Integration · Data Mining · Machine Learning · NLP · Statistics. Other: Spatial analysis, Predictive modelling, Association rule learning, Network analysis.

Big Data Platform: 5 Features

  1. Comprehensive · 2. Integrated · 3. Scalability · 4. Extensible · 5. Minimum Maintenance.

Types of Data: 4 Measurement Scales

Scale    | Order | Distance | True Zero | Example
Nominal  | ✘     | ✘        | ✘         | Gender
Ordinal  | ✔     | ✘        | ✘         | Ratings
Interval | ✔     | ✔        | ✘         | Celsius
Ratio    | ✔     | ✔        | ✔         | Weight

Hadoop (3 components):

  • HDFS: Distributed Storage + Scalability + 3-copy Replication + Compression.
  • MapReduce: Map → Shuffle/Sort → Reduce.
  • YARN: Resource Management + Scalability + Cluster Utilization + Flexibility.

Data Science Process: Domain Examples

Healthcare (readmission) · Finance (fraud) · Retail (churn): same 7 steps, different domain context.


Expected Exam Questions

15-Mark Questions:

  1. What is Big Data? Explain various characteristics, challenges and applications of Big Data. (2023)
  2. Explain the characteristics of Big Data and discuss how they contribute to the challenges in managing large volumes of data. (2024)
  3. Explain six V's of Big Data in detail. (2023)
  4. Define HDFS. Describe NameNode, DataNode and Block. Explain HDFS operations (start, list, insert, retrieve, stop) with commands in detail. (2022)
  5. What is HDFS? Explain its components. (2023)
  6. What is Big Data Platform? Describe the main features (Comprehensive, Integrated, Scalability, Extensible, Minimum Maintenance) of a big data platform in detail. (2022)
  7. Write short note on foundation for Big Data system. (2023)
  8. Define the different techniques in Big Data analytics (A/B Testing, Data Fusion, Data Mining, ML, NLP, Statistics). (2022)
  9. Discuss the following in detail: (i) Challenges in big data (ii) Types of Data (Nominal, Ordinal, Interval, Ratio measurement scales). (2022)
  10. Describe any five real life applications of Big Data. (2022)
  11. Write short note on real life applications of Big Data. (2023)

Mixed (8 + 7 marks):

  1. Describe the steps involved in the Data Science process. How does each step contribute to extracting value from Big Data? (2024, 8 marks)
  2. Illustrate with a real-world scenario (Healthcare / Finance / Retail) for steps involving Data Science process. (2024, 7 marks)

Short Answer Questions (2.5 marks):

  1. Explain Big Data. (2022)
  2. Explain YARN: its 4 components (Client, Resource Manager, Node Manager, MR Application Master) and features. (2022)
  3. Six V in Big Data. (2022)
  4. Briefly elaborate the six V of Big Data. (2024)
  5. Write short note on Data Sciences. (2023)
  6. Write short note on DFS. (2023)
  7. Enlist 5 challenges associated with managing and analyzing large volumes of data. (2024)

Short Note (≈ 7.5 marks):

  1. Write short note on Hadoop, covering HDFS (Storage + Replication + Compression), MapReduce (Map → Shuffle → Reduce), and YARN (Resource Manager). (2024)

These notes were compiled by Deepak Modi
Last updated: May 2026
