BDA – Semester 8

Unit 1: Introduction to Big Data

Why and Where Big Data, Applications and Challenges, Characteristics (6 V's), Dimensions of Scalability, Data Science process, Foundations of Big Data Systems, and Distributed File Systems.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Introduction to Big Data: Big Data: Why and Where, Application and Challenges, Characteristics of Big Data and Dimensions of Scalability, The Six V, Data Science: Getting Value out of Big Data, Steps in the Data science process, Foundations for Big Data Systems and Programming, Distributed file systems.


🎯 PYQ Analysis for Unit 1

PYQs will be added after analysis – check back soon.


Section 1: Big Data – Why and Where

1.1 What is Big Data?

Definition:

Big Data refers to extremely large and complex datasets that cannot be processed, stored, or analyzed using traditional data management tools (like regular databases or spreadsheets) within a reasonable time.

The term "Big Data" describes data that is:

  • Too large to fit on one machine.
  • Generated too fast to process in real time with traditional systems.
  • Too varied in format to fit in a single table.

Simple Definition: Big Data is data that is so big, fast, or complex that traditional methods can't handle it.

Real-Life Scale:

Every day:
  Google processes   → 8.5 billion searches
  YouTube receives   → 500 hours of video uploaded per minute
  Facebook generates → 4 petabytes of data
  Twitter sees       → 500 million tweets
  WhatsApp carries   → 100 billion messages

1.2 Why Big Data? (The Need)

Data has always existed, but several forces have pushed its volume far beyond what old tools can handle:

1. Digitization of Everything:

  • Everything is now recorded digitally – transactions, clicks, GPS location, health sensors.
  • Every smartphone is a data generator.

2. Internet and Social Media:

  • Billions of people use the internet every day.
  • Each like, comment, post, search, and purchase generates data.

3. IoT (Internet of Things):

  • Smart devices (sensors, wearables, smart appliances) continuously stream data.
  • Example: A smart factory has thousands of sensors generating data every second.

4. E-commerce and Digital Transactions:

  • Every purchase, product view, and abandoned cart is recorded.
  • Amazon processes millions of transactions per day.

5. Healthcare and Science:

  • Genomics, medical imaging, clinical trials generate enormous datasets.
  • A single human genome = ~3 GB of data.

6. Cheaper Storage and Computing:

  • Cost of storing 1 GB dropped from thousands of dollars (1980) to fractions of a cent (now).
  • Cloud computing made large-scale processing affordable.

1.3 Where Does Big Data Come From? (Sources)

Data Sources:

                    ┌──────────────────────────┐
                    │     Big Data Sources     │
                    └────────────┬─────────────┘
         ┌──────────────┬───────┴────────┬──────────────┐
         ▼              ▼                ▼              ▼
   Social Media    Machine / IoT    Transactional    Web / Logs
   ───────────    ─────────────    ─────────────    ──────────
   Facebook       Sensors          Bank records     Clickstreams
   Twitter        GPS devices      E-commerce       Server logs
   Instagram      Smart meters     Healthcare       Search queries
   LinkedIn       RFID tags        Insurance        App usage

Categories of Data Sources:

| Source Type      | Examples                     | Data Type           |
|------------------|------------------------------|---------------------|
| Social Media     | Facebook, Twitter, YouTube   | Text, images, video |
| Machine / Sensor | IoT devices, GPS, RFID       | Numeric streams     |
| Transaction      | Banks, e-commerce, hospitals | Structured records  |
| Web              | Logs, click-streams, search  | Semi-structured     |
| Scientific       | Genomics, astronomy, climate | Numeric, images     |
| Enterprise       | ERP, CRM, supply chain       | Structured          |

Section 2: Applications and Challenges

2.1 Applications of Big Data

1. Healthcare:

  • Disease prediction – analyzing patient history to predict illness.
  • Drug discovery – finding drug interactions using genetic data.
  • Epidemic tracking – monitoring disease spread (COVID-19 dashboards).
  • Personalized medicine – treatment based on an individual's genome.

2. Finance and Banking:

  • Fraud detection – real-time anomaly detection on credit card transactions.
  • Risk assessment – evaluating loan default probability.
  • Algorithmic trading – high-frequency trading decisions from market data.
  • Customer 360 – full customer profile from all touchpoints.

3. Retail and E-commerce:

  • Recommendation engines – Amazon/Netflix "you may also like".
  • Demand forecasting – predicting inventory needs.
  • Price optimization – dynamic pricing based on demand.
  • Customer sentiment – analyzing reviews and social media.

4. Transportation:

  • Traffic management – real-time rerouting (Google Maps).
  • Predictive maintenance – airlines predicting engine failures before they happen.
  • Ride-sharing optimization – Uber/Ola surge pricing and driver allocation.
  • Self-driving cars – processing sensor data in real time.

5. Government and Smart Cities:

  • Public safety – crime hotspot prediction.
  • Energy management – smart grid optimization.
  • Citizen services – tax fraud detection, welfare eligibility.
  • Disaster response – resource allocation using real-time data.

6. Telecommunications:

  • Churn prediction – identifying customers likely to switch providers.
  • Network optimization – identifying bottlenecks using call data records.
  • Personalized plans – usage-based plan recommendations.

7. Manufacturing:

  • Predictive maintenance – preventing machine breakdowns using sensor data.
  • Quality control – defect detection using computer vision.
  • Supply chain optimization – reducing delays and wastage.

8. Agriculture:

  • Crop yield prediction using satellite + weather data.
  • Precision farming – targeted irrigation and fertilization.
  • Disease detection – identifying crop disease from drone imagery.

2.2 Challenges of Big Data

Despite its potential, Big Data comes with significant challenges:

1. Storage:

  • Petabytes and exabytes of data need massive, scalable storage.
  • Traditional RDBMSs cannot efficiently store or query semi-structured and unstructured data.

2. Processing Speed:

  • Data arrives faster than it can be processed (streaming data).
  • Batch processing is too slow for real-time use cases.

3. Data Variety:

  • Data comes in structured, semi-structured, and unstructured formats.
  • Integrating text, images, video, JSON, XML, CSV is complex.

4. Data Quality:

  • Raw big data is often dirty – missing values, duplicates, inconsistencies.
  • "Garbage In, Garbage Out" – poor-quality input produces poor insights.

5. Privacy and Security:

  • Handling sensitive data (medical, financial) requires strict compliance.
  • Risk of data breaches at scale.
  • GDPR, HIPAA, and other regulations must be followed.

6. Talent Gap:

  • Shortage of skilled data engineers, data scientists, and architects.

7. Cost:

  • Infrastructure (servers, cloud, bandwidth) is expensive.
  • ROI must justify the investment.

8. Data Governance:

  • Deciding who owns data, who can access it, and how long to keep it.

Section 3: Characteristics of Big Data – The 6 V's

3.1 What are the V's of Big Data?

The characteristics of Big Data are described using V's. Originally there were 3 V's – Volume, Velocity, and Variety (Gartner, 2001) – which expanded to 6 V's over time.

                      ┌─────────────┐
                      │  BIG DATA   │
                      └──────┬──────┘
          ┌──────────────────┼──────────────────┐
          ▼          ▼       ▼      ▼     ▼     ▼
       Volume    Velocity  Variety Veracity Value Variability

3.2 The 6 V's β€” Detailed

V1 – Volume

Definition: The amount / size of data being generated and stored.

  • We have moved from gigabytes → terabytes → petabytes → exabytes.
  • Traditional systems crash under this load.

Scale Reference:

1 KB  = 1,000 bytes
1 MB  = 1,000 KB     (one song)
1 GB  = 1,000 MB     (a movie)
1 TB  = 1,000 GB     (a library of books)
1 PB  = 1,000 TB     (Facebook stores ~100 PB)
1 EB  = 1,000 PB     (all internet traffic per month)
1 ZB  = 1,000 EB     (total global data in 2020 ≈ 40 ZB)

Challenge: Where to store this? → Distributed File Systems (HDFS)


V2 – Velocity

Definition: The speed at which data is generated and must be processed.

  • Twitter generates ~6,000 tweets per second.
  • Stock exchange processes millions of trades per second.
  • IoT devices stream data continuously.

Types of Processing:

Batch Processing:   Collect data → Process later (hours/days)
                    Example: Monthly billing reports

Stream Processing:  Process data as it arrives (milliseconds)
                    Example: Fraud detection, live traffic

Challenge: How to process fast enough? → Apache Kafka, Apache Storm, Spark Streaming
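
To make the batch vs stream distinction concrete, here is a minimal Python sketch (framework-free; the event source and the alert rule are invented for illustration) that processes the same feed both ways:

```python
# Batch vs stream on the same event feed.
# event_source() is a hypothetical stand-in for a live feed (e.g., card transactions).

def event_source(n=10):
    for i in range(n):
        yield {"id": i, "amount": 100 + i}

# Batch: collect everything first, process later (e.g., a monthly report).
events = list(event_source())
total = sum(e["amount"] for e in events)          # runs once, at the end
print("batch total:", total)

# Stream: process each event the moment it arrives (e.g., fraud detection).
running_total = 0
for e in event_source():
    running_total += e["amount"]                  # updated per event
    if e["amount"] > 105:                         # illustrative alert rule
        print("alert on event:", e["id"])
```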


V3 – Variety

Definition: The diversity of data types and formats in Big Data.

Three Types of Data:

┌──────────────────┬───────────────────┬─────────────────────┐
│    Structured    │  Semi-Structured  │    Unstructured     │
├──────────────────┼───────────────────┼─────────────────────┤
│ Fixed schema     │ Flexible schema   │ No fixed schema     │
│ Rows & columns   │ Tags / key-value  │ Free-form           │
│ SQL databases    │ JSON, XML, CSV    │ Text, images, video │
│ Example:         │ Example:          │ Example:            │
│ Bank records     │ Twitter API data  │ Medical images      │
└──────────────────┴───────────────────┴─────────────────────┘

Challenge: How to store and query all formats? → NoSQL databases, Data Lakes
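
A short Python sketch of handling all three data types; the sample records are made up for illustration:

```python
import csv
import io
import json

# Structured: fixed schema, rows and columns (fits an RDBMS table).
csv_text = io.StringIO("id,name,balance\n1,Asha,5000\n2,Ravi,7200")
rows = list(csv.DictReader(csv_text))

# Semi-structured: flexible key-value pairs; schema can vary per record.
tweet = json.loads('{"user": "a1", "text": "big data!", "tags": ["bigdata"]}')

# Unstructured: free-form text; needs NLP (or vision, for images) to analyze.
review = "The product arrived late but works great."

print(rows[0]["name"], tweet["tags"][0], len(review.split()))
```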


V4 – Veracity

Definition: The quality, accuracy, and trustworthiness of the data.

  • Big Data is often noisy, incomplete, biased, or inconsistent.
  • Wrong data leads to wrong conclusions → dangerous in healthcare or finance.

Sources of bad veracity:

  • Sensor malfunction → wrong readings.
  • User input errors → misspelled names, wrong dates.
  • Social media noise → sarcasm, fake news, bots.
  • Missing values → incomplete records.

Challenge: How to ensure data quality? → Data cleaning, validation pipelines, data governance


V5 – Value

Definition: The usefulness or business benefit derived from analyzing the data.

  • Not all big data is valuable. Most raw data has low value density.
  • The goal is to extract high-value insights from low-value-density data.

Value Chain:

Raw Data → Processed Data → Information → Knowledge → Wisdom → Business Value

Example: A billion web clicks have low individual value, but analyzed together they reveal purchasing patterns worth millions in targeted ads.

Challenge: How to extract value efficiently? → ML models, analytics platforms, BI tools


V6 – Variability

Definition: The inconsistency in data – the same data meaning different things at different times or in different contexts.

  • A word like "bank" can mean a financial institution or a river bank.
  • Sentiment of a tweet depends on context and time.
  • Data formats change over time.

Difference from Variety:

  • Variety = different types of data (text vs video).
  • Variability = the same type of data having inconsistent meaning or format.

Challenge: Context-aware processing, NLP, semantic analysis.


3.3 Summary Table – The 6 V's

| V  | Name        | Question it Answers | Challenge              |
|----|-------------|---------------------|------------------------|
| V1 | Volume      | How much?           | Storage, scalability   |
| V2 | Velocity    | How fast?           | Real-time processing   |
| V3 | Variety     | What types?         | Integration, NoSQL     |
| V4 | Veracity    | How accurate?       | Data quality, cleaning |
| V5 | Value       | How useful?         | Insight extraction     |
| V6 | Variability | How consistent?     | Context, semantics     |

Section 4: Dimensions of Scalability

4.1 What is Scalability?

Scalability is the ability of a system to handle growing amounts of work (more data, more users, more requests) by adding resources.

4.2 Two Types of Scaling

Vertical Scaling (Scale Up)

Before:            After:
┌─────────┐        ┌───────────────┐
│ Server  │  ──►   │ Bigger Server │
│ 8 GB    │        │ 64 GB RAM     │
│ 4 cores │        │ 32 cores      │
└─────────┘        └───────────────┘
  • Add more CPU, RAM, or disk to a single machine.
  • Simple but has physical limits.
  • Expensive at high end.
  • Single point of failure.

Horizontal Scaling (Scale Out)

Before:          After:
┌─────────┐      ┌─────┐ ┌─────┐ ┌─────┐
│ Server  │ ──►  │ S1  │ │ S2  │ │ S3  │
└─────────┘      └─────┘ └─────┘ └─────┘
  • Add more machines (nodes) to a cluster.
  • Big Data systems use horizontal scaling (commodity hardware).
  • Basis of Hadoop, Spark, NoSQL.

|              | Vertical                | Horizontal          |
|--------------|-------------------------|---------------------|
| How          | Bigger machine          | More machines       |
| Cost         | High per unit           | Low (commodity)     |
| Limit        | Physical hardware limit | Virtually unlimited |
| Failure      | Single point of failure | Fault tolerant      |
| Big Data fit | Poor                    | Excellent           |

4.3 Dimensions of Scalability in Big Data

| Dimension   | What scales?              | Example                |
|-------------|---------------------------|------------------------|
| Data Volume | Amount of stored data     | Adding more HDFS nodes |
| Throughput  | Data processed per second | More Spark workers     |
| Query Speed | Response time for queries | Partitioning, indexing |
| Concurrency | Simultaneous users/jobs   | Load balancing         |
| Geographic  | Multiple data centers     | Cloud regions (AWS)    |

Section 5: Data Science – Getting Value out of Big Data

5.1 What is Data Science?

Definition:

Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

It combines:

  • Statistics – mathematical analysis
  • Computer Science – programming and algorithms
  • Domain Knowledge – understanding of the specific field

Data Science vs Big Data:

|        | Big Data                                       | Data Science                          |
|--------|------------------------------------------------|---------------------------------------|
| Focus  | Infrastructure to store and process large data | Methods to extract insights from data |
| Tools  | Hadoop, Spark, HDFS                            | Python, R, ML, Statistics             |
| Role   | Data Engineer                                  | Data Scientist                        |
| Output | Scalable pipelines                             | Actionable insights                   |

5.2 Steps in the Data Science Process

The data science process is a structured workflow for turning raw data into valuable insights.

Overview:

   ┌─────────────────┐
   │ 1. Problem      │
   │    Definition   │
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │ 2. Data         │
   │    Collection   │
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │ 3. Data         │
   │    Preparation  │
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │ 4. Exploratory  │
   │    Analysis     │
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │ 5. Model        │
   │    Building     │
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │ 6. Evaluation   │
   └────────┬────────┘
            ▼
   ┌─────────────────┐
   │ 7. Deployment   │
   │  & Storytelling │
   └─────────────────┘

5.3 Each Step Explained

Step 1: Problem Definition

  • Clearly define the business question to answer.
  • Example: "Which customers are most likely to churn next month?"
  • Poorly defined problems lead to wasted effort.

Step 2: Data Collection

  • Identify and gather data from relevant sources.
  • Sources: databases, APIs, web scraping, sensors, surveys.
  • Key question: Is there enough data to answer the question?

Step 3: Data Preparation (Wrangling / Cleaning)

  • Real data is messy. This is the most time-consuming step (60–80% of effort).
  • Tasks:
    • Handle missing values.
    • Remove duplicates.
    • Fix format inconsistencies.
    • Encode categorical variables.
    • Normalize / scale features.
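
A minimal pandas sketch of the tasks above; the file name and its columns (age, signup, plan) are hypothetical, for illustration only:

```python
import pandas as pd

df = pd.read_csv("customers_raw.csv")     # hypothetical raw dataset

df = df.drop_duplicates()                                     # remove duplicates
df["age"] = df["age"].fillna(df["age"].median())              # handle missing values
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")  # fix format inconsistencies
df = pd.get_dummies(df, columns=["plan"])                     # encode categorical variable

# Min-max scale all numeric features to [0, 1] (simple normalization).
num = df.select_dtypes("number")
df[num.columns] = (num - num.min()) / (num.max() - num.min())
```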

Step 4: Exploratory Data Analysis (EDA)

  • Understand the data before building models.
  • Tasks:
    • Summary statistics (mean, median, std).
    • Data visualization (histograms, scatter plots, heatmaps).
    • Find correlations and outliers.
    • Choose appropriate ML techniques.
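
A small EDA sketch in pandas/matplotlib, continuing the hypothetical dataset from Step 3:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers_clean.csv")   # hypothetical output of Step 3

print(df.describe())                      # summary statistics (mean, std, quartiles)
print(df.corr(numeric_only=True))         # pairwise correlations

df.hist(figsize=(8, 6))                   # one histogram per numeric column
plt.show()

# Crude outlier check: count values beyond 3 standard deviations, per column.
num = df.select_dtypes("number")
print(((num - num.mean()).abs() > 3 * num.std()).sum())
```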

Step 5: Model Building

  • Apply ML or statistical algorithms to the prepared data.
  • Split data: training set + test set.
  • Try multiple models, tune hyperparameters.
  • Example: Decision tree, regression, clustering.

Step 6: Evaluation

  • Measure how well the model performs on unseen data.
  • Metrics: Accuracy, F1-score, RMSE, AUC (covered in ML Unit 4).
  • If performance is poor → go back to Step 3 or Step 5.
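
A minimal scikit-learn sketch covering Steps 5 and 6 together; the synthetic dataset stands in for prepared churn data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

# Synthetic stand-in for a prepared feature matrix X and labels y
# (e.g., churned = 1, stayed = 0).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Step 5: hold out a test set, then fit a candidate model on the training set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)

# Step 6: evaluate on unseen (test) data.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("F1-score:", f1_score(y_test, y_pred))
```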

Step 7: Deployment and Communication

  • Deploy the model in a production system (web API, dashboard, app).
  • Communicate findings to stakeholders using visualizations and storytelling.
  • Monitor model over time for degradation.

Section 6: Foundations for Big Data Systems and Programming

6.1 What Makes a Big Data System?

A Big Data system must handle the 6 V's. The foundation consists of:

┌─────────────────────────────────────┐
│        Big Data System Stack        │
├─────────────────────────────────────┤
│  Analytics / ML Layer               │  (Spark MLlib, Hive, Pig)
├─────────────────────────────────────┤
│  Processing Layer                   │  (MapReduce, Spark, Flink)
├─────────────────────────────────────┤
│  Storage Layer                      │  (HDFS, S3, HBase, Cassandra)
├─────────────────────────────────────┤
│  Resource Management                │  (YARN, Kubernetes)
├─────────────────────────────────────┤
│  Hardware Layer                     │  (Commodity servers, cloud)
└─────────────────────────────────────┘

6.2 Key Concepts

Cluster Computing

A cluster is a group of connected computers (nodes) that work together as one system.

       ┌──────────────────────┐
       │     Master Node      │  ← manages and coordinates
       └──────────┬───────────┘
        ┌─────────┼─────────┐
        ▼         ▼         ▼
   ┌────────┐ ┌────────┐ ┌────────┐
   │Worker 1│ │Worker 2│ │Worker 3│  ← do the actual work
   └────────┘ └────────┘ └────────┘
  • Each worker stores and processes a portion of the data.
  • Master node coordinates task assignment.

MapReduce Programming Model

MapReduce is the fundamental programming model for processing Big Data in parallel across a cluster.

Two phases:

MAP Phase:
  Input Data → Split into chunks → Each chunk processed in parallel
  → Produces key-value pairs (intermediate output)

REDUCE Phase:
  Group all values by key → Aggregate / combine values
  → Final output

Example – Word Count:

Input: "cat dog cat bird dog cat"

MAP output (key-value pairs):
  (cat,1) (dog,1) (cat,1) (bird,1) (dog,1) (cat,1)

REDUCE (group by key, sum values):
  cat  → 3
  dog  → 2
  bird → 1

Final Output:
  cat:3, dog:2, bird:1
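
The same word count as a single-machine Python sketch of the Map → Shuffle → Reduce pipeline (a real MapReduce job runs the map and reduce calls in parallel across cluster nodes):

```python
from collections import defaultdict
from itertools import chain

chunks = ["cat dog cat", "bird dog cat"]          # input already split into chunks

# MAP: each chunk independently emits (word, 1) pairs.
def mapper(chunk):
    return [(word, 1) for word in chunk.split()]

mapped = chain.from_iterable(mapper(c) for c in chunks)

# SHUFFLE & SORT: group all intermediate values by key.
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# REDUCE: aggregate the grouped values for each key.
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)                                     # {'cat': 3, 'dog': 2, 'bird': 1}
```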

Data Replication

  • Each piece of data is stored on multiple nodes (default: 3 copies in HDFS).
  • If one node fails, data is still available from another.
  • Provides fault tolerance.
  Data Block A
  ┌─────────────────────────────────────┐
  │  Copy 1 → Node 1                    │
  │  Copy 2 → Node 3   (different rack) │
  │  Copy 3 → Node 5                    │
  └─────────────────────────────────────┘

6.3 Big Data Programming Tools

| Tool         | Type              | Purpose                                    |
|--------------|-------------------|--------------------------------------------|
| Hadoop       | Framework         | Distributed storage + MapReduce processing |
| Apache Spark | Processing engine | Fast in-memory distributed processing      |
| Apache Kafka | Messaging         | Real-time data streaming                   |
| Apache Hive  | Query tool        | SQL-like queries on HDFS data              |
| Apache HBase | NoSQL DB          | Real-time read/write on HDFS               |
| Apache Pig   | Scripting         | Data transformation using Pig Latin        |
| Apache Flink | Stream processing | Low-latency stream analytics               |
| YARN         | Resource manager  | Cluster resource allocation                |

Section 7: Distributed File Systems

7.1 What is a Distributed File System?

Definition:

A Distributed File System (DFS) is a file system that stores data across multiple machines in a network, but makes it look like a single unified file system to the user.

Why needed?

  • A single disk cannot hold petabytes of data.
  • A distributed file system spreads data across hundreds or thousands of nodes.
  • Provides scalability, fault tolerance, and high throughput.

7.2 HDFS – Hadoop Distributed File System

HDFS is the most widely used distributed file system for Big Data, and the storage backbone of the Hadoop ecosystem.

Key Design Goals:

  1. Store very large files (GB to TB per file).
  2. Run on commodity hardware (cheap, standard servers).
  3. Detect and recover from hardware failures automatically.
  4. Optimized for batch processing (high throughput over low latency).

7.3 HDFS Architecture

          ┌─────────────────────────────┐
          │          NameNode           │  ← Master
          │  (metadata: file locations) │
          └──────────────┬──────────────┘
                         │
          ┌──────────────┼──────────────┐
          ▼              ▼              ▼
    ┌──────────┐   ┌──────────┐   ┌──────────┐
    │DataNode 1│   │DataNode 2│   │DataNode 3│  ← Workers
    │ Block A  │   │ Block B  │   │ Block A  │
    │ Block C  │   │ Block A  │   │ Block B  │
    └──────────┘   └──────────┘   └──────────┘

Two Types of Nodes:

| Node              | Role                                                                               |
|-------------------|------------------------------------------------------------------------------------|
| NameNode (Master) | Stores metadata – which blocks live on which DataNode. Does NOT store actual data. |
| DataNode (Worker) | Stores the actual data blocks. Reports heartbeats to the NameNode.                 |

7.4 How HDFS Works

Writing a File:

1. Client contacts NameNode → "I want to write file.txt"
2. NameNode assigns block locations across DataNodes.
3. Client writes data block-by-block to DataNodes.
4. Each block is replicated to 3 DataNodes (default replication factor = 3).
5. DataNodes send confirmation.
6. NameNode updates metadata.

Reading a File:

1. Client contacts NameNode → "I want to read file.txt"
2. NameNode returns list of DataNodes holding each block.
3. Client reads blocks directly from DataNodes (parallel).
4. Blocks are assembled into the complete file.
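
In practice, a client usually goes through the `hdfs dfs` command-line tool that ships with Hadoop, which hides all of the NameNode/DataNode interaction. A sketch driving it from Python, assuming a running Hadoop cluster (the paths are illustrative):

```python
import subprocess

# Write: the client asks the NameNode for block locations, then streams
# blocks to DataNodes; `hdfs dfs -put` does all of this behind the scenes.
subprocess.run(["hdfs", "dfs", "-put", "file.txt", "/data/file.txt"], check=True)

# Read: blocks are fetched from DataNodes and reassembled transparently.
result = subprocess.run(["hdfs", "dfs", "-cat", "/data/file.txt"],
                        capture_output=True, check=True)
print(result.stdout.decode())

# List the directory to confirm the write.
subprocess.run(["hdfs", "dfs", "-ls", "/data"], check=True)
```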

7.5 HDFS Key Concepts

Block Size

  • HDFS divides files into large fixed-size blocks (default: 128 MB in modern Hadoop).
  • A 1 GB file → 8 blocks of 128 MB each.
  • Large blocks reduce NameNode metadata overhead.
  file.txt (512 MB)
  ┌──────┬──────┬──────┬──────┐
  │Block1│Block2│Block3│Block4│  each = 128 MB
  └──────┴──────┴──────┴──────┘
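
The block count is just a ceiling division; a quick check of the diagram's numbers:

```python
import math

BLOCK_MB = 128                                 # HDFS default block size
REPLICATION = 3                                # HDFS default replication factor

file_mb = 512                                  # file.txt from the diagram
blocks = math.ceil(file_mb / BLOCK_MB)
print(blocks, "blocks")                        # 4 blocks

# Raw storage consumed across the cluster, counting all replicas:
print(blocks * REPLICATION * BLOCK_MB, "MB")   # 1536 MB
```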

Replication

  • Each block is copied to 3 DataNodes (default replication factor = 3).
  • Strategy: 1 copy on local rack, 2 copies on different racks.
  • Protects against both node failure and rack failure.

Fault Tolerance

  • DataNodes send heartbeat signals every 3 seconds to NameNode.
  • If NameNode doesn't hear from a DataNode → it assumes the node is dead.
  • NameNode automatically re-replicates the lost blocks from remaining copies.

7.6 HDFS Features and Limitations

Features:

✅ Handles files of size GB to PB.
✅ Runs on cheap commodity hardware.
✅ Automatic fault tolerance through replication.
✅ High throughput for batch reads.
✅ Scales horizontally – just add more DataNodes.

Limitations:

❌ High latency – not suited for real-time (online) applications.
❌ Not good for small files (wastes NameNode metadata space).
❌ Does not support random writes – files are write-once, read-many.
❌ NameNode is a single point of failure (addressed by HA NameNode in Hadoop 2+).


7.7 Other Distributed File Systems

| DFS        | By            | Key Features                     |
|------------|---------------|----------------------------------|
| HDFS       | Apache/Hadoop | Open source, MapReduce native    |
| Amazon S3  | AWS           | Object storage, highly available |
| Google GFS | Google        | Inspired HDFS design             |
| Azure Blob | Microsoft     | Cloud object storage             |
| Ceph       | Open source   | Block + object + file storage    |
| GlusterFS  | Red Hat       | Network-attached storage         |

Quick Revision Points

Big Data – Why & Where:

  • Data explosion from social media, IoT, digitization, e-commerce.
  • Sources: social, sensor, transactional, web, scientific.

6 V's:

| V  | Name        | Key Point                        |
|----|-------------|----------------------------------|
| V1 | Volume      | Petabytes of data                |
| V2 | Velocity    | Real-time streaming              |
| V3 | Variety     | Structured / semi / unstructured |
| V4 | Veracity    | Data quality and accuracy        |
| V5 | Value       | Business insight extraction      |
| V6 | Variability | Context-dependent meaning        |

Data Science Process (7 Steps):

  1. Problem Definition → 2. Data Collection → 3. Data Preparation → 4. EDA → 5. Model Building → 6. Evaluation → 7. Deployment

Scaling:

  • Vertical = bigger machine (limited).
  • Horizontal = more machines (Big Data approach).

HDFS:

  • NameNode = master (metadata only).
  • DataNode = worker (stores blocks).
  • Block size = 128 MB (default).
  • Replication factor = 3.
  • Fault tolerant via heartbeat + re-replication.

MapReduce:

Map → (key, value) pairs → Shuffle & Sort → Reduce → Final output

Expected Exam Questions

PYQs will be added after analysis – check back soon.


These notes were compiled by Deepak Modi
Last updated: May 2026

Found an error or want to contribute?

This content is open-source and maintained by the community. Help us improve it!