Syllabus:
Introduction to Big Data: Big Data: Why and Where, Applications and Challenges, Characteristics of Big Data and Dimensions of Scalability, The Six V's, Data Science: Getting Value out of Big Data, Steps in the Data Science Process, Foundations for Big Data Systems and Programming, Distributed File Systems.
🎯 PYQ Analysis for Unit 1
PYQs will be added after analysis – check back soon.
Section 1: Big Data – Why and Where
1.1 What is Big Data?
Definition:
Big Data refers to extremely large and complex datasets that cannot be processed, stored, or analyzed using traditional data management tools (like regular databases or spreadsheets) within a reasonable time.
The term "Big Data" describes data that is:
- Too large to fit on one machine.
- Generated too fast to process in real time with traditional systems.
- Too varied in format to fit in a single table.
Simple Definition: Big Data is data that is so big, fast, or complex that traditional methods can't handle it.
Real-Life Scale:
Every day:
Google processes ≈ 8.5 billion searches
YouTube receives ≈ 500 hours of video uploaded per minute
Facebook generates ≈ 4 petabytes of data
Twitter sees ≈ 500 million tweets
WhatsApp sends ≈ 100 billion messages
1.2 Why Big Data? (The Need)
Data has always existed, but several forces have exploded the volume beyond what old tools can handle:
1. Digitization of Everything:
- Everything is now recorded digitally: transactions, clicks, GPS location, health sensors.
- Every smartphone is a data generator.
2. Internet and Social Media:
- Billions of people use the internet every day.
- Each like, comment, post, search, and purchase generates data.
3. IoT (Internet of Things):
- Smart devices (sensors, wearables, smart appliances) continuously stream data.
- Example: A smart factory has thousands of sensors generating data every second.
4. E-commerce and Digital Transactions:
- Every purchase, product view, and abandoned cart is recorded.
- Amazon processes millions of transactions per day.
5. Healthcare and Science:
- Genomics, medical imaging, clinical trials generate enormous datasets.
- A single human genome = ~3 GB of data.
6. Cheaper Storage and Computing:
- Cost of storing 1 GB dropped from thousands of dollars (1980) to fractions of a cent (now).
- Cloud computing made large-scale processing affordable.
1.3 Where Does Big Data Come From? (Sources)
Data Sources:
                  ┌─────────────────────┐
                  │  Big Data Sources   │
                  └──────────┬──────────┘
       ┌──────────────┬──────┴───────┬──────────────┐
       ▼              ▼              ▼              ▼
 Social Media   Machine / IoT   Transactional    Web / Logs
 ────────────   ─────────────   ─────────────    ──────────
 Facebook       Sensors         Bank records     Clickstreams
 Twitter        GPS devices     E-commerce       Server logs
 Instagram      Smart meters    Healthcare       Search queries
 LinkedIn       RFID tags       Insurance        App usage
Categories of Data Sources:
| Source Type | Examples | Data Type |
|---|---|---|
| Social Media | Facebook, Twitter, YouTube | Text, images, video |
| Machine / Sensor | IoT devices, GPS, RFID | Numeric streams |
| Transaction | Banks, e-commerce, hospitals | Structured records |
| Web | Logs, click-streams, search | Semi-structured |
| Scientific | Genomics, astronomy, climate | Numeric, images |
| Enterprise | ERP, CRM, supply chain | Structured |
Section 2: Applications and Challenges
2.1 Applications of Big Data
1. Healthcare:
- Disease prediction → analyzing patient history to predict illness.
- Drug discovery → finding drug interactions using genetic data.
- Epidemic tracking → monitoring disease spread (COVID-19 dashboards).
- Personalized medicine → treatment based on individual genome.
2. Finance and Banking:
- Fraud detection → real-time anomaly detection on credit card transactions.
- Risk assessment → evaluating loan default probability.
- Algorithmic trading → high-frequency trading decisions from market data.
- Customer 360 → full customer profile from all touchpoints.
3. Retail and E-commerce:
- Recommendation engines → Amazon/Netflix "you may also like".
- Demand forecasting → predicting inventory needs.
- Price optimization → dynamic pricing based on demand.
- Customer sentiment → analyzing reviews and social media.
4. Transportation:
- Traffic management → real-time rerouting (Google Maps).
- Predictive maintenance → airlines predicting engine failures before they happen.
- Ride-sharing optimization → Uber/Ola surge pricing and driver allocation.
- Self-driving cars → processing sensor data in real time.
5. Government and Smart Cities:
- Public safety → crime hotspot prediction.
- Energy management → smart grid optimization.
- Citizen services → tax fraud detection, welfare eligibility.
- Disaster response → resource allocation using real-time data.
6. Telecommunications:
- Churn prediction → identify customers likely to switch providers.
- Network optimization → identifying bottlenecks using call data records.
- Personalized plans → usage-based plan recommendations.
7. Manufacturing:
- Predictive maintenance → prevent machine breakdowns using sensor data.
- Quality control → defect detection using computer vision.
- Supply chain optimization → reducing delays and wastage.
8. Agriculture:
- Crop yield prediction using satellite + weather data.
- Precision farming → targeted irrigation and fertilization.
- Disease detection → identifying crop disease from drone imagery.
2.2 Challenges of Big Data
Despite its potential, Big Data comes with significant challenges:
1. Storage:
- Petabytes and exabytes of data need massive, scalable storage.
- Traditional RDBMSs cannot efficiently store or query semi-structured and unstructured data.
2. Processing Speed:
- Data arrives faster than it can be processed (streaming data).
- Batch processing is too slow for real-time use cases.
3. Data Variety:
- Data comes in structured, semi-structured, and unstructured formats.
- Integrating text, images, video, JSON, XML, CSV is complex.
4. Data Quality:
- Raw big data is often dirty → missing values, duplicates, inconsistencies.
- "Garbage In, Garbage Out" → poor quality input = poor insights.
5. Privacy and Security:
- Handling sensitive data (medical, financial) requires strict compliance.
- Risk of data breaches at scale.
- GDPR, HIPAA, and other regulations must be followed.
6. Talent Gap:
- Shortage of skilled data engineers, data scientists, and architects.
7. Cost:
- Infrastructure (servers, cloud, bandwidth) is expensive.
- ROI must justify the investment.
8. Data Governance:
- Deciding who owns data, who can access it, and how long to keep it.
Section 3: Characteristics of Big Data – The 6 V's
3.1 What are the V's of Big Data?
The characteristics of Big Data are described using V's. Originally there were 3 V's (Gartner, 2001), which expanded to 6 V's over time.
                      ┌───────────┐
                      │ BIG DATA  │
                      └─────┬─────┘
    ┌─────────┬─────────┬───┴─────┬─────────┬───────────┐
    ▼         ▼         ▼         ▼         ▼           ▼
 Volume   Velocity   Variety  Veracity   Value    Variability
3.2 The 6 V's – Detailed
V1 – Volume
Definition: The amount / size of data being generated and stored.
- We have moved from gigabytes → terabytes → petabytes → exabytes.
- Traditional systems crash under this load.
Scale Reference:
1 KB = 1,000 bytes
1 MB = 1,000 KB (one song)
1 GB = 1,000 MB (a movie)
1 TB = 1,000 GB (a library of books)
1 PB = 1,000 TB (Facebook stores ~100 PB)
1 EB = 1,000 PB (all internet traffic per month)
1 ZB = 1,000 EB (total global data in 2020 ≈ 40 ZB)
Challenge: Where to store this? → Distributed File Systems (HDFS)
V2 – Velocity
Definition: The speed at which data is generated and must be processed.
- Twitter generates ~6,000 tweets per second.
- Stock exchange processes millions of trades per second.
- IoT devices stream data continuously.
Types of Processing:
Batch Processing: Collect data → Process later (hours/days)
Example: Monthly billing reports
Stream Processing: Process data as it arrives (milliseconds)
Example: Fraud detection, live traffic
Challenge: How to process fast enough? → Apache Kafka, Apache Storm, Spark Streaming
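To make the batch/stream contrast concrete, here is a minimal plain-Python sketch (no real streaming framework; the transaction amounts are invented): the batch function waits for all data, while the stream function reacts to each event as it arrives.

```python
# Minimal sketch: the same data, two processing styles. Amounts are invented.
events = [12.0, 15.5, 9.9, 250.0, 14.2]  # hypothetical transaction amounts

def batch_average(all_events):
    """Batch: collect everything first, process later (hours/days)."""
    return sum(all_events) / len(all_events)

def stream_flag_fraud(event_stream, threshold=100.0):
    """Stream: act on each event the moment it arrives (milliseconds)."""
    for amount in event_stream:
        if amount > threshold:          # react immediately, per event
            print(f"ALERT: suspicious amount {amount}")

print("batch average:", batch_average(events))
stream_flag_fraud(iter(events))
```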
V3 – Variety
Definition: The diversity of data types and formats in Big Data.
Three Types of Data:
| Structured | Semi-Structured | Unstructured |
|---|---|---|
| Fixed schema | Flexible schema | No fixed schema |
| Rows & columns | Tags / key-value | Free-form |
| SQL databases | JSON, XML, CSV | Text, images, video |
| Example: Bank records | Example: Twitter API data | Example: Medical images |
Challenge: How to store and query all formats? → NoSQL databases, Data Lakes
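A small illustrative sketch of the three formats in Python, using only the standard library (the order data below is invented):

```python
# Minimal sketch: the same "order" represented in the three data formats.
import csv, json, io

# Structured: fixed columns, like a SQL row.
csv_text = "order_id,amount\n101,250.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: flexible, self-describing keys (JSON).
json_text = '{"order_id": 101, "amount": 250.0, "tags": ["gift", "rush"]}'
doc = json.loads(json_text)

# Unstructured: free-form text; needs NLP/vision techniques to interpret.
review = "Arrived late but the packaging was great!"

print(rows[0]["amount"], doc["tags"], len(review.split()))
```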
V4 – Veracity
Definition: The quality, accuracy, and trustworthiness of the data.
- Big Data is often noisy, incomplete, biased, or inconsistent.
- Wrong data leads to wrong conclusions – dangerous in healthcare or finance.
Sources of bad veracity:
- Sensor malfunction → wrong readings.
- User input errors → misspelled names, wrong dates.
- Social media noise → sarcasm, fake news, bots.
- Missing values → incomplete records.
Challenge: How to ensure data quality? → Data cleaning, validation pipelines, data governance
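A minimal sketch of what a veracity audit might look like in plain Python, counting missing values, duplicates, and implausible readings in an invented dataset:

```python
# Minimal veracity check: detect (not yet fix) quality problems.
records = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},    # missing value
    {"id": 1, "age": 34},      # duplicate id
    {"id": 3, "age": 212},     # implausible sensor/user reading
]

missing = sum(1 for r in records if r["age"] is None)
ids = [r["id"] for r in records]
duplicates = len(ids) - len(set(ids))
out_of_range = sum(
    1 for r in records
    if r["age"] is not None and not 0 <= r["age"] <= 120
)

print(f"missing={missing}, duplicates={duplicates}, out_of_range={out_of_range}")
```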
V5 – Value
Definition: The usefulness or business benefit derived from analyzing the data.
- Not all big data is valuable. Most raw data has low value density.
- The goal is to extract high-value insights from low-value-density data.
Value Chain:
Raw Data → Processed Data → Information → Knowledge → Wisdom → Business Value
Example: A billion web clicks have low individual value, but analyzed together they reveal purchasing patterns worth millions in targeted ads.
Challenge: How to extract value efficiently? → ML models, analytics platforms, BI tools
V6 – Variability
Definition: The inconsistency in data – the same data meaning different things at different times or in different contexts.
- A word like "bank" can mean a financial institution or a river bank.
- Sentiment of a tweet depends on context and time.
- Data formats change over time.
Difference from Variety:
- Variety = different types of data (text vs video).
- Variability = the same type of data having inconsistent meaning or format.
Challenge: Context-aware processing, NLP, semantic analysis.
3.3 Summary Table – The 6 V's
| V | Name | Question it Answers | Challenge |
|---|---|---|---|
| V1 | Volume | How much? | Storage, scalability |
| V2 | Velocity | How fast? | Real-time processing |
| V3 | Variety | What types? | Integration, NoSQL |
| V4 | Veracity | How accurate? | Data quality, cleaning |
| V5 | Value | How useful? | Insight extraction |
| V6 | Variability | How consistent? | Context, semantics |
Section 4: Dimensions of Scalability
4.1 What is Scalability?
Scalability is the ability of a system to handle growing amounts of work (more data, more users, more requests) by adding resources.
4.2 Two Types of Scaling
Vertical Scaling (Scale Up)
Before:              After:
┌───────────┐        ┌────────────────┐
│  Server   │  ───►  │  Bigger Server │
│  8 GB     │        │  64 GB RAM     │
│  4 cores  │        │  32 cores      │
└───────────┘        └────────────────┘
- Add more CPU, RAM, or disk to a single machine.
- Simple but has physical limits.
- Expensive at high end.
- Single point of failure.
Horizontal Scaling (Scale Out)
Before:              After:
┌───────────┐        ┌──────┐ ┌──────┐ ┌──────┐
│  Server   │  ───►  │  S1  │ │  S2  │ │  S3  │
└───────────┘        └──────┘ └──────┘ └──────┘
- Add more machines (nodes) to a cluster.
- Big Data systems use horizontal scaling (commodity hardware).
- Basis of Hadoop, Spark, NoSQL.
| | Vertical | Horizontal |
|---|---|---|
| How | Bigger machine | More machines |
| Cost | High per unit | Low (commodity) |
| Limit | Physical hardware limit | Virtually unlimited |
| Failure | Single point of failure | Fault tolerant |
| Big Data fit | Poor | Excellent |
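A toy illustration of the scale-out idea in Python: partition the data, let several workers process partitions in parallel, then combine the partial results. A real cluster spreads work across machines; `multiprocessing` spreads it across CPU cores, but the pattern is the same.

```python
# Minimal scale-out sketch: split work across parallel workers, then combine.
from multiprocessing import Pool

def partial_sum(chunk):
    # Each "node" computes a partial result on its own partition.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    # Partition the data across workers (here: round-robin striding).
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))  # combine partial results
    print(total)
```

Adding capacity then means adding workers (horizontal), not buying a bigger machine (vertical).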
4.3 Dimensions of Scalability in Big Data
| Dimension | What scales? | Example |
|---|---|---|
| Data Volume | Amount of stored data | Adding more HDFS nodes |
| Throughput | Data processed per second | More Spark workers |
| Query Speed | Response time for queries | Partitioning, indexing |
| Concurrency | Simultaneous users/jobs | Load balancing |
| Geographic | Multiple data centers | Cloud regions (AWS) |
Section 5: Data Science – Getting Value out of Big Data
5.1 What is Data Science?
Definition:
Data Science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
It combines:
- Statistics → mathematical analysis
- Computer Science → programming and algorithms
- Domain Knowledge → understanding of the specific field
Data Science vs Big Data:
| | Big Data | Data Science |
|---|---|---|
| Focus | Infrastructure to store and process large data | Methods to extract insights from data |
| Tools | Hadoop, Spark, HDFS | Python, R, ML, Statistics |
| Role | Data Engineer | Data Scientist |
| Output | Scalable pipelines | Actionable insights |
5.2 Steps in the Data Science Process
The data science process is a structured workflow for turning raw data into valuable insights.
Overview:
┌───────────────────┐
│   1. Problem      │
│    Definition     │
└─────────┬─────────┘
          ▼
┌───────────────────┐
│   2. Data         │
│    Collection     │
└─────────┬─────────┘
          ▼
┌───────────────────┐
│   3. Data         │
│    Preparation    │
└─────────┬─────────┘
          ▼
┌───────────────────┐
│   4. Exploratory  │
│    Analysis       │
└─────────┬─────────┘
          ▼
┌───────────────────┐
│   5. Model        │
│    Building       │
└─────────┬─────────┘
          ▼
┌───────────────────┐
│   6. Evaluation   │
└─────────┬─────────┘
          ▼
┌───────────────────┐
│   7. Deployment   │
│   & Storytelling  │
└───────────────────┘
5.3 Each Step Explained
Step 1: Problem Definition
- Clearly define the business question to answer.
- Example: "Which customers are most likely to churn next month?"
- Poorly defined problems = wasted effort.
Step 2: Data Collection
- Identify and gather data from relevant sources.
- Sources: databases, APIs, web scraping, sensors, surveys.
- Key question: Is there enough data to answer the question?
Step 3: Data Preparation (Wrangling / Cleaning)
- Real data is messy. This is the most time-consuming step (60–80% of effort).
- Tasks:
- Handle missing values.
- Remove duplicates.
- Fix format inconsistencies.
- Encode categorical variables.
- Normalize / scale features.
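The tasks above, as a minimal pandas sketch (assuming pandas is installed; the toy DataFrame and column names are invented for illustration):

```python
# Minimal Step-3 sketch: clean an invented toy dataset with pandas.
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 25, 40],
    "city":   ["Delhi", "Mumbai", "Delhi", "delhi"],
    "income": [30000, 45000, 30000, 80000],
})

df["city"] = df["city"].str.title()                # fix format inconsistencies
df = df.drop_duplicates()                          # remove duplicates
df["age"] = df["age"].fillna(df["age"].median())   # handle missing values
df = pd.get_dummies(df, columns=["city"])          # encode categorical variables
df["income"] = (df["income"] - df["income"].mean()) / df["income"].std()  # scale
print(df)
```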
Step 4: Exploratory Data Analysis (EDA)
- Understand the data before building models.
- Tasks:
- Summary statistics (mean, median, std).
- Data visualization (histograms, scatter plots, heatmaps).
- Find correlations and outliers.
- Choose appropriate ML techniques.
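A minimal EDA sketch with pandas (the toy data is invented; the plotting line assumes matplotlib is available):

```python
# Minimal EDA sketch: summary stats, correlations, and a simple outlier check.
import pandas as pd

df = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "score": [52, 55, 61, 70, 74]})

print(df.describe())   # summary statistics: mean, std, quartiles
print(df.corr())       # correlations between numeric columns

# Simple outlier check: points more than 2 standard deviations above the mean.
print(df[df["score"] > df["score"].mean() + 2 * df["score"].std()])

# df.plot.scatter(x="hours", y="score")  # visualization, if matplotlib is installed
```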
Step 5: Model Building
- Apply ML or statistical algorithms to the prepared data.
- Split data: training set + test set.
- Try multiple models, tune hyperparameters.
- Example: Decision tree, regression, clustering.
Step 6: Evaluation
- Measure how well the model performs on unseen data.
- Metrics: Accuracy, F1-score, RMSE, AUC (covered in ML Unit 4).
- If performance is poor → go back to Step 3 or Step 5.
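A minimal sketch covering Steps 5 and 6 together with scikit-learn (assuming it is installed; the synthetic dataset stands in for real prepared data):

```python
# Minimal Steps 5-6 sketch: split, train one candidate model, evaluate on
# held-out data. Synthetic data replaces a real prepared dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(max_depth=4)  # one candidate; tune max_depth
model.fit(X_train, y_train)                  # Step 5: build on training data

pred = model.predict(X_test)                 # Step 6: evaluate on unseen data
print("accuracy:", accuracy_score(y_test, pred))
print("F1:", f1_score(y_test, pred))
```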
Step 7: Deployment and Communication
- Deploy the model in a production system (web API, dashboard, app).
- Communicate findings to stakeholders using visualizations and storytelling.
- Monitor model over time for degradation.
Section 6: Foundations for Big Data Systems and Programming
6.1 What Makes a Big Data System?
A Big Data system must handle the 6 V's. The foundation consists of:
┌───────────────────────────────┐
│    Big Data System Stack      │
├───────────────────────────────┤
│  Analytics / ML Layer         │  (Spark MLlib, Hive, Pig)
├───────────────────────────────┤
│  Processing Layer             │  (MapReduce, Spark, Flink)
├───────────────────────────────┤
│  Storage Layer                │  (HDFS, S3, HBase, Cassandra)
├───────────────────────────────┤
│  Resource Management          │  (YARN, Kubernetes)
├───────────────────────────────┤
│  Hardware Layer               │  (Commodity servers, cloud)
└───────────────────────────────┘
6.2 Key Concepts
Cluster Computing
A cluster is a group of connected computers (nodes) that work together as one system.
┌───────────────────────┐
│      Master Node      │  ← manages and coordinates
└───────────┬───────────┘
     ┌──────┼──────┐
     ▼      ▼      ▼
┌────────┐ ┌────────┐ ┌────────┐
│Worker 1│ │Worker 2│ │Worker 3│  ← do the actual work
└────────┘ └────────┘ └────────┘
- Each worker stores and processes a portion of the data.
- Master node coordinates task assignment.
MapReduce Programming Model
MapReduce is the fundamental programming model for processing Big Data in parallel across a cluster.
Two phases:
MAP Phase:
Input Data → Split into chunks → Each chunk processed in parallel
→ Produces key-value pairs (intermediate output)
REDUCE Phase:
Group all values by key → Aggregate / combine values
→ Final output
Example β Word Count:
Input: "cat dog cat bird dog cat"
MAP output (key-value pairs):
(cat,1) (dog,1) (cat,1) (bird,1) (dog,1) (cat,1)
REDUCE (group by key, sum values):
cat → 3
dog → 2
bird → 1
Final Output:
cat:3, dog:2, bird:1
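The same word-count logic as a plain-Python sketch. A real MapReduce framework would run the map tasks in parallel across the cluster and shuffle intermediate pairs by key between the two phases; here all three stages run in one process for clarity.

```python
# Minimal word-count sketch following the MapReduce phases.
from collections import defaultdict

text = "cat dog cat bird dog cat"

# MAP: emit a (word, 1) pair for every word.
mapped = [(word, 1) for word in text.split()]

# SHUFFLE / GROUP: gather all values belonging to the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# REDUCE: aggregate each key's values.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)   # {'cat': 3, 'dog': 2, 'bird': 1}
```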
Data Replication
- Each piece of data is stored on multiple nodes (default: 3 copies in HDFS).
- If one node fails, data is still available from another.
- Provides fault tolerance.
Data Block A
┌────────────────────────────────────┐
│ Copy 1 → Node 1                    │
│ Copy 2 → Node 3  (different rack)  │
│ Copy 3 → Node 5                    │
└────────────────────────────────────┘
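A toy sketch of rack-aware replica placement, simplified from HDFS's actual policy (the node and rack names are invented):

```python
# Simplified replica placement: one copy on the local rack, two on a
# single remote rack, so both a node failure and a rack failure survive.
import random

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}

def place_replicas(local_rack="rack1", replication=3):
    first = random.choice(racks[local_rack])              # copy 1: local rack
    remote_rack = random.choice([r for r in racks if r != local_rack])
    second, third = random.sample(racks[remote_rack], 2)  # copies 2-3: remote rack
    return [first, second, third]

print(place_replicas())   # e.g. ['n2', 'n3', 'n4']
```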
6.3 Big Data Programming Tools
| Tool | Type | Purpose |
|---|---|---|
| Hadoop | Framework | Distributed storage + MapReduce processing |
| Apache Spark | Processing engine | Fast in-memory distributed processing |
| Apache Kafka | Messaging | Real-time data streaming |
| Apache Hive | Query tool | SQL-like queries on HDFS data |
| Apache HBase | NoSQL DB | Real-time read/write on HDFS |
| Apache Pig | Scripting | Data transformation using Pig Latin |
| Apache Flink | Stream processing | Low-latency stream analytics |
| YARN | Resource manager | Cluster resource allocation |
Section 7: Distributed File Systems
7.1 What is a Distributed File System?
Definition:
A Distributed File System (DFS) is a file system that stores data across multiple machines in a network, but makes it look like a single unified file system to the user.
Why needed?
- A single disk cannot hold petabytes of data.
- A distributed file system spreads data across hundreds or thousands of nodes.
- Provides scalability, fault tolerance, and high throughput.
7.2 HDFS – Hadoop Distributed File System
HDFS is the most widely used distributed file system for Big Data, and the storage backbone of the Hadoop ecosystem.
Key Design Goals:
- Store very large files (GB to TB per file).
- Run on commodity hardware (cheap, standard servers).
- Detect and recover from hardware failures automatically.
- Optimized for batch processing (high throughput over low latency).
7.3 HDFS Architecture
┌───────────────────────────────┐
│          NameNode             │  ← Master
│  (metadata: file locations)   │
└───────────────┬───────────────┘
        ┌───────┼────────┐
        ▼       ▼        ▼
┌──────────┐ ┌──────────┐ ┌──────────┐
│DataNode 1│ │DataNode 2│ │DataNode 3│  ← Workers
│ Block A  │ │ Block B  │ │ Block A  │
│ Block C  │ │ Block A  │ │ Block B  │
└──────────┘ └──────────┘ └──────────┘
Two Types of Nodes:
| Node | Role |
|---|---|
| NameNode (Master) | Stores metadata → which blocks are on which DataNode. Does NOT store actual data. |
| DataNode (Worker) | Stores actual data blocks. Reports heartbeat to NameNode. |
7.4 How HDFS Works
Writing a File:
1. Client contacts NameNode → "I want to write file.txt"
2. NameNode assigns block locations across DataNodes.
3. Client writes data block-by-block to DataNodes.
4. Each block is replicated to 3 DataNodes (default replication factor = 3).
5. DataNodes send confirmation.
6. NameNode updates metadata.
Reading a File:
1. Client contacts NameNode → "I want to read file.txt"
2. NameNode returns list of DataNodes holding each block.
3. Client reads blocks directly from DataNodes (parallel).
4. Blocks are assembled into the complete file.
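As a hedged illustration of this flow, the third-party `hdfs` Python package (a WebHDFS client, not part of Hadoop itself) exposes a write/read API along these lines; the NameNode address, port, and user below are placeholders for a real cluster.

```python
# Hedged sketch: assumes the third-party `hdfs` package (pip install hdfs)
# and a reachable cluster; host "namenode", port 9870, and user "hadoop"
# are hypothetical placeholders.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870", user="hadoop")

# Write: the client asks the NameNode for block locations, then streams
# the data to DataNodes, which replicate it among themselves.
client.write("/data/file.txt", data=b"hello big data", overwrite=True)

# Read: blocks are fetched directly from the DataNodes that hold them.
with client.read("/data/file.txt") as reader:
    print(reader.read())
```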
7.5 HDFS Key Concepts
Block Size
- HDFS divides files into large fixed-size blocks (default: 128 MB in modern Hadoop).
- A 1 GB file → 8 blocks of 128 MB each.
- Large blocks reduce NameNode metadata overhead.
file.txt (512 MB)
┌──────┬──────┬──────┬──────┐
│Block1│Block2│Block3│Block4│   each = 128 MB
└──────┴──────┴──────┴──────┘
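A quick sanity check of the block arithmetic in Python (the last block of a file may be smaller than 128 MB):

```python
# Minimal sketch: how many 128 MB blocks a file of a given size needs.
import math

BLOCK_MB = 128

def num_blocks(file_mb):
    return math.ceil(file_mb / BLOCK_MB)

print(num_blocks(512))    # 4 blocks, as in the diagram above
print(num_blocks(1024))   # 8 blocks for a 1 GB file
print(num_blocks(200))    # 2 blocks: one full, one partial (72 MB)
```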
Replication
- Each block is copied to 3 DataNodes (default replication factor = 3).
- Strategy: 1 copy on the local rack, 2 copies together on a different rack (rack-aware placement).
- Protects against both node failure and rack failure.
Fault Tolerance
- DataNodes send heartbeat signals every 3 seconds to NameNode.
- If the NameNode doesn't hear from a DataNode for too long (roughly 10 minutes by default) → it marks the node dead.
- NameNode automatically re-replicates the lost blocks from remaining copies.
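A toy sketch of heartbeat-based failure detection in Python; the 3-second interval comes from HDFS, while the 30-second timeout here is shortened for illustration (the real default is around 10 minutes).

```python
# Minimal failure-detection sketch: track last heartbeat per DataNode and
# flag nodes that have gone silent past a timeout. Node names are invented.
import time

last_heartbeat = {
    "dn1": time.time(),        # healthy: just reported in
    "dn2": time.time() - 45,   # silent for 45 s -> past the toy timeout
}

def dead_nodes(timeout=30):
    now = time.time()
    return [node for node, t in last_heartbeat.items() if now - t > timeout]

for node in dead_nodes():
    print(f"{node} presumed dead -> re-replicate its blocks elsewhere")
```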
7.6 HDFS Features and Limitations
Features:
✅ Handles files of size GB to PB.
✅ Runs on cheap commodity hardware.
✅ Automatic fault tolerance through replication.
✅ High throughput for batch reads.
✅ Scales horizontally → just add more DataNodes.
Limitations:
❌ High latency → not suited for real-time (online) applications.
❌ Not good for small files (wastes NameNode metadata space).
❌ Does not support random writes → files are write-once, read-many.
❌ NameNode is a single point of failure (addressed by HA NameNode in Hadoop 2+).
7.7 Other Distributed File Systems
| DFS | By | Key Features |
|---|---|---|
| HDFS | Apache/Hadoop | Open source, MapReduce native |
| Amazon S3 | AWS | Object storage, highly available |
| Google GFS | Google | Inspired the design of HDFS |
| Azure Blob | Microsoft | Cloud object storage |
| Ceph | Open source | Block + object + file storage |
| GlusterFS | Red Hat | Network-attached storage |
Quick Revision Points
Big Data – Why & Where:
- Data explosion from social media, IoT, digitization, e-commerce.
- Sources: social, sensor, transactional, web, scientific.
6 V's:
| V | Name | Key Point |
|---|---|---|
| V1 | Volume | Petabytes of data |
| V2 | Velocity | Real-time streaming |
| V3 | Variety | Structured / semi / unstructured |
| V4 | Veracity | Data quality and accuracy |
| V5 | Value | Business insight extraction |
| V6 | Variability | Context-dependent meaning |
Data Science Process (7 Steps):
1. Problem Definition → 2. Data Collection → 3. Data Preparation → 4. EDA → 5. Model Building → 6. Evaluation → 7. Deployment
Scaling:
- Vertical = bigger machine (limited).
- Horizontal = more machines (Big Data approach).
HDFS:
- NameNode = master (metadata only).
- DataNode = worker (stores blocks).
- Block size = 128 MB (default).
- Replication factor = 3.
- Fault tolerant via heartbeat + re-replication.
MapReduce:
Map → (key, value) pairs → Shuffle & Sort → Reduce → Final output
Expected Exam Questions
PYQs will be added after analysis – check back soon.
These notes were compiled by Deepak Modi
Last updated: May 2026