BDA – Semester 8

Unit 4: Big Data Integration and Processing

Big Data Processing, Data Query and Retrieval, Information Integration, Processing Pipelines, Analytical and Aggregation Operations, and Big Data Workflow Management Tools.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Big Data Integration and Processing: Big Data Processing, Retrieving: Data Query and retrieval, Information Integration, Big Data Processing pipelines, Analytical operations, Aggregation operation, High level Operation, Tools and Systems: Big Data workflow Management.


🎯 PYQ Analysis for Unit 4

PYQs will be added after analysis – check back soon.


Section 1: Big Data Processing

1.1 What is Big Data Processing?

Definition:

Big Data Processing is the act of collecting, transforming, analyzing, and generating insights from massive datasets using distributed computing systems. Since Big Data cannot fit on a single machine, processing is spread across a cluster of nodes working in parallel.


1.2 Types of Big Data Processing

1. Batch Processing

  • Data is collected over a time period and processed all at once.
  • High throughput, not real-time.
  • Suitable for large, stable datasets.
Collect data → Store → Process entire dataset → Output results

Example:
  Daily sales data β†’ Hadoop MapReduce β†’ Revenue report

Characteristics:

| Feature | Description |
|---|---|
| Latency | High (hours to days) |
| Throughput | Very high |
| Data size | Very large (entire dataset) |
| Use case | ETL jobs, billing, historical analysis |

Tools: Hadoop MapReduce, Apache Spark (batch mode), Apache Hive.
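
A minimal PySpark sketch of this batch pattern – read the full day's data, aggregate it in one pass, write a report. The file name daily_sales.csv and the columns product, amount are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily_revenue").getOrCreate()

# Read the entire day's data at once (a bounded dataset)
sales = spark.read.csv("daily_sales.csv", header=True, inferSchema=True)

# Aggregate the whole dataset in a single pass
report = sales.groupBy("product").agg(F.sum("amount").alias("revenue"))

# Persist the result for downstream consumers
report.write.mode("overwrite").parquet("revenue_report/")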


2. Real-Time / Stream Processing

  • Data is processed continuously as it arrives, event by event.
  • Very low latency (milliseconds to seconds).
  • Data is never fully "at rest" – processed on the fly.
Data arrives → Immediate processing → Immediate output

Example:
  Credit card transaction → Fraud model → Alert (within 200ms)

Characteristics:

| Feature | Description |
|---|---|
| Latency | Very low (ms–seconds) |
| Throughput | Lower than batch |
| Data size | Unbounded stream (infinite) |
| Use case | Fraud detection, live dashboards, IoT alerts |

Tools: Apache Kafka, Apache Flink, Apache Storm, Spark Streaming.
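
A minimal Spark Structured Streaming sketch of the same idea – the topic name transactions, the broker address, and the spark-sql-kafka connector on the classpath are assumptions, not part of the notes above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream_demo").getOrCreate()

# Unbounded source: events arrive continuously from a Kafka topic
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "transactions")
          .load())

# Running count per key – the result updates as new events arrive
counts = events.groupBy("key").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()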


3. Interactive / Ad-hoc Processing

  • User runs on-demand queries against stored data.
  • Results expected within seconds to minutes.
  • Analyst-driven exploration.
Analyst types query → Engine processes → Results shown in seconds

Example:
  SELECT region, SUM(sales) FROM data WHERE year=2025 GROUP BY region;

Tools: Apache Hive, Presto, Apache Impala, Google BigQuery, Spark SQL.


4. Graph Processing

  • Specialized processing for graph-structured data.
  • Traverses nodes and edges to find patterns.
  • Example: Finding shortest path, detecting communities.

Tools: Apache Spark GraphX, Apache Giraph, Neo4j.


1.3 Batch vs Stream Processing

| Aspect | Batch | Stream |
|---|---|---|
| Trigger | Scheduled or manual | Continuous |
| Data | Bounded (fixed size) | Unbounded (infinite) |
| Latency | Hours | Milliseconds |
| Complexity | Lower | Higher |
| Example | Nightly ETL | Live fraud detection |
| Tools | Hadoop, Hive | Kafka, Flink, Storm |

1.4 Lambda Architecture

Lambda Architecture combines batch and stream processing to get the best of both worlds.

          ┌──────────────────────────────┐
          │         Data Sources         │
          └──────────────┬───────────────┘
                         │
           ┌─────────────┴─────────────┐
           ▼                           ▼
   ┌───────────────┐          ┌────────────────────┐
   │  Batch Layer  │          │    Speed Layer     │
   │ (Hadoop/Spark,│          │  (Kafka + Flink,   │
   │   full data)  │          │ (real-time views)  │
   └───────┬───────┘          └────────┬───────────┘
           │                           │
           ▼                           ▼
   ┌─────────────────────────────────────────────────┐
   │        Serving Layer (Hive/HBase)               │
   │     merges batch views + real-time views        │
   └───────────────────────┬─────────────────────────┘
                           │
                           ▼
                     Query Results

Three Layers:

  • Batch Layer – processes all historical data, gives complete but delayed results.
  • Speed Layer – processes only recent data, gives fast but possibly incomplete results.
  • Serving Layer – merges batch and speed results, answers queries.

Section 2: Data Query and Retrieval

2.1 What is Data Querying?

Definition:

Data Query and Retrieval is the process of searching for and extracting specific data from a storage system based on conditions or criteria.


2.2 SQL-Based Querying on Big Data

Even though Big Data systems aren't always relational, many provide SQL interfaces because SQL is well-known and powerful.

SQL on Big Data Tools:

| Tool | SQL Interface | Underlying System |
|---|---|---|
| Apache Hive | HiveQL (SQL-like) | Hadoop / HDFS |
| Spark SQL | Standard SQL | Apache Spark |
| Presto | Standard SQL | Multiple data sources |
| Apache Impala | Standard SQL | HDFS / HBase |
| Google BigQuery | Standard SQL | Google Cloud |
| Amazon Athena | Standard SQL | Amazon S3 |

2.3 Query Execution in Distributed Systems

When a query runs on a Big Data cluster, it is broken into parallel tasks:

User Query:
  SELECT city, COUNT(*) FROM orders GROUP BY city

                    ▼

        Query Planner / Optimizer
                    │
         Split into parallel tasks
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
    Node 1       Node 2       Node 3
  (Process      (Process     (Process
  Shard 1)      Shard 2)     Shard 3)
       │            │            │
       └────────────┼────────────┘
                    ▼
              Merge & Aggregate
                    ▼
              Final Result

2.4 Indexing in Big Data

Indexing speeds up data retrieval by creating a fast lookup structure.

| Index Type | Description | Example |
|---|---|---|
| Primary Index | On primary key | User ID lookup |
| Secondary Index | On non-key columns | Search by city |
| Inverted Index | Word → document mapping | Full-text search (Elasticsearch) |
| Bitmap Index | Bit vector per value | Gender, status columns |
| Partitioned Index | Index per partition | Date-partitioned tables |
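
A toy Python sketch of how an inverted index maps words to documents (the two documents are made up for illustration):

from collections import defaultdict

docs = {
    1: "big data processing",
    2: "data query and retrieval",
}

# Build the word → set-of-document-ids mapping
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.split():
        index[word].add(doc_id)

print(index["data"])  # {1, 2} – every document containing "data"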

2.5 NoSQL Query Patterns

Different NoSQL systems have different query styles:

| System | Query Style | Example |
|---|---|---|
| MongoDB | Document query | db.users.find({city:"Jaipur"}) |
| Cassandra | CQL (SQL-like) | SELECT * FROM users WHERE id=001 |
| HBase | Row key + column scan | get 'users', 'row001' |
| Redis | GET/SET commands | GET user:001 |
| Neo4j | Cypher graph query | MATCH (u:User)-[:FRIEND]->(f) RETURN f |
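
As a concrete example, the MongoDB row above looks like this with the pymongo client (the connection string and database name are assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["appdb"]

# Document query: users in Jaipur, returning only name and city
for user in db.users.find({"city": "Jaipur"}, {"name": 1, "city": 1, "_id": 0}):
    print(user)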

Section 3: Information Integration

3.1 What is Information Integration?

Definition:

Information Integration is the process of combining data from multiple, heterogeneous sources into a unified view that can be queried and analyzed as if it were a single dataset.

Why needed?

  • Organizations have data in many places: RDBMS, cloud, files, APIs.
  • Analysts need a single consistent view.
  • Without integration: siloed data, inconsistent reports, duplicate effort.

3.2 Challenges of Information Integration

| Challenge | Description |
|---|---|
| Schema heterogeneity | Different sources use different field names / types |
| Semantic conflicts | "customer" in one system = "client" in another |
| Data format mismatch | CSV in one, JSON in another |
| Inconsistency | Same customer has different names in two systems |
| Scale | Integrating petabytes across dozens of sources |
| Latency | Keeping the integrated view up to date in real time |

3.3 Approaches to Information Integration

1. Data Warehousing (ETL-based)

  • Extract from all sources, transform to common schema, load into warehouse.
  • Centralized integrated view.
  • Batch, not real-time.

2. Data Virtualization

  • No data is physically moved.
  • A virtual layer translates queries to each source on the fly.
  • Real-time but complex.
User Query → Virtual Layer → Query Source A + Source B → Merge → Result

3. Data Federation

  • Similar to virtualization – a federated engine queries multiple sources simultaneously.
  • Tools: Presto, Apache Drill.

4. Data Lake Integration

  • Dump all raw data into a data lake.
  • Use schema-on-read to integrate at query time.

3.4 Master Data Management (MDM)

MDM ensures there is a single, authoritative, consistent version of key business data (customers, products, locations).

Multiple systems have "customer" data:
  CRM:     ID=C01, Name=Deepak Modi
  Billing: ID=B99, Name=D. Modi
  Website: ID=W445, Name=deepak.modi

MDM resolves these to:
  Master: ID=M001, Name=Deepak Modi (canonical)
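
A toy sketch of the matching step behind this resolution, using plain string similarity from Python's standard library; real MDM systems apply much richer matching and survivorship rules:

from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Rough similarity between two name strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

master = "Deepak Modi"
for candidate in ["D. Modi", "deepak.modi", "Priya Sharma"]:
    print(candidate, round(name_similarity(master, candidate), 2))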

Section 4: Big Data Processing Pipelines

4.1 What is a Processing Pipeline?

Definition:

A Big Data Processing Pipeline is a series of data processing steps chained together, where the output of one step is the input of the next, forming an automated data flow from raw ingestion to final output.

Ingest → Validate → Clean → Transform → Analyze → Store → Serve
  │         │         │         │          │        │       │
  ▼         ▼         ▼         ▼          ▼        ▼       ▼
Kafka    Schema    Remove   Aggregate   ML Model  HDFS   Dashboard
         check     nulls    by region             /S3

4.2 Pipeline Stages

| Stage | Activity | Tools |
|---|---|---|
| Ingestion | Collect data from sources | Kafka, Flume, Sqoop |
| Validation | Check data quality and schema | Great Expectations |
| Cleaning | Remove bad/missing data | Spark, Pandas |
| Transformation | Reshape, enrich, aggregate | Spark, dbt, Hive |
| Analysis | Apply models or queries | Spark MLlib, SQL |
| Storage | Save results | HDFS, S3, RDBMS |
| Serving | Expose to end users | APIs, dashboards |

4.3 Pipeline Patterns

1. ETL Pipeline

Source → Extract → Transform → Load → Warehouse
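
A minimal PySpark ETL sketch of this pattern – the paths raw_orders/ and warehouse/orders/ and the columns order_id, order_ts are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_pipeline").getOrCreate()

# Extract: pull raw data from the source
raw = spark.read.json("raw_orders/")

# Transform: clean and reshape toward the warehouse schema
clean = (raw.dropna(subset=["order_id"])
            .withColumn("order_date", F.to_date("order_ts")))

# Load: append into the warehouse table (Parquet as a stand-in)
clean.write.mode("append").parquet("warehouse/orders/")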

2. Streaming Pipeline

Events → Kafka → Flink/Spark Streaming → Real-time store → Dashboard

3. ML Pipeline

Raw data → Clean → Feature engineering → Train model → Evaluate → Deploy

4. Reverse ETL Pipeline

Data Warehouse → Extract → Load into CRM / Marketing Tool
(Insights pushed back to operational systems)

4.4 Pipeline Design Principles

  1. Idempotency – running the pipeline multiple times gives the same result (no duplicates).
  2. Fault Tolerance – pipeline recovers from failures without data loss.
  3. Modularity – each stage is independent and replaceable.
  4. Monitoring – every stage logs metrics and alerts on failure.
  5. Scalability – pipeline scales with data volume automatically.

Section 5: Analytical Operations

5.1 What are Analytical Operations?

Analytical operations are computations performed on data to derive insights, patterns, and knowledge.


5.2 Types of Analytical Operations

1. Descriptive Analytics

  • Describes what happened in the past.
  • Summary statistics: mean, median, mode, min, max.
  • Visualizations: charts, dashboards.
"Total sales in Q1 2026 were β‚Ή4.5 crore."

2. Diagnostic Analytics

  • Explains why something happened.
  • Drill-down analysis, correlation analysis.
"Sales dropped in March because of a supply chain disruption."

3. Predictive Analytics

  • Predicts what will happen in the future.
  • Uses ML models, regression, time-series forecasting.
"Sales in Q2 2026 are predicted to grow by 12%."

4. Prescriptive Analytics

  • Recommends what action to take.
  • Optimization algorithms, simulation.
"To maximize sales, increase inventory in Zone A by 20%."

Analytics Maturity Ladder:

   Prescriptive  ← What should we do?       (Most complex, most value)
       ▲
   Predictive    ← What will happen?
       ▲
   Diagnostic    ← Why did it happen?
       ▲
   Descriptive   ← What happened?            (Simplest, least value)

5.3 OLAP Operations

OLAP (Online Analytical Processing) is a set of operations for multidimensional data analysis.

OLAP Cube:

          Time
           │
           │  ──── Sales Data ────
           │
           └─────────────────────► Region
          /
         /
      Product

Key OLAP Operations:

| Operation | Description | Example |
|---|---|---|
| Roll-up | Aggregate to higher level | Day → Month → Year |
| Drill-down | Go to finer detail | Year → Month → Day |
| Slice | Fix one dimension, view others | Sales for 2025 only |
| Dice | Fix multiple dimensions | Sales in 2025, in North region |
| Pivot | Rotate the cube (change view) | Show products as rows, regions as columns |
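
The same operations can be sketched on a small pandas DataFrame (the sales figures are made up for illustration):

import pandas as pd

sales = pd.DataFrame({
    "year":   [2025, 2025, 2025, 2025],
    "month":  [1, 1, 2, 2],
    "region": ["North", "South", "North", "South"],
    "sales":  [100, 80, 120, 90],
})

# Roll-up: aggregate month-level data up to year level
rollup = sales.groupby(["year", "region"])["sales"].sum()

# Slice: fix one dimension (year = 2025), view the others
slice_2025 = sales[sales["year"] == 2025]

# Pivot: rotate the view – months as rows, regions as columns
pivot = sales.pivot_table(index="month", columns="region",
                          values="sales", aggfunc="sum")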

Section 6: Aggregation Operations

6.1 What is Aggregation?

Definition:

Aggregation is the process of combining multiple data values into a single summary value. It reduces a large dataset to meaningful statistics.


6.2 Common Aggregation Functions

| Function | Description | Example |
|---|---|---|
| COUNT | Number of records | COUNT(*) = 1000 rows |
| SUM | Total of values | SUM(sales) = ₹50,000 |
| AVG | Average value | AVG(age) = 25.3 |
| MIN | Minimum value | MIN(price) = ₹99 |
| MAX | Maximum value | MAX(price) = ₹9,999 |
| STDDEV | Standard deviation | Spread of salaries |
| MEDIAN | Middle value | Median income |
| GROUP BY | Aggregate per group | Sales per region |

6.3 Aggregation in Big Data Systems

MapReduce Aggregation

MAP:     Emit (region, sales_amount) for each transaction
REDUCE:  For each region, SUM all sales_amounts
OUTPUT:  (North, 50000), (South, 35000), (East, 42000)
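
The same job can be simulated in plain Python to make the map → shuffle → reduce flow concrete (the transaction list is made up):

from collections import defaultdict

transactions = [("North", 20000), ("South", 35000),
                ("North", 30000), ("East", 42000)]

# MAP: emit (region, sales_amount) pairs
mapped = [(region, amount) for region, amount in transactions]

# SHUFFLE + REDUCE: group by key, then sum each group
totals = defaultdict(int)
for region, amount in mapped:
    totals[region] += amount

print(dict(totals))  # {'North': 50000, 'South': 35000, 'East': 42000}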

Spark Aggregation

from pyspark.sql import functions as F

df.groupBy("region").agg(
    F.sum("sales").alias("total_sales"),
    F.count("order_id").alias("order_count"),
    F.avg("sales").alias("avg_sales")
)

Hive SQL Aggregation

SELECT region,
       SUM(sales)   AS total_sales,
       COUNT(*)     AS num_orders,
       AVG(sales)   AS avg_order
FROM sales_table
GROUP BY region
ORDER BY total_sales DESC;

6.4 Aggregation Challenges in Big Data

| Challenge | Mitigation |
|---|---|
| Data skew – one key has far more records than others, overloading a single reducer | Salt hot keys; use combiners for partial aggregation |
| Memory overflow – aggregation state too large for RAM | Spill to disk; aggregate partially before shuffling |
| Slow exact counts – exact counting over billions of rows is expensive | Approximate aggregation (e.g., HyperLogLog for distinct counts) |
| Streaming data – continuous input, can't wait for "all" of it | Windowed / incremental aggregation |

Section 7: High-Level Operations

7.1 What are High-Level Operations?

High-level operations in Big Data refer to complex, multi-step data transformations and analyses that go beyond simple CRUD or aggregation. They are often expressed in declarative languages and executed as optimized distributed jobs.


7.2 Key High-Level Operations

1. Join Operations

Combining data from two or more datasets.

| Join Type | Description |
|---|---|
| Inner Join | Only matching records from both datasets |
| Left Join | All records from left + matching from right |
| Right Join | All records from right + matching from left |
| Full Outer Join | All records from both, nulls where no match |
| Cross Join | Every combination (Cartesian product) |

Distributed Join Strategies:

  • Broadcast Join – small dataset sent to all nodes, avoiding a shuffle (see the sketch after this list).
  • Sort-Merge Join – both datasets sorted, then merged (large datasets).
  • Hash Join – one dataset hashed, other probes the hash table.
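
A PySpark sketch of a broadcast join, assuming a large orders/ table and a small regions/ lookup table that share a region_id column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("join_demo").getOrCreate()

orders = spark.read.parquet("orders/")    # large fact table
regions = spark.read.parquet("regions/")  # small lookup table

# Ship the small table to every node so the large one is never shuffled
joined = orders.join(broadcast(regions), on="region_id", how="inner")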

2. Sorting and Ordering

  • Distributed sorting requires shuffle – moving data between nodes.
  • Total Order Sorting – globally sorted across all nodes (expensive).
  • Partial Sort – sorted within each partition (cheaper).

3. Sampling

  • Taking a representative subset of data for faster analysis or testing.
Types:
  Random Sampling      → Pick N% of rows randomly
  Stratified Sampling  → Maintain class proportions
  Reservoir Sampling   → Sample from a stream of unknown size
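
Reservoir sampling in particular is short enough to sketch in plain Python (this is the classic Algorithm R):

import random

def reservoir_sample(stream, k):
    """Uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = random.randint(0, i)   # replace with decreasing probability
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))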

4. Filtering and Projection

  • Filtering (WHERE) – keep only rows matching a condition.
  • Projection (SELECT columns) – keep only needed columns.

In columnar formats (Parquet), projection pushdown means only the required columns are read from disk – a huge performance win.


5. Deduplication

  • Remove duplicate records from a large dataset.
  • Challenging at scale – cannot load everything into memory.
Techniques:
  Exact dedup     → Hash each row, remove identical hashes
  Fuzzy dedup     → Use similarity (e.g., "Deepak" ≈ "Deepk")
  Bloom filters   → Probabilistic set-membership check
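
A minimal PySpark sketch of exact and key-based dedup – the customers/ path and customer_id column are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup_demo").getOrCreate()
df = spark.read.parquet("customers/")

# Exact dedup: drop rows identical across all columns
unique_rows = df.dropDuplicates()

# Key-based dedup: keep one row per customer_id
unique_customers = df.dropDuplicates(["customer_id"])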

6. Windowing / Sliding Window Operations

Used in stream processing to analyze data within a time window.

Tumbling Window:   Fixed, non-overlapping windows
  [0–10s] [10–20s] [20–30s]

Sliding Window:    Overlapping windows
  [0–10s] [5–15s] [10–20s]

Session Window:    Based on activity gaps
  (group events close in time, split by inactivity)

Example:

Count fraud alerts in the last 5 minutes, updated every 1 minute.
→ Sliding window: size=5min, slide=1min
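
A Spark Structured Streaming sketch of that sliding window, using the built-in rate test source as a stand-in for a real alert stream:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("window_demo").getOrCreate()

# Test source emitting rows with a `timestamp` column
alerts = (spark.readStream.format("rate").load()
          .withColumnRenamed("timestamp", "event_time"))

# Sliding window: size = 5 minutes, slide = 1 minute
counts = (alerts
          .withWatermark("event_time", "10 minutes")
          .groupBy(F.window("event_time", "5 minutes", "1 minute"))
          .count())

query = counts.writeStream.outputMode("update").format("console").start()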

Section 8: Big Data Workflow Management

8.1 What is Workflow Management?

Definition:

Big Data Workflow Management is the orchestration and scheduling of complex, multi-step data processing jobs to ensure they run in the correct order, handle failures gracefully, and complete reliably.

A workflow (also called a DAG – Directed Acyclic Graph) defines the sequence and dependencies of tasks.

DAG Example:
  Extract_Data → Clean_Data → Transform → Train_Model → Evaluate → Deploy
  (tasks run in dependency order; no cycles allowed)

8.2 Why Workflow Management is Needed

  • Big Data pipelines have many dependent steps – step 3 can't run until step 2 finishes.
  • Jobs run on a schedule (daily, hourly).
  • Failures need to be detected and retried automatically.
  • Teams need visibility into what is running, what failed, and why.

8.3 Apache Airflow

Apache Airflow is the most popular open-source workflow orchestration tool for data pipelines.

Key Concepts:

| Concept | Description |
|---|---|
| DAG | Directed Acyclic Graph – the workflow definition |
| Task | One unit of work (e.g., run a Spark job) |
| Operator | Template for a task type (PythonOperator, BashOperator, SparkSubmitOperator) |
| Scheduler | Triggers DAGs on schedule |
| Executor | Runs the actual tasks (LocalExecutor, CeleryExecutor) |
| Web UI | Dashboard to monitor DAGs and task status |

Example DAG (simplified):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG("daily_etl", schedule_interval="@daily", start_date=datetime(2026, 1, 1)) as dag:
    extract = PythonOperator(task_id="extract", python_callable=run_extract)
    transform = SparkSubmitOperator(task_id="transform", application="transform.py")
    load = PythonOperator(task_id="load", python_callable=run_load)

    extract >> transform >> load   # define order

8.4 Apache Oozie

Apache Oozie is a workflow scheduler specifically for Hadoop jobs.

  • Defines workflows in XML.
  • Supports MapReduce, Hive, Pig, Sqoop jobs.
  • Coordinator jobs for time-based scheduling.

8.5 Other Workflow Tools

| Tool | Type | Key Feature |
|---|---|---|
| Apache Airflow | General orchestration | Python DAGs, rich UI |
| Apache Oozie | Hadoop-specific | XML workflows, Hadoop-native |
| Apache NiFi | Data flow management | Drag-and-drop pipeline builder |
| Prefect | Modern orchestration | Cloud-native, Python |
| Dagster | Data orchestration | Strong typing, testing |
| Luigi | Batch orchestration | Simple Python tasks |
| AWS Step Functions | Cloud workflow | Serverless state machine |
| dbt | Transformation | SQL-based transform DAGs |

8.6 Monitoring and Observability

A workflow system must also support monitoring:

| Metric | What to track |
|---|---|
| Job success rate | % of pipelines completing without failure |
| Latency | How long each stage takes |
| Data volume | Records processed per job |
| Queue depth | Kafka lag – how far behind consumers are |
| Resource usage | CPU, memory, disk on cluster nodes |
| Data quality | % of records passing validation |

Tools: Grafana, Prometheus, Apache Ambari, AWS CloudWatch, Datadog.


Quick Revision Points

Processing Types:

| Type | Latency | Tools |
|---|---|---|
| Batch | Hours | Hadoop, Hive, Spark batch |
| Stream | ms–seconds | Kafka, Flink, Spark Streaming |
| Interactive | Seconds | Presto, Spark SQL, BigQuery |

Lambda Architecture:

  • Batch Layer (historical) + Speed Layer (real-time) + Serving Layer (merge).

OLAP Operations:

  • Roll-up, Drill-down, Slice, Dice, Pivot.

Analytics Types:

  • Descriptive → Diagnostic → Predictive → Prescriptive.

Aggregation Functions:

  • COUNT, SUM, AVG, MIN, MAX, STDDEV, GROUP BY.

High-Level Operations:

  • Join, Sort, Sample, Filter, Project, Deduplicate, Window.

Windowing:

  • Tumbling (non-overlapping), Sliding (overlapping), Session (activity-based).

Workflow Management:

  • DAG = task dependency graph.
  • Apache Airflow = most popular orchestrator.
  • Oozie = Hadoop-specific.
  • NiFi = drag-and-drop flow management.

Expected Exam Questions

PYQs will be added after analysis – check back soon.


These notes were compiled by Deepak Modi
Last updated: May 2026
