BDA, Semester 8

Unit 4: Big Data Integration and Processing

Big Data Processing, Data Query and Retrieval, Information Integration, Processing Pipelines, Analytical and Aggregation Operations, and Big Data Workflow Management Tools.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Big Data Integration and Processing: Big Data Processing, Retrieving: Data Query and retrieval, Information Integration, Big Data Processing pipelines, Analytical operations, Aggregation operation, High level Operation, Tools and Systems: Big Data workflow Management.


🎯 PYQ Analysis for Unit 4

High Priority Topics (15-mark questions): ✅ Full Coverage

  1. MapReduce: Working + Word Count Program (2023: 15 marks) → covered in §1.5
  2. Hive: Features, Integration, Workflow, Architecture (2023: 15 marks) → covered in §9
  3. Pig: Architecture, Pig Latin Commands (2023: 15 marks) → covered in §10
  4. Real Time Analytics: technologies, working (2022: 15 marks) → covered in §1.6
  5. Big Data Processing Pipelines: components, working (2024: 15 marks) → covered in §4
  6. Hive Query Language (HQL) (2023: 15 marks) → covered in §9.7

Medium Priority Topics (Short answers)

  1. MapReduce (definition), 2022 (2.5 marks) → covered in §1.5
  2. Big Data Processing Pipeline (definition), 2024 (2.5 marks) → covered in §4.1

Unit 4 PYQ coverage is now comprehensive: all major 15-mark questions from 2022, 2023, and 2024 have dedicated sections.


Section 1: Big Data Processing

PYQ: Explain MapReduce. (2022, 2.5 marks)
PYQ: Explain MapReduce working. Write a program for Word Count using MapReduce. (2023, 15 marks)
PYQ: What is Real Time Analytics? Discuss their technologies in detail. (2022, 15 marks)

1.1 What is Big Data Processing?

Definition:

Big Data Processing is the act of collecting, transforming, analyzing, and generating insights from massive datasets using distributed computing systems. Since Big Data cannot fit on a single machine, processing is spread across a cluster of nodes working in parallel.


1.2 Types of Big Data Processing

1. Batch Processing

  • Data is collected over a time period and processed all at once.
  • High throughput, not real-time.
  • Suitable for large, stable datasets.
Collect data → Store → Process entire dataset → Output results

Example:
  Daily sales data → Hadoop MapReduce → Revenue report

Characteristics:

Feature    | Description
Latency    | High (hours to days)
Throughput | Very high
Data size  | Very large (entire dataset)
Use case   | ETL jobs, billing, historical analysis

Tools: Hadoop MapReduce, Apache Spark (batch mode), Apache Hive.


2. Real-Time / Stream Processing

  • Data is processed continuously as it arrives, event by event.
  • Very low latency (milliseconds to seconds).
  • Data is never fully "at rest"; it is processed on the fly.
Data arrives → Immediate processing → Immediate output

Example:
  Credit card transaction → Fraud model → Alert (within 200ms)

Characteristics:

Feature    | Description
Latency    | Very low (ms–seconds)
Throughput | Lower than batch
Data size  | Unbounded stream (infinite)
Use case   | Fraud detection, live dashboards, IoT alerts

Tools: Apache Kafka, Apache Flink, Apache Storm, Spark Streaming.


3. Interactive / Ad-hoc Processing

  • User runs on-demand queries against stored data.
  • Results expected within seconds to minutes.
  • Analyst-driven exploration.
Analyst types query → Engine processes → Results shown in seconds

Example:
  SELECT region, SUM(sales) FROM data WHERE year=2025 GROUP BY region;

Tools: Apache Hive, Presto, Apache Impala, Google BigQuery, Spark SQL.


4. Graph Processing

  • Specialized processing for graph-structured data.
  • Traverses nodes and edges to find patterns.
  • Example: Finding shortest path, detecting communities.

Tools: Apache Spark GraphX, Apache Giraph, Neo4j.


1.3 Batch vs Stream Processing

Aspect     | Batch               | Stream
Trigger    | Scheduled or manual | Continuous
Data       | Bounded (fixed size)| Unbounded (infinite)
Latency    | Hours               | Milliseconds
Complexity | Lower               | Higher
Example    | Nightly ETL         | Live fraud detection
Tools      | Hadoop, Hive        | Kafka, Flink, Storm

1.4 Lambda Architecture

Lambda Architecture combines batch and stream processing to get the best of both worlds.

                ┌──────────────────────┐
                │     Data Sources     │
                └──────────┬───────────┘
                           │
            ┌──────────────┴──────────────┐
            ▼                             ▼
   ┌─────────────────┐           ┌─────────────────┐
   │   Batch Layer   │           │   Speed Layer   │
   │ (Hadoop/Spark,  │           │ (Kafka + Flink, │
   │   full data)    │           │ real-time views)│
   └────────┬────────┘           └────────┬────────┘
            │                             │
            └──────────────┬──────────────┘
                           ▼
                ┌──────────────────────┐
                │    Serving Layer     │
                │     (Hive/HBase)     │
                └──────────┬───────────┘
                           ▼
                     Query Results

Three Layers:

  • Batch Layer: processes all historical data, gives complete but delayed results.
  • Speed Layer: processes only recent data, gives fast but possibly incomplete results.
  • Serving Layer: merges batch and speed results, answers queries.
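
The way the three layers cooperate can be sketched in a few lines of plain Python. This is a toy illustration, not how Hadoop or Flink are actually invoked: the function names (batch_view, speed_view, serve) and the page-view events are assumptions made up for the example.

```python
from collections import Counter

def batch_view(events):
    """Batch layer: recompute complete counts from all historical events."""
    return Counter(e["page"] for e in events)

def speed_view(recent_events):
    """Speed layer: count only events the last batch run has not seen yet."""
    return Counter(e["page"] for e in recent_events)

def serve(batch, speed):
    """Serving layer: answer a query by merging both views."""
    return dict(batch + speed)

# Hypothetical page-view events
historical = [{"page": "home"}, {"page": "cart"}, {"page": "home"}]
recent = [{"page": "home"}, {"page": "checkout"}]

merged = serve(batch_view(historical), speed_view(recent))
print(merged)
```

When the next batch run absorbs the recent events, the speed view for that period can be discarded, which is why the speed layer only ever holds a small amount of data.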

1.5 MapReduce: Detailed Working & Word Count Program

PYQ: Explain MapReduce. (2022, 2.5 marks)
PYQ: Explain MapReduce working. Write a program for Word Count using MapReduce. (2023, 15 marks)

Definition:

MapReduce is a programming model and execution framework for processing very large datasets in parallel across a distributed cluster. It splits the job into two user-defined functions, Map (transforms input into intermediate key-value pairs) and Reduce (aggregates intermediate values per key), while the framework itself handles parallelism, fault tolerance, and data distribution.


Part A: Seven Phases of MapReduce Working

   Input File
       │
       ▼
┌──────────────┐
│ Input Phase  │  Record Reader → (key, value) pairs
└──────┬───────┘
       ▼
┌──────────────┐
│  Map Phase   │  user-defined map() → intermediate (k, v) pairs
└──────┬───────┘
       ▼
┌────────────────────┐
│ Intermediate Keys  │  output of all mappers
└──────┬─────────────┘
       ▼
┌──────────────────────────┐
│ Combiner (OPTIONAL)      │  local mini-reducer per mapper
└──────┬───────────────────┘
       ▼
┌──────────────────────┐
│  Shuffle & Sort      │  pull, group, sort by key
└──────┬───────────────┘
       ▼
┌──────────────┐
│   Reducer    │  user-defined reduce() → final (k, v) pairs
└──────┬───────┘
       ▼
┌──────────────┐
│ Output Phase │  Output Formatter + Record Writer → file
└──────────────┘

1. Input Phase

  • A Record Reader translates each record in the input file into a parsed form.
  • Sends the parsed records to the mapper as key-value pairs (e.g., (byte_offset, line_text)).

2. Map Phase

  • The user-defined map() function takes a series of key-value pairs as input.
  • Processes each pair and generates zero or more output key-value pairs.
  • Runs in parallel on every chunk (input split) of the dataset.

3. Intermediate Keys

  • The key-value pairs generated by the mapper are called intermediate keys.
  • They are written to the local disk of each mapper node before being shuffled.

4. Combiner (Optional)

  • A local reducer that runs on each mapper's output and groups pairs that share the same key.
  • Aggregates values in a small scope (one mapper), which reduces the volume of data transferred during the shuffle.
  • Not part of the main MapReduce algorithm; it is an optimization step.
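
To see why a combiner helps, the sketch below simulates one mapper in plain Python and compares how many pairs would cross the network with and without local combining. The helper names map_words and combine are illustrative assumptions, not Hadoop API calls.

```python
from collections import Counter

def map_words(line):
    """Map: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum counts locally on one mapper before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

mapper_output = map_words("cat dog cat bird dog cat")
combined = combine(mapper_output)

print(len(mapper_output), "pairs before combining")
print(len(combined), "pairs after combining:", combined)
```

Summing is safe to apply locally because addition is associative and commutative; an operation without those properties (a plain average, for instance) cannot be reused as a combiner unchanged.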

5. Shuffle and Sort

  • The Reducer task downloads the grouped key-value pairs from all mappers to its local machine.
  • The pairs are sorted by key into a larger data list.
  • The list groups equivalent keys together so their values can be iterated easily inside the reducer.

6. Reducer

  • Takes the grouped key-value paired data as input.
  • Runs the user-defined reduce() function on each group.
  • Data can be aggregated, filtered, or combined in many ways.
  • Produces zero or more output key-value pairs.

7. Output Phase

  • An Output Formatter translates the final key-value pairs from the Reducer.
  • Writes them to a file in HDFS using a Record Writer.

Part B: Word Count Program (Pseudocode)

function MAP(String key, String value):
    // key is document name, value is the document content
    for each word w in value:
        emit(w, 1)

function REDUCE(String key, Iterator values):
    // key has a word name, values is a list of numbers
    sum = 0
    for each v in values:
        sum += v
    emit(key, sum)

Explanation of Word Count using MapReduce:

  • Input: a collection of documents (key = document name, value = entire document as ASCII string).
  • The map function takes each (document-name, document) pair, traverses the document, and emits a (word, 1) pair for every word it finds.
  • The middle (shuffle/sort) stage sorts and groups these counts by word name so that all 1s for the same word arrive at the same reducer.
  • The reduce function adds up all the counts for each word and outputs a (word, total_count) tuple.
  • emit simply outputs the key-value pair: from the map function it goes to a reducer; from the reduce function it is written back to the distributed file system.

Part C: Sample Run β€” "cat dog cat bird dog cat"

Input:

Document D1 = "cat dog cat bird dog cat"

Step 1: Map output (per word, emit 1):

(cat, 1)  (dog, 1)  (cat, 1)  (bird, 1)  (dog, 1)  (cat, 1)

Step 2: Shuffle & Sort (group by key):

bird → [1]
cat  → [1, 1, 1]
dog  → [1, 1]

Step 3: Reduce output (sum the list):

(bird, 1)
(cat,  3)
(dog,  2)

Final result: cat = 3, dog = 2, bird = 1.
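
The sample run can be reproduced with a small single-process simulation in Python. This is only a sketch of the data flow: a real job would be written against the Hadoop API (usually in Java), and the function names map_phase, shuffle_and_sort, and reduce_phase are assumptions chosen to mirror the steps above.

```python
from collections import defaultdict

def map_phase(doc_name, text):
    """Map: emit (word, 1) for every word in the document."""
    for word in text.split():
        yield (word, 1)

def shuffle_and_sort(pairs):
    """Shuffle & sort: group values by key, then order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(word, counts):
    """Reduce: sum the list of 1s for a single word."""
    return (word, sum(counts))

def word_count(documents):
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_phase(name, text))
    return [reduce_phase(word, counts)
            for word, counts in shuffle_and_sort(intermediate)]

result = word_count({"D1": "cat dog cat bird dog cat"})
print(result)  # [('bird', 1), ('cat', 3), ('dog', 2)]
```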


1.6 Real-Time Analytics: Detailed Technologies & Working

PYQ: What is Real Time Analytics? Discuss their technologies in detail. (2022, 15 marks)

What is Real-Time Analytics?

Real-time analytics in Big Data is the process of analyzing data as soon as it is generated. It lets users view, examine, and interpret data as it enters a system. Logic and mathematics are applied to that data so users can make decisions in real time.

  • Transforms decision-making: businesses can respond dynamically to changing market conditions, customer behaviors, and operational challenges.
  • Real-time analytics applications respond to queries within seconds and must handle large, high-velocity data with low response time.
  • Widely used in financial systems to inform trading decisions.
  • Analytics can be on-demand (results are delivered when the user requests them) or continuous (results are updated automatically as events occur).

Examples of real-time customer analytics:

  1. Viewing orders as they happen for better tracking and trend identification.
  2. Continually updating customer activity (page views, shopping cart use) to understand user behavior.
  3. Targeting customers with promotions as they shop for items in a store, influencing their decisions in real time.

Real-Time Analytics: Working (5 Component Areas)

┌────────────────────────────────────────────────────────────┐
│ 1. Data Ingestion (Kafka, Kinesis, RabbitMQ)               │
│                              ↓                             │
│ 2. Data Processing Engines (Flink, Storm, Spark Streaming) │
│                              ↓                             │
│ 3. Real-Time Querying (Druid, ClickHouse, Redshift)        │
│                              ↓                             │
│ 4. Data Storage (InfluxDB, MongoDB, Cassandra, HBase)      │
│                              ↓                             │
│ 5. Visualization & Alerts (Grafana, Kibana, Tableau)       │
└────────────────────────────────────────────────────────────┘

1. Data Ingestion

  • Continuous Data Collection: systems continuously collect data from sensors, IoT devices, social media feeds, transaction logs, and application databases in structured, semi-structured, and unstructured formats.
  • Stream Processing: data is ingested as streams, captured and processed in real-time as it arrives. Tools such as Apache Kafka, RabbitMQ, and Amazon Kinesis handle high-throughput streams reliably.

2. Data Processing Engines

  • Stream Processing Platforms: Apache Flink, Apache Storm, and Spark Streaming handle continuous data flows; they perform complex event processing, transformations, aggregations, and filtering in real-time.
  • In-Memory Processing: data is processed in memory rather than on disk, which keeps latency low and significantly speeds up processing.
  • Parallel Processing: distributes the workload across multiple nodes and processors to handle large volumes efficiently.

3. Real-Time Querying

  • Low-Latency Query Engines: Apache Druid, ClickHouse, and Amazon Redshift Spectrum run queries on streaming data with minimal delay; they are optimized for low-latency execution.
  • Complex Queries: users can perform joins, aggregations, window functions, and pattern matching on streaming data for sophisticated real-time analysis.

4. Data Storage

  • Time-Series Databases: InfluxDB, TimescaleDB, and OpenTSDB are optimized for time-stamped data; they efficiently store and retrieve real-time data points.
  • NoSQL Databases: MongoDB, Cassandra, and HBase for unstructured and semi-structured data; flexible storage that scales horizontally.

5. Visualization Tools & Alerts

  • Dashboards / BI Tools: Tableau, Power BI, Grafana, and Kibana provide interactive and customizable visualizations to monitor and analyze data in real-time.
  • Alerts and Notifications: configurable triggers based on predefined conditions or thresholds; enable proactive responses to system failures, security breaches, and significant business-metric changes.

Real-Time Analytics Technologies (List)

Technology                      | Role
Streaming data processing       | Continuous data flow from one or more sources at high speed
Event streaming platforms       | Process data events as they occur
Apache Kafka                    | Event streaming platform providing a data store for ingesting/processing streams
Apache Storm                    | Processes streaming data; identifies patterns and correlations
Apache Flink                    | Open-source technology for stream processing
Apache Pinot                    | Open-source technology for real-time analytics
Machine Learning                | Detects trends, makes faster decisions, automates actions and recommendations
Event-Driven Architecture (EDA) | Architectural pattern letting applications respond to events in real-time
Data Pipeline                   | Set of steps that ingests, processes, and transforms data from various sources into an analysis-ready format

Section 2: Data Query and Retrieval

PYQ: What is Hive? Explain Hive features, Hive integration and workflow and architecture in detail. (2023, 15 marks)
PYQ: Write short note on Hive Query Language (HQL). (2023, 15 marks)
PYQ: Write short note on Pig architecture and commands. (2023, 15 marks)

2.1 What is Data Querying?

Definition:

Data Query and Retrieval is the process of searching for and extracting specific data from a storage system based on conditions or criteria.


2.2 SQL-Based Querying on Big Data

Even though Big Data systems aren't always relational, many provide SQL interfaces because SQL is well-known and powerful.

SQL on Big Data Tools:

Tool            | SQL Interface     | Underlying System
Apache Hive     | HiveQL (SQL-like) | Hadoop / HDFS
Spark SQL       | Standard SQL      | Apache Spark
Presto          | Standard SQL      | Multiple data sources
Apache Impala   | Standard SQL      | HDFS / HBase
Google BigQuery | Standard SQL      | Google Cloud
Amazon Athena   | Standard SQL      | Amazon S3

2.3 Query Execution in Distributed Systems

When a query runs on a Big Data cluster, it is broken into parallel tasks:

User Query:
  SELECT city, COUNT(*) FROM orders GROUP BY city

                    ▼

        Query Planner / Optimizer
                    │
         Split into parallel tasks
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
    Node 1       Node 2       Node 3
  (Process      (Process     (Process
  Shard 1)      Shard 2)     Shard 3)
       │            │            │
       └────────────┼────────────┘
                    ▼
              Merge & Aggregate
                    ▼
              Final Result
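
The same flow can be mimicked in plain Python: each "node" computes a partial COUNT(*) per city over its own shard, and a final step merges the partial results. The shard contents and the function names process_shard and merge_partials are assumptions for illustration; a real engine would run the per-shard step on separate machines.

```python
from collections import Counter

def process_shard(orders):
    """Runs on one node: partial COUNT(*) per city for that shard only."""
    return Counter(order["city"] for order in orders)

def merge_partials(partials):
    """Merge & aggregate: add up the partial counts from every node."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

# Three hypothetical shards of the orders table
shards = [
    [{"city": "Jaipur"}, {"city": "Delhi"}],
    [{"city": "Jaipur"}],
    [{"city": "Mumbai"}, {"city": "Delhi"}, {"city": "Jaipur"}],
]

partials = [process_shard(shard) for shard in shards]
result = merge_partials(partials)
print(result)
```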

2.4 Indexing in Big Data

Indexing speeds up data retrieval by creating a fast lookup structure.

Index Type        | Description             | Example
Primary Index     | On primary key          | User ID lookup
Secondary Index   | On non-key columns      | Search by city
Inverted Index    | Word → document mapping | Full-text search (Elasticsearch)
Bitmap Index      | Bit vector per value    | Gender, status columns
Partitioned Index | Index per partition     | Date-partitioned tables
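
As one concrete case, an inverted index can be sketched with a dictionary that maps each word to the set of documents containing it. This is a minimal in-memory illustration with made-up documents; it is not how Elasticsearch stores its index.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, *words):
    """Return the sorted ids of documents that contain every given word."""
    postings = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*postings)) if postings else []

# Hypothetical documents
docs = {
    1: "big data processing",
    2: "stream processing with Kafka",
    3: "big data storage",
}
index = build_inverted_index(docs)

print(search(index, "processing"))
print(search(index, "big", "data"))
```

A lookup touches only the posting sets for the query words, so its cost depends on how many documents match rather than on the total number of documents.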

2.5 NoSQL Query Patterns

Different NoSQL systems have different query styles:

System    | Query Style           | Example
MongoDB   | Document query        | db.users.find({city:"Jaipur"})
Cassandra | CQL (SQL-like)        | SELECT * FROM users WHERE id=001
HBase     | Row key + column scan | get 'users', 'row001'
Redis     | GET/SET commands      | GET user:001
Neo4j     | Cypher graph query    | MATCH (u:User)-[:FRIEND]->(f) RETURN f

Section 3: Information Integration

3.1 What is Information Integration?

Definition:

Information Integration is the process of combining data from multiple, heterogeneous sources into a unified view that can be queried and analyzed as if it were a single dataset.

Why needed?

  • Organizations have data in many places: RDBMS, cloud, files, APIs.
  • Analysts need a single consistent view.
  • Without integration: siloed data, inconsistent reports, duplicate effort.

3.2 Challenges of Information Integration

Challenge            | Description
Schema heterogeneity | Different sources use different field names / types
Semantic conflicts   | "customer" in one system = "client" in another
Data format mismatch | CSV in one, JSON in another
Inconsistency        | Same customer has different names in two systems
Scale                | Integrating petabytes across dozens of sources
Latency              | Keeping integrated view up-to-date in real time

3.3 Approaches to Information Integration

1. Data Warehousing (ETL-based)

  • Extract from all sources, transform to common schema, load into warehouse.
  • Centralized integrated view.
  • Batch, not real-time.

2. Data Virtualization

  • No data is physically moved.
  • A virtual layer translates queries to each source on the fly.
  • Real-time but complex.
User Query → Virtual Layer → Query Source A + Source B → Merge → Result

3. Data Federation

  • Similar to virtualization: a federated engine queries multiple sources simultaneously.
  • Tools: Presto, Apache Drill.

4. Data Lake Integration

  • Dump all raw data into a data lake.
  • Use schema-on-read to integrate at query time.

3.4 Master Data Management (MDM)

MDM ensures there is a single, authoritative, consistent version of key business data (customers, products, locations).

Multiple systems have "customer" data:
  CRM:     ID=C01, Name=Deepak Modi
  Billing: ID=B99, Name=D. Modi
  Website: ID=W445, Name=deepak.modi

MDM resolves these to:
  Master: ID=M001, Name=Deepak Modi (canonical)

Section 4: Big Data Processing Pipelines

PYQ: What do you mean by Big Data Processing pipeline? (2024, 2.5 marks)
PYQ: Describe the key components of Big Data processing pipelines. How do these components work together to ingest, process, and analyze large volumes of data? (2024, 15 marks)

4.1 What is a Processing Pipeline?

Definition:

A Big Data Processing Pipeline is a series of data processing steps chained together, where the output of one step is the input of the next, forming an automated data flow from raw ingestion to final output.

Ingest → Validate → Clean → Transform → Analyze → Store → Serve
  │          │         │         │           │        │       │
  ▼          ▼         ▼         ▼           ▼        ▼       ▼
Kafka     Schema    Remove    Aggregate   ML Model  HDFS  Dashboard
          check     nulls     by region             /S3

4.2 Pipeline Stages

Stage          | Activity                      | Tools
Ingestion      | Collect data from sources     | Kafka, Flume, Sqoop
Validation     | Check data quality and schema | Great Expectations
Cleaning       | Remove bad/missing data       | Spark, Pandas
Transformation | Reshape, enrich, aggregate    | Spark, dbt, Hive
Analysis       | Apply models or queries       | Spark MLlib, SQL
Storage        | Save results                  | HDFS, S3, RDBMS
Serving        | Expose to end users           | APIs, dashboards
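
The idea that each stage's output feeds the next can be sketched as a list of plain Python functions applied in order. Only three of the stages are shown, and the record layout (region, amount) and function names are assumptions for the example rather than part of any particular tool.

```python
def validate(records):
    """Validation: keep only records that have the expected fields."""
    return [r for r in records if all(f in r for f in ("region", "amount"))]

def clean(records):
    """Cleaning: drop records whose amount is missing."""
    return [r for r in records if r["amount"] is not None]

def transform(records):
    """Transformation: aggregate the amount per region."""
    totals = {}
    for r in records:
        totals[r["region"]] = totals.get(r["region"], 0) + r["amount"]
    return totals

def run_pipeline(records, stages):
    """Feed the output of each stage into the next one."""
    data = records
    for stage in stages:
        data = stage(data)
    return data

# Hypothetical raw input containing one incomplete and one null record
raw = [
    {"region": "North", "amount": 100},
    {"region": "North", "amount": None},
    {"region": "South", "amount": 50},
    {"amount": 70},
    {"region": "North", "amount": 25},
]
result = run_pipeline(raw, [validate, clean, transform])
print(result)
```

Because every stage has the same shape (data in, data out), a stage can be swapped or tested on its own, which is the modularity principle in practice.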

4.3 Pipeline Patterns

1. ETL Pipeline

Source → Extract → Transform → Load → Warehouse

2. Streaming Pipeline

Events → Kafka → Flink/Spark Streaming → Real-time store → Dashboard

3. ML Pipeline

Raw data → Clean → Feature engineering → Train model → Evaluate → Deploy

4. Reverse ETL Pipeline

Data Warehouse → Extract → Load into CRM / Marketing Tool
(Insights pushed back to operational systems)

4.4 Pipeline Design Principles

  1. Idempotency: running the pipeline multiple times gives the same result (no duplicates).
  2. Fault Tolerance: pipeline recovers from failures without data loss.
  3. Modularity: each stage is independent and replaceable.
  4. Monitoring: every stage logs metrics and alerts on failure.
  5. Scalability: pipeline scales with data volume automatically.
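
Idempotency is the principle that most often needs a code-level decision. One common way to get it, shown here as a sketch with a dictionary standing in for the target table, is to write records by key (an upsert) instead of appending them. The order_id key and the load function are assumptions for the example.

```python
def load(store, records):
    """Idempotent load: upsert by key, so a re-run does not duplicate rows."""
    for record in records:
        store[record["order_id"]] = record
    return store

batch = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 50},
]

store = {}
load(store, batch)
load(store, batch)  # simulated retry after a failure

print(len(store))
```

An append-only load would have produced four rows after the retry; keyed writes keep the result at two.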

Section 5: Analytical Operations

5.1 What are Analytical Operations?

Analytical operations are computations performed on data to derive insights, patterns, and knowledge.


5.2 Types of Analytical Operations

1. Descriptive Analytics

  • Describes what happened in the past.
  • Summary statistics: mean, median, mode, min, max.
  • Visualizations: charts, dashboards.
"Total sales in Q1 2026 were ₹4.5 crore."

2. Diagnostic Analytics

  • Explains why something happened.
  • Drill-down analysis, correlation analysis.
"Sales dropped in March because of a supply chain disruption."

3. Predictive Analytics

  • Predicts what will happen in the future.
  • Uses ML models, regression, time-series forecasting.
"Sales in Q2 2026 are predicted to grow by 12%."

4. Prescriptive Analytics

  • Recommends what action to take.
  • Optimization algorithms, simulation.
"To maximize sales, increase inventory in Zone A by 20%."

Analytics Maturity Ladder:

   Prescriptive  ← What should we do?       (Most complex, most value)
       ▲
   Predictive    ← What will happen?
       ▲
   Diagnostic    ← Why did it happen?
       ▲
   Descriptive   ← What happened?            (Simplest, least value)

5.3 OLAP Operations

OLAP (Online Analytical Processing) is a set of operations for multidimensional data analysis.

OLAP Cube:

         Time
          │
          │  ──── Sales Data ────
          │
          └─────────────────────► Region
         /
        /
     Product

Key OLAP Operations:

Operation  | Description                    | Example
Roll-up    | Aggregate to higher level      | Day → Month → Year
Drill-down | Go to finer detail             | Year → Month → Day
Slice      | Fix one dimension, view others | Sales for 2025 only
Dice       | Fix multiple dimensions        | Sales in 2025, in North region
Pivot      | Rotate the cube (change view)  | Show products as rows, regions as columns
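
Roll-up, slice, and dice can be tried out on a tiny fact table held as a list of dictionaries. The rows, the roll_up and select helpers, and the use of date-string prefixes for month and year are all assumptions made to keep the sketch short; an OLAP engine would work on pre-aggregated cubes.

```python
from collections import defaultdict

# Hypothetical fact table: one row per sale
sales = [
    {"date": "2025-01-15", "region": "North", "product": "A", "amount": 100},
    {"date": "2025-01-20", "region": "South", "product": "A", "amount": 40},
    {"date": "2025-02-03", "region": "North", "product": "B", "amount": 60},
    {"date": "2026-01-10", "region": "North", "product": "A", "amount": 80},
]

def roll_up(rows, level):
    """Aggregate amounts to 'month' (YYYY-MM) or 'year' (YYYY)."""
    width = {"month": 7, "year": 4}[level]
    totals = defaultdict(int)
    for row in rows:
        totals[row["date"][:width]] += row["amount"]
    return dict(totals)

def select(rows, **fixed):
    """Slice (one fixed dimension) or dice (several fixed dimensions)."""
    return [r for r in rows if all(r[dim] == val for dim, val in fixed.items())]

by_month = roll_up(sales, "month")                      # day -> month
by_year = roll_up(sales, "year")                        # month -> year
north = select(sales, region="North")                   # slice
north_a = select(sales, region="North", product="A")    # dice

print(by_month, by_year, len(north), len(north_a))
```

Drill-down is the reverse direction: starting from by_year and going back to by_month for a year of interest.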

Section 6: Aggregation Operations

6.1 What is Aggregation?

Definition:

Aggregation is the process of combining multiple data values into a single summary value. It reduces a large dataset to meaningful statistics.


6.2 Common Aggregation Functions

Function | Description                      | Example
COUNT    | Number of records                | COUNT(*) = 1000 rows
SUM      | Total of values                  | SUM(sales) = ₹50,000
AVG      | Average value                    | AVG(age) = 25.3
MIN      | Minimum value                    | MIN(price) = ₹99
MAX      | Maximum value                    | MAX(price) = ₹9,999
STDDEV   | Standard deviation               | Spread of salaries
MEDIAN   | Middle value                     | Median income
GROUP BY | Clause (not a function) that applies the aggregate per group | Sales per region

6.3 Aggregation in Big Data Systems

MapReduce Aggregation

MAP:     Emit (region, sales_amount) for each transaction
REDUCE:  For each region, SUM all sales_amounts
OUTPUT:  (North, 50000), (South, 35000), (East, 42000)
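
The MAP/REDUCE pattern above works directly for SUM and COUNT, but AVG needs a little care: averaging the per-node averages gives the wrong answer when nodes hold different numbers of records. A usual approach, sketched here with made-up numbers and helper names, is to ship (sum, count) pairs and divide only at the end.

```python
def map_partial(amounts):
    """Per-node step: emit a (sum, count) pair instead of a local average."""
    return (sum(amounts), len(amounts))

def reduce_avg(partials):
    """Final step: combine the pairs, then divide once."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

node_a = [100, 200, 300]   # local average 200
node_b = [50]              # local average 50

correct = reduce_avg([map_partial(node_a), map_partial(node_b)])
naive = (sum(node_a) / len(node_a) + sum(node_b) / len(node_b)) / 2

print(correct, naive)  # 162.5 125.0
```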

Spark Aggregation

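from pyspark.sql import functions as F

# df is assumed to be an existing DataFrame with region, sales, and order_id columns
summary = df.groupBy("region").agg(
    F.sum("sales").alias("total_sales"),
    F.count("order_id").alias("order_count"),
    F.avg("sales").alias("avg_sales")
)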

Hive SQL Aggregation

SELECT region,
       SUM(sales)   AS total_sales,
       COUNT(*)     AS num_orders,
       AVG(sales)   AS avg_order
FROM sales_table
GROUP BY region
ORDER BY total_sales DESC;

6.4 Aggregation Challenges in Big Data

Challenge               | Problem
Data skew               | One key has far more records than others → slow reducer
Memory overflow         | Aggregation result too large for RAM
Approximate aggregation | Exact count of billions is slow
Streaming aggregation   | Continuous data, can't wait for all

Section 7: High-Level Operations

7.1 What are High-Level Operations?

High-level operations in Big Data refer to complex, multi-step data transformations and analyses that go beyond simple CRUD or aggregation. They are often expressed in declarative languages and executed as optimized distributed jobs.


7.2 Key High-Level Operations

1. Join Operations

Combining data from two or more datasets.

Join Type       | Description
Inner Join      | Only matching records from both datasets
Left Join       | All records from left + matching from right
Right Join      | All records from right + matching from left
Full Outer Join | All records from both, nulls where no match
Cross Join      | Every combination (Cartesian product)

Distributed Join Strategies:

  • Broadcast Join: small dataset sent to all nodes (avoids shuffle).
  • Sort-Merge Join: both datasets sorted, then merged (large datasets).
  • Hash Join: one dataset hashed, other probes the hash table.
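
A broadcast join is essentially a hash join in which the small table's hash map is copied to every node. The sketch below simulates that with plain Python lists as partitions; the table contents and the function name are assumptions, and it assumes the join key is unique in the small table.

```python
def broadcast_hash_join(large_partitions, small_table, key):
    """Inner join: build a hash map from the small table, probe it per partition."""
    # Build phase: this lookup is what would be broadcast to every node.
    lookup = {row[key]: row for row in small_table}

    joined = []
    for partition in large_partitions:      # each partition stays on its own node
        for row in partition:               # probe phase, no shuffle of the large side
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**row, **match})
    return joined

# Hypothetical data: a small dimension table and a partitioned fact table
regions = [
    {"region_id": 1, "region": "North"},
    {"region_id": 2, "region": "South"},
]
orders = [
    [{"order_id": 10, "region_id": 1}, {"order_id": 11, "region_id": 3}],
    [{"order_id": 12, "region_id": 2}],
]

result = broadcast_hash_join(orders, regions, "region_id")
print(result)
```

Order 11 has no matching region and is dropped, as expected for an inner join.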

2. Sorting and Ordering

  • Distributed sorting requires a shuffle, i.e., moving data between nodes.
  • Total Order Sorting: globally sorted across all nodes (expensive).
  • Partial Sort: sorted within each partition (cheaper).

3. Sampling

  • Taking a representative subset of data for faster analysis or testing.
Types:
  Random Sampling      → Pick N% of rows randomly
  Stratified Sampling  → Maintain class proportions
  Reservoir Sampling   → Sample from a stream of unknown size
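
Reservoir sampling is the least obvious of the three, so here is a minimal sketch of the classic single-pass version (often called Algorithm R). The function name and the seed parameter are choices made for this example.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # inclusive on both ends
            if j < k:
                reservoir[j] = item         # replace with probability k / (i + 1)
    return reservoir

sample = reservoir_sample(iter(range(10_000)), k=5, seed=42)
print(sample)
```

The stream is read once and only k items are held in memory, which is why the technique suits unbounded or very large inputs.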

4. Filtering and Projection

  • Filtering (WHERE): keep only rows matching a condition.
  • Projection (SELECT columns): keep only needed columns.

In columnar formats (Parquet), projection pushdown means only the required columns are read from disk, which is a huge performance win.


5. Deduplication

  • Remove duplicate records from a large dataset.
  • Challenging at scale β€” cannot load everything into memory.
Techniques:
  Exact dedup     → Hash each row, remove identical hashes
  Fuzzy dedup     → Use similarity (e.g., "Deepak" ≈ "Deepk")
  Bloom filters   → Probabilistic set membership check
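
Exact deduplication by hashing can be sketched as follows. The row layout and helper names are assumptions; on a cluster the set of seen hashes would itself be partitioned by hash value so that no single node has to hold all of it.

```python
import hashlib

def row_hash(row):
    """Hash a canonical form of the row so that field order does not matter."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def exact_dedup(rows):
    """Yield each distinct row once, keeping the first occurrence."""
    seen = set()
    for row in rows:
        digest = row_hash(row)
        if digest not in seen:
            seen.add(digest)
            yield row

rows = [
    {"id": 1, "name": "Deepak"},
    {"name": "Deepak", "id": 1},   # same content, different field order
    {"id": 2, "name": "Deepk"},    # near-duplicate: exact dedup keeps it
]
unique = list(exact_dedup(rows))
print(unique)
```

The third row survives because exact dedup only removes identical rows; catching "Deepk" would need the fuzzy approach.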

6. Windowing / Sliding Window Operations

Used in stream processing to analyze data within a time window.

Tumbling Window:   Fixed, non-overlapping windows
  [0–10s] [10–20s] [20–30s]

Sliding Window:    Overlapping windows
  [0–10s] [5–15s] [10–20s]

Session Window:    Based on activity gaps
  (group events close in time, split by inactivity)

Example:

Count fraud alerts in the last 5 minutes, updated every 1 minute.
→ Sliding window: size=5min, slide=1min
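
Tumbling and sliding windows can be compared on a short list of event timestamps (in seconds). This is a batch-style sketch over a finished list, using made-up timestamps and function names; a stream processor would maintain the same counts incrementally as events arrive.

```python
def tumbling_counts(timestamps, size):
    """Count events per fixed, non-overlapping window, keyed by window start."""
    counts = {}
    for t in timestamps:
        start = (t // size) * size
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

def sliding_counts(timestamps, size, slide, end):
    """Count events in each window [start, start + size), stepping by slide."""
    result = {}
    start = 0
    while start + size <= end:
        result[(start, start + size)] = sum(
            start <= t < start + size for t in timestamps
        )
        start += slide
    return result

events = [1, 4, 7, 12, 13, 18, 25]   # hypothetical event times in seconds

tumbling = tumbling_counts(events, size=10)
sliding = sliding_counts(events, size=10, slide=5, end=30)
print(tumbling)
print(sliding)
```

Each event lands in exactly one tumbling window but in size / slide sliding windows (two here), which is why sliding windows cost more to maintain.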

Section 8: Big Data Workflow Management

8.1 What is Workflow Management?

Definition:

Big Data Workflow Management is the orchestration and scheduling of complex, multi-step data processing jobs to ensure they run in the correct order, handle failures gracefully, and complete reliably.

A workflow is usually modeled as a DAG (Directed Acyclic Graph), which defines the sequence and dependencies of tasks.

DAG Example:
  Extract_Data → Clean_Data → Transform → Train_Model → Evaluate → Deploy
  (edges only point forward; no cycles allowed)
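
What an orchestrator does with such a DAG can be sketched with Python's standard-library graphlib module (Python 3.9+): it derives a valid run order from the dependencies and rejects graphs that contain a cycle. The dictionary below encodes the example DAG; the has_cycle helper is a name chosen for this sketch.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on
dag = {
    "Clean_Data": {"Extract_Data"},
    "Transform": {"Clean_Data"},
    "Train_Model": {"Transform"},
    "Evaluate": {"Train_Model"},
    "Deploy": {"Evaluate"},
}

# A run order in which every task comes after its dependencies
order = list(TopologicalSorter(dag).static_order())
print(order)

def has_cycle(graph):
    """Return True if the dependency graph is not a valid DAG."""
    try:
        TopologicalSorter(graph).prepare()
        return False
    except CycleError:
        return True

print(has_cycle({"A": {"B"}, "B": {"A"}}))
```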

8.2 Why Workflow Management is Needed

  • Big Data pipelines have many dependent steps: step 3 can't run until step 2 finishes.
  • Jobs run on a schedule (daily, hourly).
  • Failures need to be detected and retried automatically.
  • Teams need visibility into what is running, what failed, and why.

8.3 Apache Airflow

Apache Airflow is the most popular open-source workflow orchestration tool for data pipelines.

Key Concepts:

Concept   | Description
DAG       | Directed Acyclic Graph: the workflow definition
Task      | One unit of work (e.g., run a Spark job)
Operator  | Template for a task type (PythonOperator, BashOperator, SparkSubmitOperator)
Scheduler | Triggers DAGs on schedule
Executor  | Runs the actual tasks (LocalExecutor, CeleryExecutor)
Web UI    | Dashboard to monitor DAGs and task status

Example DAG (simplified):

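from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

# run_extract / run_load are your own functions, defined elsewhere; newer Airflow releases name the parameter `schedule`
with DAG("daily_etl", start_date=datetime(2026, 1, 1), schedule_interval="@daily") as dag:
    extract = PythonOperator(task_id="extract", python_callable=run_extract)
    transform = SparkSubmitOperator(task_id="transform", application="transform.py")
    load = PythonOperator(task_id="load", python_callable=run_load)

    extract >> transform >> load   # define order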

8.4 Apache Oozie

Apache Oozie is a workflow scheduler specifically for Hadoop jobs.

  • Defines workflows in XML.
  • Supports MapReduce, Hive, Pig, Sqoop jobs.
  • Coordinator jobs for time-based scheduling.

8.5 Other Workflow Tools

Tool               | Type                  | Key Feature
Apache Airflow     | General orchestration | Python DAGs, rich UI
Apache Oozie       | Hadoop-specific       | XML workflows, Hadoop-native
Apache NiFi        | Data flow management  | Drag-and-drop pipeline builder
Prefect            | Modern orchestration  | Cloud-native, Python
Dagster            | Data orchestration    | Strong typing, testing
Luigi              | Batch orchestration   | Simple Python tasks
AWS Step Functions | Cloud workflow        | Serverless state machine
dbt                | Transformation        | SQL-based transform DAGs

8.6 Monitoring and Observability

A workflow system must also support monitoring:

Metric           | What to track
Job success rate | % of pipelines completing without failure
Latency          | How long each stage takes
Data volume      | Records processed per job
Queue depth      | Kafka lag: how far behind consumers are
Resource usage   | CPU, memory, disk on cluster nodes
Data quality     | % of records passing validation

Tools: Grafana, Prometheus, Apache Ambari, AWS CloudWatch, Datadog.


Section 9: Apache Hive: Data Warehouse on Hadoop

PYQ: What is Hive? Explain Hive features, Hive integration and workflow and architecture in detail. (2023, 15 marks)
PYQ: Write short note on Hive Query Language (HQL). (2023, 15 marks)

9.1 What is Hive?

  • Apache Hive is a data warehouse infrastructure tool to process structured data in Hadoop.
  • It resides on top of Hadoop to summarize Big Data and make querying and analyzing easy.
  • Originally developed at Facebook and released as open source; it is now developed under the Apache Software Foundation as Apache Hive.
  • Used by companies like Amazon (in Amazon Elastic MapReduce).

9.2 Features of Hive

  1. Indexes: provides indexes (including bitmap indexes since v0.10) to accelerate queries. Index support was removed in Hive 3.0, so this applies to older releases.
  2. Metadata in RDBMS: stores metadata in a relational database, which reduces the time required for semantic checks during query execution.
  3. Built-in and User-Defined Functions (UDFs): ships with built-in functions for strings, dates, and other data-mining operations; users can add their own UDFs for cases the built-ins do not cover.
  4. Algorithms for compressed data: operates directly on data compressed with DEFLATE, BWT, Snappy, etc.
  5. Schema in DB, data in HDFS: stores table schemas in a database and keeps the actual data in HDFS.
  6. OLAP-oriented: built primarily for Online Analytical Processing (OLAP) workloads.
  7. Query language: provides Hive Query Language (HQL / HiveQL), a SQL-like language for querying.

9.3 Hive Integration in Big Data

"Hive integration" refers to using Apache Hive, a data warehouse system built on Hadoop, to query and analyze large datasets stored in distributed file systems.

  • Lets users easily access and manipulate massive amounts of data using familiar SQL-like syntax (HiveQL) instead of writing complex MapReduce programs.
  • Critical for Big Data analysis inside the Hadoop ecosystem because it lowers the barrier between data analysts (who know SQL) and the underlying distributed engine.

9.4 Workflow of Hive

  • Hive operates in two modes:
    • Interactive mode → commands are typed directly into the Hive shell.
    • Non-interactive mode → a script is executed from the console without an interactive session.
  • Data is divided into partitions, which are in turn split into buckets.
  • Execution plans are produced based on aggregation patterns and data skew.
  • Handles large-scale data and offers several user interfaces (Web UI, CLI).
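
The partition-then-bucket layout can be sketched in a few lines. This is an illustration under assumptions: the table path and column names are hypothetical, and a plain modulo on an integer key stands in for Hive's own hash function.

```python
def partition_dir(table_path, partition_col, value):
    """Each partition value gets its own directory under the table path."""
    return f"{table_path}/{partition_col}={value}"

def bucket_for(key, num_buckets):
    """Rows are assigned to a bucket by hashing the bucketing column.

    Assumption: modulo on an integer key stands in for Hive's hash.
    """
    return key % num_buckets

rows = [
    {"id": 101, "city": "Delhi"},
    {"id": 102, "city": "Pune"},
    {"id": 105, "city": "Delhi"},
]

for row in rows:
    directory = partition_dir("/warehouse/students", "city", row["city"])
    print(directory, "bucket", bucket_for(row["id"], 4))
```

A query that filters on `city` only needs to read one directory, and a join or sample on `id` can work bucket by bucket.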

9.5 Architecture of Hive

┌───────────────────────────────────────────────────────────┐
│   USER INTERFACES                                         │
│   Web UI  |  Hive Command Line  |  HD Insight (Windows)   │
└─────────────────────────────┬─────────────────────────────┘
                              ▼
              ┌───────────────────────────────┐
              │ Meta Store ◄──► HiveQL Process│
              │                 Engine        │
              └───────────────┬───────────────┘
                              ▼
                     ┌─────────────────┐
                     │ Execution Engine│
                     └────────┬────────┘
                              ▼
                       ┌─────────────┐
                       │  MAP REDUCE │
                       └──────┬──────┘
                              ▼
                   ┌─────────────────────┐
                   │ HDFS  /  HBASE      │
                   │ (Data Storage)      │
                   └─────────────────────┘

Architectural Units:

Unit Name | Operation
User Interface | Hive Web UI, Hive CLI, and Hive HD Insight; they handle the interaction between the user and HDFS
Meta Store | Stores the schema / metadata of tables and databases, columns and their data types, and the HDFS mapping
HiveQL Process Engine | Performs SQL-like querying on the schema info in the Metastore; replaces the traditional approach of writing MapReduce programs in Java
Execution Engine | The link between the HiveQL Process Engine and MapReduce; processes the query and produces the same results a hand-written MapReduce job would
HDFS or HBase | Underlying data storage used by Hive

9.6 Data Flow Steps - Hive Query Execution (9 Steps)

 UI → Driver → Compiler → Meta Store → Compiler → Driver
                                                    │
                                                    ▼
                          Execution Engine ──► MapReduce/HDFS
                                                    │
                                                    ▼
                                                 Results
  1. Execute Query: the UI (CLI / Web UI / JDBC) sends the query to the driver.
  2. Get Plan: the driver asks the compiler for an execution plan; the compiler builds a task DAG (Directed Acyclic Graph) of stages.
  3. Get Metadata: the compiler sends a metadata request to the Meta Store.
  4. Send Metadata: the Meta Store returns the metadata to the compiler.
  5. Send Plan: after compilation and optimization, the compiler sends the plan back to the driver.
  6. Execute Plan: the driver hands the plan to the execution engine.
  7. Execute Job: the execution engine runs the plan as MapReduce jobs over data in HDFS, with the job tracker assigning tasks to the data nodes; here the engine acts as the connector between Hive and Hadoop.
  8. Fetch Results: the execution engine collects the results from the data nodes.
  9. Send Results: results flow back from the execution engine to the driver, and from the driver to the UI.
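
The hand-offs between components can be traced with a toy simulation. Nothing below is Hive's real API: `METASTORE`, `compile_query`, `execute`, and `driver` are hypothetical stand-ins that only show who talks to whom and in what order.

```python
# Toy stand-ins for the Hive components; none of this is Hive's real API.
METASTORE = {
    "students": {"columns": ["id", "name", "city"], "location": "/warehouse/students"},
}

def compile_query(table):
    """Compiler: fetch table metadata from the Metastore and build a plan of stages."""
    meta = METASTORE[table]                               # metadata request and reply
    return [f"scan {meta['location']}", "map", "reduce"]  # plan sent back to the driver

def execute(plan, trace):
    """Execution engine: run each stage of the plan in order."""
    for stage in plan:
        trace.append(f"engine: run {stage}")
    return "result rows"

def driver(table):
    """Driver: receive the query, obtain a plan, execute it, return the results."""
    trace = [f"driver: received query on {table}"]
    plan = compile_query(table)
    trace.append(f"driver: got plan with {len(plan)} stages")
    result = execute(plan, trace)
    trace.append("driver: results sent to UI")
    return result, trace

result, trace = driver("students")
print("\n".join(trace))
```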

9.7 Hive Query Language (HQL)

HiveQL is a SQL-like query language used by Hive to process and analyze structured data; the data itself lives in HDFS, while the table definitions are kept in the Metastore.

  • SELECT statement: retrieves data from a table.
  • WHERE clause: filters rows using a condition, as in SQL; the condition is an expression built from built-in operators and functions, and only rows that satisfy it are returned.

Syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[LIMIT number];
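
The clauses are evaluated in a fixed logical order: WHERE filters rows first, GROUP BY then forms groups, HAVING filters those groups, and LIMIT trims the output. A plain-Python sketch of that order (the `students` rows and the `run_query` helper are made up for illustration; Hive itself runs this as distributed jobs):

```python
from collections import defaultdict

def run_query(rows, min_age, min_count, limit):
    """Mimic: SELECT city, COUNT(*) FROM rows WHERE age >= min_age
    GROUP BY city HAVING COUNT(*) >= min_count LIMIT limit."""
    filtered = [r for r in rows if r["age"] >= min_age]   # WHERE: row-level filter
    counts = defaultdict(int)
    for r in filtered:                                    # GROUP BY city + COUNT(*)
        counts[r["city"]] += 1
    kept = [(city, n) for city, n in counts.items() if n >= min_count]  # HAVING
    return kept[:limit]                                   # LIMIT

students = [
    {"name": "Asha", "age": 21, "city": "Delhi"},
    {"name": "Ravi", "age": 17, "city": "Delhi"},
    {"name": "Meera", "age": 22, "city": "Delhi"},
    {"name": "Kabir", "age": 23, "city": "Pune"},
]
print(run_query(students, min_age=18, min_count=2, limit=10))   # [('Delhi', 2)]
```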

Common HQL Clauses:

Clause | Purpose
SELECT | Pick columns to project
FROM | Source table
WHERE | Row-level filter
GROUP BY | Group rows for aggregation
HAVING | Filter on aggregated groups
CLUSTER BY | Distribute + sort by same column
DISTRIBUTE BY | Route rows to reducers
SORT BY | Sort within each reducer
LIMIT | Restrict number of rows returned
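
DISTRIBUTE BY, SORT BY, and CLUSTER BY are the clauses with no direct counterpart in standard SQL. The sketch below imitates them on a list of rows. It is a simplification: integer keys and a plain modulo stand in for Hive's hash partitioner, and the `sales` rows are invented.

```python
def distribute_and_sort(rows, distribute_key, sort_key, num_reducers):
    """Mimic DISTRIBUTE BY distribute_key SORT BY sort_key.

    Assumption: modulo on an integer key stands in for Hive's hash partitioner.
    """
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[row[distribute_key] % num_reducers].append(row)   # DISTRIBUTE BY
    # SORT BY orders rows inside each reducer only; there is no global order.
    return [sorted(part, key=lambda r: r[sort_key]) for part in reducers]

sales = [
    {"store": 1, "amount": 50},
    {"store": 2, "amount": 30},
    {"store": 1, "amount": 20},
    {"store": 2, "amount": 10},
]
for i, part in enumerate(distribute_and_sort(sales, "store", "amount", 2)):
    print("reducer", i, part)
```

CLUSTER BY col corresponds to calling the helper with the same column for both keys.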

Section 10: Apache Pig - Data Flow Language

PYQ: Write short note on Pig architecture and commands. (2023, 15 marks)

10.1 What is Pig?

  • Apache Pig is a high-level data processing platform used to analyze data in Hadoop.
  • Its language is Pig Latin, which provides rich data types and operators for various operations.
  • A programmer writes a Pig script in Pig Latin and runs it through one of the execution mechanisms: the Grunt shell (interactive), a script file (batch), or embedded mode (called from a host program such as Java, which is also how UDFs are used).
  • Scripts go through a series of transformations applied by the Pig framework to produce the desired output.
  • Internally, Apache Pig converts these scripts into a series of MapReduce jobs that run on Hadoop.

10.2 Apache Pig Architecture

               Pig Latin Scripts
                       │
                       ▼
   ┌───────────── Apache Pig ──────────────┐
   │   Grunt Shell   |   Pig Server        │
   │                                       │
   │            ┌─────────────┐            │
   │            │   Parser    │            │
   │            └──────┬──────┘            │
   │                   ▼                   │
   │            ┌─────────────┐            │
   │            │  Optimizer  │            │
   │            └──────┬──────┘            │
   │                   ▼                   │
   │            ┌─────────────┐            │
   │            │  Compiler   │            │
   │            └──────┬──────┘            │
   │                   ▼                   │
   │        ┌─────────────────────┐        │
   │        │  Execution Engine   │        │
   │        └──────────┬──────────┘        │
   └───────────────────┼───────────────────┘
                       ▼
                   MapReduce
                       │
                       ▼
                    Hadoop
                       │
                       ▼
                     HDFS

Components:

  1. Parser: handles Pig scripts; performs syntax checking, type checking, and other validations. Its output is a DAG (Directed Acyclic Graph) representing the Pig Latin statements and logical operators (nodes = logical operators, edges = data flows).
  2. Optimizer: receives the logical plan (DAG) and carries out logical optimizations such as projection and pushdown, which reduce the amount of data processed downstream.
  3. Compiler: compiles the optimized logical plan into a series of MapReduce jobs.
  4. Execution Engine: submits the MapReduce jobs to Hadoop in sorted order; the jobs execute on Hadoop to produce the desired results.
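
The logical plan produced by the parser can be pictured as a small graph. In this sketch the triples, the alias names, and `build_logical_plan` are all hypothetical; Pig's real plan classes are considerably richer.

```python
def build_logical_plan(statements):
    """Turn (alias, operator, inputs) triples into a DAG.

    Nodes are logical operators keyed by alias; edges are data flows.
    """
    nodes = {alias: op for alias, op, _ in statements}
    edges = [(src, alias) for alias, _, inputs in statements for src in inputs]
    return nodes, edges

# Three statements: load a relation, filter it, group the filtered rows.
script = [
    ("students", "LOAD", []),
    ("adults", "FILTER", ["students"]),
    ("by_city", "GROUP", ["adults"]),
]
nodes, edges = build_logical_plan(script)
print(nodes)   # {'students': 'LOAD', 'adults': 'FILTER', 'by_city': 'GROUP'}
print(edges)   # [('students', 'adults'), ('adults', 'by_city')]
```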

10.3 Basic Pig Commands

  1. fs -ls: lists files in HDFS (with no path given, the current user's HDFS home directory).
    grunt> fs -ls

  2. clear: clears the interactive Grunt shell screen.
    grunt> clear

  3. history: shows the statements executed so far in the session.
    grunt> history

  4. Reading data: load data into a relation:
    grunt> college_students = LOAD 'hdfs://localhost:9000/pig_data/college_data.txt'
           USING PigStorage(',')
           AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

    • PigStorage() loads and stores data as delimited text files; its argument is the field delimiter.
  5. Storing data: write a relation back to HDFS:
    grunt> STORE college_students INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

  6. Dump: runs the statements and displays the result on the screen; useful in debugging.
    grunt> DUMP college_students;

  7. Describe: shows a relation's schema.
    grunt> DESCRIBE college_students;

  8. Explain: shows the logical, physical, and MapReduce execution plans.
    grunt> EXPLAIN college_students;

  9. Illustrate: shows step-by-step execution of the statements on a small sample of the data.
    grunt> ILLUSTRATE college_students;
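
What LOAD with PigStorage(',') and an AS schema does to each line can be approximated in Python. The sample rows and the `load_with_schema` helper are invented for illustration, and Python types stand in for Pig's int and chararray.

```python
def load_with_schema(lines, delimiter, schema):
    """Split each text line on the delimiter and cast fields per the schema.

    `schema` is a list of (field_name, python_type) pairs, standing in for Pig's AS clause.
    """
    relation = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        relation.append(tuple(cast(value) for (_, cast), value in zip(schema, fields)))
    return relation

schema = [("id", int), ("firstname", str), ("lastname", str), ("phone", str), ("city", str)]
sample = ["1,Asha,Verma,9000000001,Delhi", "2,Ravi,Nair,9000000002,Chennai"]

college_students = load_with_schema(sample, ",", schema)
print(college_students[0])   # (1, 'Asha', 'Verma', '9000000001', 'Delhi')
```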
    

10.4 Intermediate Pig Commands

  1. Group: groups tuples that share the same key.
    grunt> group_data = GROUP college_students BY firstname;

  2. Cogroup: like Group, but works on more than one relation at a time.
  3. Join: combines two or more relations on a key; can be a self-join, inner join, or outer join.
    grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

  4. Cross: Cartesian product of two or more relations.
    grunt> cross_data = CROSS customers, orders;

  5. Union: merges two relations; the columns and domains of both relations must be identical.
    grunt> student = UNION student1, student2;
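
The semantics of these operators are easy to check on tiny in-memory relations. This is a sketch with made-up `customers` and `orders` tuples, not how Pig executes them (Pig compiles each operator into MapReduce jobs).

```python
from collections import defaultdict
from itertools import product

def group_by(relation, key_index):
    """GROUP: one (key, bag) pair per distinct key; the bag holds the full tuples."""
    bags = defaultdict(list)
    for tup in relation:
        bags[tup[key_index]].append(tup)
    return dict(bags)

def inner_join(left, right, left_key, right_key):
    """JOIN ... BY: concatenate every pair of tuples whose keys match."""
    return [l + r for l in left for r in right if l[left_key] == r[right_key]]

def cross(left, right):
    """CROSS: Cartesian product of the two relations."""
    return [l + r for l, r in product(left, right)]

customers = [(1, "Asha"), (2, "Ravi")]
orders = [(101, 1), (102, 1), (103, 3)]   # (order_id, customer_id)

print(group_by(orders, 1))                  # {1: [(101, 1), (102, 1)], 3: [(103, 3)]}
print(inner_join(customers, orders, 0, 1))  # [(1, 'Asha', 101, 1), (1, 'Asha', 102, 1)]
print(len(cross(customers, orders)))        # 6
```

The size of the CROSS result (2 x 3 = 6 tuples here) shows why it is expensive on large relations.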
    

10.5 Advanced Pig Commands

  1. Filter: keeps only the tuples that satisfy a condition.
    grunt> filter_data = FILTER college_students BY city == 'Chennai';

  2. Distinct: removes duplicate tuples.
    grunt> distinct_data = DISTINCT college_students;

  3. Foreach: generates a new relation by transforming or projecting the columns of each tuple.
    grunt> foreach_data = FOREACH college_students GENERATE id, firstname, city;

  4. Order by: sorts a relation by one or more fields.
    grunt> order_by_data = ORDER college_students BY id DESC;

  5. Limit: gets a limited number of tuples from a relation.
    grunt> limit_data = LIMIT college_students 4;
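
Chained together, these five operators form a typical small pipeline. The Python below mirrors each step on invented rows; it is an analogy for the semantics, not Pig code.

```python
students = [
    {"id": 1, "firstname": "Asha", "city": "Chennai"},
    {"id": 2, "firstname": "Ravi", "city": "Delhi"},
    {"id": 3, "firstname": "Meera", "city": "Chennai"},
    {"id": 3, "firstname": "Meera", "city": "Chennai"},   # duplicate tuple
]

def distinct(relation):
    """DISTINCT: drop repeated tuples, keeping first occurrences."""
    seen, out = set(), []
    for row in relation:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

filtered = [r for r in students if r["city"] == "Chennai"]      # FILTER ... BY city == 'Chennai'
unique = distinct(filtered)                                      # DISTINCT
projected = [(r["id"], r["firstname"]) for r in unique]          # FOREACH ... GENERATE id, firstname
ordered = sorted(projected, key=lambda t: t[0], reverse=True)    # ORDER ... BY id DESC
limited = ordered[:1]                                            # LIMIT ... 1

print(limited)   # [(3, 'Meera')]
```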
    

Quick Revision Points

Processing Types:

Type | Latency | Tools
Batch | Hours | Hadoop, Hive, Spark batch
Stream | ms to seconds | Kafka, Flink, Spark Streaming
Interactive | Seconds | Presto, Spark SQL, BigQuery

Lambda Architecture:

  • Batch Layer (historical) + Speed Layer (real-time) + Serving Layer (merge).
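
The merge done by the serving layer can be sketched as combining two views at query time. The page names and counts are made up, and `serve` is a hypothetical helper.

```python
def serve(batch_view, speed_view):
    """Serving layer: combine precomputed batch counts with recent real-time counts."""
    merged = dict(batch_view)
    for key, recent in speed_view.items():
        merged[key] = merged.get(key, 0) + recent
    return merged

batch_view = {"page_a": 1000, "page_b": 400}   # recomputed periodically from all history
speed_view = {"page_a": 12, "page_c": 3}       # events since the last batch run

print(serve(batch_view, speed_view))   # {'page_a': 1012, 'page_b': 400, 'page_c': 3}
```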

OLAP Operations:

  • Roll-up, Drill-down, Slice, Dice, Pivot.

Analytics Types:

  • Descriptive → Diagnostic → Predictive → Prescriptive.

Aggregation Functions:

  • COUNT, SUM, AVG, MIN, MAX, STDDEV, GROUP BY.

High-Level Operations:

  • Join, Sort, Sample, Filter, Project, Deduplicate, Window.

Windowing:

  • Tumbling (non-overlapping), Sliding (overlapping), Session (activity-based).
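
The three window types differ only in how events are assigned to windows. A compact sketch, assuming non-negative integer timestamps in seconds (function names and the sample events are illustrative):

```python
def tumbling(timestamps, size):
    """Assign each event to exactly one fixed window [start, start + size)."""
    windows = {}
    for t in timestamps:
        start = (t // size) * size
        windows.setdefault(start, []).append(t)
    return windows

def sliding(timestamps, size, step):
    """Windows of length `size` starting every `step`; an event can fall into several."""
    windows = {}
    if not timestamps:
        return windows
    for start in range(0, max(timestamps) + 1, step):
        members = [t for t in timestamps if start <= t < start + size]
        if members:
            windows[start] = members
    return windows

def sessions(timestamps, gap):
    """Group events into sessions; a new session starts after `gap` or more of inactivity."""
    result = []
    for t in sorted(timestamps):
        if result and t - result[-1][-1] < gap:
            result[-1].append(t)
        else:
            result.append([t])
    return result

events = [1, 2, 6, 7, 15]
print(tumbling(events, 5))   # {0: [1, 2], 5: [6, 7], 15: [15]}
print(sliding(events, 5, 2))
print(sessions(events, 4))   # [[1, 2], [6, 7], [15]]
```

With the sliding windows, the event at t = 6 appears in the windows starting at 2, 4, and 6, which is the overlap the name refers to.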

Workflow Management:

  • DAG = task dependency graph.
  • Apache Airflow = most popular orchestrator.
  • Oozie = Hadoop-specific.
  • NiFi = drag-and-drop flow management.

MapReduce Phases:

  • 7 phases: Input → Map → Intermediate Keys → Combiner (optional) → Shuffle & Sort → Reduce → Output.
  • Combiner = local mini-reducer per mapper; reduces shuffle volume; optional.
  • Shuffle & Sort = group all intermediate values by key before reducer.
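
The effect of a combiner on shuffle volume can be seen by counting the pairs that leave each mapper. This is a sketch with two made-up input splits; a real job would configure a combiner class on the framework.

```python
from collections import Counter

def map_words(split):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Combiner: sum counts locally on the mapper before anything is shuffled."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

splits = ["cat dog cat", "bird dog cat"]   # two mappers, one split each

without = [pair for s in splits for pair in map_words(s)]
with_combiner = [pair for s in splits for pair in combine(map_words(s))]

print(len(without), len(with_combiner))   # 6 5
```

The saving grows with the number of repeated keys per split; the final reduce output is unchanged because summation is associative and commutative.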

Word Count Algorithm:

  • MAP(doc_name, doc) → for each word w in doc: emit(w, 1).
  • REDUCE(word, values) → sum = Σ values; emit(word, sum).
  • Sample: "cat dog cat bird dog cat" → (cat,3), (dog,2), (bird,1).
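
The MAP and REDUCE rules above can be run end to end in a single process. This is a local simulation for revision purposes, not a Hadoop job; the shuffle step that the framework normally performs is written out by hand.

```python
from collections import defaultdict

def map_phase(doc_name, doc):
    """MAP: emit (word, 1) for each word in the document."""
    return [(word, 1) for word in doc.split()]

def shuffle_and_sort(pairs):
    """Group all intermediate values by key, as the framework does before reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return dict(sorted(grouped.items()))

def reduce_phase(word, values):
    """REDUCE: sum the values for one word."""
    return word, sum(values)

def word_count(doc_name, doc):
    grouped = shuffle_and_sort(map_phase(doc_name, doc))
    return dict(reduce_phase(word, values) for word, values in grouped.items())

print(word_count("sample.txt", "cat dog cat bird dog cat"))
# {'bird': 1, 'cat': 3, 'dog': 2}
```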

Real-Time Analytics - 5 Component Areas:

  1. Data Ingestion (Kafka, Kinesis, RabbitMQ)
  2. Data Processing Engines (Flink, Storm, Spark Streaming)
  3. Real-Time Querying (Druid, ClickHouse, Redshift)
  4. Data Storage (InfluxDB, MongoDB, Cassandra, HBase)
  5. Visualization & Alerts (Grafana, Kibana, Tableau, Power BI)

Real-Time Analytics Technologies:

  • Streaming data processing, Event streaming platforms, Apache Kafka, Apache Storm, Apache Flink, Apache Pinot, Machine Learning, Event-Driven Architecture (EDA), Data Pipeline.

Apache Hive - Key Points:

  • Data warehouse on top of Hadoop for structured data; originally by Facebook.
  • 7 features: Indexes, RDBMS metadata, UDFs, compression algorithms (DEFLATE/BWT/Snappy), schema in DB, OLAP-built, HiveQL.
  • Architecture units: User Interface → Meta Store + HiveQL Process Engine → Execution Engine → MapReduce → HDFS/HBase.
  • Workflow: interactive vs non-interactive mode; data → partitions → buckets.
  • 9 data-flow steps for query execution.

HiveQL (HQL):

  • SQL-like; SELECT … FROM … WHERE … GROUP BY … HAVING … CLUSTER BY / DISTRIBUTE BY / SORT BY … LIMIT.

Apache Pig - Key Points:

  • High-level data flow platform; scripts are written in Pig Latin and internally converted to MapReduce jobs.
  • Architecture: Parser → Optimizer → Compiler → Execution Engine → MapReduce → Hadoop → HDFS.
  • Basic commands: fs -ls, clear, history, LOAD, STORE, Dump, Describe, Explain, Illustrate.
  • Intermediate: Group, Cogroup, Join, Cross, Union.
  • Advanced: Filter, Distinct, Foreach, Order by, Limit.

Expected Exam Questions

15-Mark Questions:

  1. Explain MapReduce working. Write a program for Word Count using MapReduce. (2023)
  2. What is Hive? Explain Hive features, Hive integration and workflow and architecture in detail. (2023)
  3. Write short note on Hive Query Language (HQL). (2023)
  4. Write short note on Pig architecture and commands. (2023)
  5. What is Real Time Analytics? Discuss their technologies in detail. (2022)
  6. Describe the key components of Big Data processing pipelines. How do these components work together to ingest, process, and analyze large volumes of data? (2024)

Short Answer Questions (2.5 marks):

  1. Explain MapReduce. (2022)
  2. What do you mean by Big Data Processing pipeline? (2024)

These notes were compiled by Deepak Modi
Last updated: May 2026
