
Unit 2: Data Repositories and Big Data Platforms

RDBMS, NoSQL, Data Marts, Data Lakes, ETL, Data Pipelines, Big Data Processing Tools, Modern Data Ecosystem, Types of Data, File Formats, and Service Bindings.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Data Repositories and Big Data Platforms: RDBMS, NoSQL, Data Marts, Data Lakes, ETL, and Data Pipelines, Foundations of Big Data, Big Data Processing Tools, Modern Data Ecosystem, Key Players, Types of Data, Understanding Different Types of File Formats, Sources of Data Using Service Bindings.


🎯 PYQ Analysis for Unit 2

PYQs will be added after analysis; check back soon.


Section 1: RDBMS

1.1 What is RDBMS?

Definition:

A Relational Database Management System (RDBMS) is a database system that stores data in tables (relations) with rows and columns, and uses SQL (Structured Query Language) to manage and query data. Relationships between tables are established using primary keys and foreign keys.

Simple Analogy:

Think of an RDBMS like a well-organized set of Excel spreadsheets where each sheet is a table, and the sheets are linked to each other using unique IDs.


1.2 Key Concepts of RDBMS

| Concept | Definition |
|------|------|
| Table | Collection of rows and columns (like a spreadsheet) |
| Row (Tuple) | One record / data entry |
| Column (Attribute) | One field / property |
| Primary Key | Unique identifier for each row |
| Foreign Key | Column that links to the primary key of another table |
| Schema | Fixed structure defining table columns and data types |
| SQL | Language used to create, read, update, delete data |

1.3 ACID Properties

RDBMS guarantees data integrity through ACID properties:

| Property | Meaning | Example |
|------|------|------|
| Atomicity | Transaction is all-or-nothing | Bank transfer: debit + credit both happen or neither does |
| Consistency | DB always moves from one valid state to another | Balance can't go negative if a rule says so |
| Isolation | Concurrent transactions don't interfere | Two users buying the last ticket: only one succeeds |
| Durability | Committed data is permanently saved | Data survives power failure |
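
To see atomicity and durability in code, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table and amounts are made up for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 100)])
conn.commit()

try:
    # Atomicity: both updates belong to one transaction.
    conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 2")
    conn.commit()    # Durability: changes persist only after commit
except Exception:
    conn.rollback()  # On failure, neither update survives (all-or-nothing)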

1.4 Advantages and Limitations of RDBMS

Advantages:

✅ Structured and organized data.
✅ Powerful querying with SQL.
✅ Strong ACID guarantees.
✅ Well-understood, mature technology.
✅ Excellent for transactional (OLTP) workloads.

Limitations in Big Data context:

❌ Difficult to scale horizontally (designed for vertical scaling).
❌ Fixed schema: cannot handle unstructured data (images, JSON, logs).
❌ Poor performance with very large datasets (billions of rows).
❌ Joins across large tables are slow.
❌ Not designed for real-time streaming data.

Popular RDBMS: MySQL, PostgreSQL, Oracle, Microsoft SQL Server, SQLite.


Section 2: NoSQL

2.1 What is NoSQL?

Definition:

NoSQL (Not Only SQL) refers to a class of database systems that do not use the traditional relational table model. They are designed to handle:

  • Unstructured or semi-structured data
  • Massive scale (horizontal scaling)
  • High velocity reads/writes
  • Flexible schemas (no fixed column structure)

Why NoSQL for Big Data?

Traditional RDBMS struggles when:

  • Data has no fixed structure (social media posts, sensor readings).
  • Billions of records need to be read/written per second.
  • Data needs to be spread across hundreds of servers.

2.2 Types of NoSQL Databases

1. Key-Value Stores

  • Simplest type: data stored as (key, value) pairs.
  • Like a dictionary / hashmap.
  • Very fast lookup by key.
Key         Value
───────────────────────────
user:001  → {"name":"Deepak","age":22}
user:002  → {"name":"Ankit","age":23}
session:x → "active"

Examples: Redis, DynamoDB, Riak
Use cases: Caching, session management, leaderboards.
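
As a sketch of how the key-value model is used in practice, the same data could be stored with the redis-py client (this assumes a Redis server on localhost; key names follow the example above):

import json
import redis  # pip install redis

r = redis.Redis(host="localhost", port=6379)

# Keys map to opaque values, like a giant shared hashmap.
r.set("user:001", json.dumps({"name": "Deepak", "age": 22}))
r.set("session:x", "active")

# Lookups go straight by key: no query planner, no joins.
user = json.loads(r.get("user:001"))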


2. Document Stores

  • Data stored as documents (usually JSON or BSON).
  • Each document can have a different structure (schema-less).
  • Documents are grouped in collections (like tables in RDBMS).
{
  "_id": "001",
  "name": "Deepak",
  "courses": ["ML", "BDA"],
  "address": {
    "city": "Jaipur",
    "pin": "302001"
  }
}

Examples: MongoDB, CouchDB, Firebase Firestore
Use cases: Content management, catalogs, user profiles, real-time apps.
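
A minimal document-store sketch with the pymongo client (assumes a local MongoDB instance; the database and collection names are invented):

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
students = client["college"]["students"]  # database -> collection

# Two documents in the same collection with different fields (schema-less).
students.insert_one({"_id": "001", "name": "Deepak", "courses": ["ML", "BDA"]})
students.insert_one({"_id": "002", "name": "Ankit", "city": "Delhi"})

# Query on any field, including array members.
doc = students.find_one({"courses": "BDA"})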


3. Column-Family Stores (Wide-Column)

  • Data stored in rows, but each row can have different columns.
  • Columns are grouped into column families.
  • Optimized for reading/writing specific columns across many rows.
Row Key   | Personal Info         | Scores
──────────┼───────────────────────┼──────────────
user001   | name="Deepak",age=22  | math=90,sci=85
user002   | name="Ankit"          | math=75

Examples: Apache HBase, Apache Cassandra, Google Bigtable
Use cases: Time-series data, IoT, analytics, messaging systems.
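
As a rough sketch of the wide-column model, the table above could be written through the happybase client for HBase (this assumes a local HBase Thrift server and a pre-created users table; the column families are shortened to info and scores):

import happybase  # pip install happybase

conn = happybase.Connection("localhost")  # connects to the HBase Thrift server
table = conn.table("users")

# Columns live inside families; each row may have a different set of columns.
table.put(b"user001", {b"info:name": b"Deepak", b"info:age": b"22",
                       b"scores:math": b"90", b"scores:sci": b"85"})
table.put(b"user002", {b"info:name": b"Ankit", b"scores:math": b"75"})

row = table.row(b"user001")  # fetch just this row's cells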


4. Graph Databases

  • Data stored as nodes (entities) and edges (relationships).
  • Best for data where relationships are the primary value.
(Deepak) ──FRIENDS──► (Ankit)
(Deepak) ──ENROLLED──► (ML Course)
(Ankit)  ──ENROLLED──► (BDA Course)

Examples: Neo4j, Amazon Neptune, ArangoDB
Use cases: Social networks, fraud detection, recommendation engines, knowledge graphs.
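
A small sketch of the graph above using the official neo4j Python driver with Cypher (the connection URI and credentials are placeholders):

from neo4j import GraphDatabase  # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two nodes and the FRIENDS relationship between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS]->(b)",
        a="Deepak", b="Ankit",
    )
    # Traversal query: who is Deepak connected to?
    for rec in session.run("MATCH (:Person {name:'Deepak'})-[:FRIENDS]->(f) RETURN f.name"):
        print(rec["f.name"])

driver.close()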


2.3 RDBMS vs NoSQL

| Feature | RDBMS | NoSQL |
|------|------|------|
| Data Model | Tables (rows & columns) | Key-value, document, column, graph |
| Schema | Fixed (rigid) | Flexible (dynamic) |
| Scaling | Vertical (scale up) | Horizontal (scale out) |
| Query Language | SQL | Varies (no standard) |
| ACID | Full ACID | Often BASE (eventual consistency) |
| Data Type | Structured only | Structured, semi, unstructured |
| Performance | Slower at huge scale | Faster for simple queries at scale |
| Examples | MySQL, Oracle | MongoDB, Cassandra, Redis |
| Use Case | Banking, ERP | Social media, IoT, real-time |

BASE properties (NoSQL alternative to ACID):

| Property | Meaning |
|------|------|
| Basically Available | System is always available (may return stale data) |
| Soft State | State may change over time without input |
| Eventual Consistency | Data will eventually be consistent across nodes |

Section 3: Data Warehouses, Data Marts, and Data Lakes

3.1 Data Warehouse

Definition:

A Data Warehouse is a large, centralized repository that stores integrated, historical, and structured data from multiple sources, optimized for analytical queries (OLAP) rather than transactional processing (OLTP).

  Source Systems                     Data Warehouse
  ──────────────                     ───────────────
  Sales DB   ──┐
  HR DB      ──┼──► ETL Process ──► Centralized   ──► BI Reports
  Finance DB ───                    Warehouse         Analytics
  CRM        ──┘                    (OLAP)            Dashboards

Key Characteristics:

  • Subject-oriented: organized around subjects (sales, customers).
  • Integrated: data from multiple sources combined.
  • Non-volatile: data is not deleted; historical data is kept.
  • Time-variant: data is tracked over time (time dimension always present).

Popular tools: Amazon Redshift, Google BigQuery, Snowflake, Teradata.


3.2 Data Marts

Definition:

A Data Mart is a smaller, focused subset of a data warehouse that serves the needs of a specific department or business unit.

                 ┌─────────────────┐
                 │  Data Warehouse │
                 └────────┬────────┘
          ┌───────────────┼───────────────┐
          ▼               ▼               ▼
   ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
   │ Sales Mart  │ │ HR Mart     │ │ Finance Mart│
   └─────────────┘ └─────────────┘ └─────────────┘
   (Sales team)    (HR team)        (Finance team)

Types:

| Type | Description |
|------|------|
| Dependent Mart | Created from the central data warehouse |
| Independent Mart | Created directly from source systems |
| Hybrid Mart | Combination of both |

Data Warehouse vs Data Mart:

| Feature | Data Warehouse | Data Mart |
|------|------|------|
| Scope | Enterprise-wide | Department-specific |
| Size | Very large (TB–PB) | Smaller (GB–TB) |
| Users | All departments | One team |
| Build time | Months | Weeks |

3.3 Data Lakes

Definition:

A Data Lake is a centralized repository that stores raw data in its original format (structured, semi-structured, and unstructured) at any scale. Data is stored as-is until it is needed.

Key Idea: Store everything now, figure out the schema when you read it (schema-on-read).

  Sources                     Data Lake
  ───────                     ─────────
  IoT sensors ──┐
  Web logs    ──┼──► Store raw ──► Analytics
  Social media───    data as-is    ML models
  Videos      ───    (no schema)   Data science
  CSVs        ──┘
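
To make schema-on-read concrete, here is a small pandas sketch (the file path and field names are hypothetical): the raw events landed in the lake without any schema, and structure is imposed only when they are read.

import pandas as pd

# Raw JSON-lines events were dumped into the lake exactly as they arrived.
events = pd.read_json("lake/raw/web_logs.jsonl", lines=True)

# Schema-on-read: pick and type the fields this analysis needs, now.
clicks = events[["user_id", "url", "timestamp"]].dropna()
clicks["timestamp"] = pd.to_datetime(clicks["timestamp"])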

Data Warehouse vs Data Lake:

| Feature | Data Warehouse | Data Lake |
|------|------|------|
| Data Type | Structured only | All types |
| Schema | Schema-on-write | Schema-on-read |
| Users | Business analysts | Data scientists |
| Processing | Pre-processed | Raw data |
| Cost | Higher | Lower (object storage) |
| Agility | Less flexible | Highly flexible |
| Risk | Low (clean data) | "Data Swamp" if unmanaged |

Data Swamp: A data lake that becomes disorganized and unusable because data is dumped in without governance.

Popular platforms: AWS S3 + Glue, Azure Data Lake Storage, Google Cloud Storage, Databricks Delta Lake.


Section 4: ETL and Data Pipelines

4.1 What is ETL?

Definition:

ETL stands for Extract, Transform, Load; it is the process of moving data from source systems into a data warehouse or data lake for analysis.

  Source Systems          ETL Process              Target
  ──────────────         ─────────────            ────────
  Database A   ──► EXTRACT ──► TRANSFORM ──► LOAD ──► Data Warehouse
  Database B   ──►           (clean,              ──► Data Lake
  API / Files  ──►            convert,            ──► Analytics DB
                              aggregate)

4.2 Three Phases of ETL

Phase 1: Extract

  • Pull raw data from multiple source systems.
  • Sources: RDBMS, flat files (CSV), APIs, web scraping, IoT sensors.
  • Data may be in different formats and quality.
  • Full extraction (all data) or incremental extraction (only changed data).

Phase 2: Transform

  • The most complex phase.
  • Clean and shape data into a consistent format for analysis.

Common transformations:

| Operation | Description |
|------|------|
| Cleaning | Handle missing values, remove duplicates |
| Filtering | Select only relevant rows/columns |
| Mapping | Convert data types (string → date) |
| Aggregation | Group and summarize (SUM, AVG, COUNT) |
| Joining | Combine data from multiple sources |
| Sorting | Order data |
| Encoding | Convert categories to codes |
| Normalization | Standardize numeric ranges |

Phase 3: Load

  • Write the transformed data into the target system (data warehouse/lake).
  • Full load: overwrite everything.
  • Incremental load: only add new/changed records (delta load).
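
Putting the three phases together, here is a minimal batch ETL sketch in plain Python (the CSV source, cleaning rules, and SQLite target are illustrative stand-ins, not a production pipeline):

import csv
import sqlite3

# EXTRACT: pull raw rows from a source file.
with open("sales_raw.csv", newline="") as f:
    raw = list(csv.DictReader(f))

# TRANSFORM: clean missing values, normalize text, aggregate per region.
totals = {}
for row in raw:
    if not row.get("amount"):
        continue                            # cleaning: drop bad rows
    region = row["region"].strip().upper()  # mapping: normalize text
    totals[region] = totals.get(region, 0.0) + float(row["amount"])

# LOAD: write the transformed result into the target store.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT, total REAL)")
conn.executemany("INSERT INTO sales_by_region VALUES (?, ?)", totals.items())
conn.commit()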

4.3 ELT vs ETL

Modern cloud systems sometimes use ELT (Extract → Load → Transform):

| Feature | ETL | ELT |
|------|------|------|
| Order | Transform before loading | Load raw, transform later |
| Where transformed | Separate transformation server | Inside the target system |
| Best for | Traditional data warehouses | Cloud data lakes / warehouses |
| Flexibility | Less (schema must match) | More (raw data preserved) |
| Examples | Informatica, Talend | dbt, BigQuery, Snowflake |

4.4 Data Pipelines

Definition:

A Data Pipeline is an automated sequence of steps (stages) that moves data from one or more sources to a destination, performing transformations along the way.

ETL is a type of data pipeline, but pipelines can also include:

  • Data validation
  • Monitoring and alerting
  • Scheduling
  • Error handling and retries

Pipeline Types:

| Type | Description | Example |
|------|------|------|
| Batch Pipeline | Processes data in chunks at scheduled intervals | Nightly ETL job |
| Streaming Pipeline | Processes data continuously in real time | Live fraud detection |
| Lambda Architecture | Combines batch + streaming layers | Twitter analytics |
| Kappa Architecture | Streaming only (no separate batch layer) | Kafka-based systems |

Popular Data Pipeline Tools:

| Tool | Type | Use Case |
|------|------|------|
| Apache Airflow | Batch orchestration | Scheduling complex workflows |
| Apache Kafka | Streaming | Real-time event streaming |
| Apache NiFi | Data flow | Drag-and-drop pipeline builder |
| AWS Glue | Cloud ETL | Serverless ETL on AWS |
| dbt | Transformation | SQL-based data transformation |
| Spark | Batch + Stream | Large-scale data processing |
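
As a sketch of batch orchestration, here is a bare-bones Apache Airflow DAG (Airflow 2.x style; the schedule argument needs 2.4+, the task bodies are placeholders, and the DAG id is invented):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():   print("pull from sources")
def transform(): print("clean, map, aggregate")
def load():      print("write to the warehouse")

with DAG(
    dag_id="nightly_etl",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",   # batch pipeline: one run per day
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3       # dependencies: extract -> transform -> load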

Section 5: Big Data Processing Tools

5.1 Apache Hadoop

Hadoop is an open-source framework for storing and processing Big Data in a distributed environment using clusters of commodity hardware.

Core Components:

┌───────────────────────────────────────┐
│           Hadoop Ecosystem            │
├───────────┬───────────────────────────┤
│ HDFS      │ Distributed file storage  │
│ YARN      │ Resource manager          │
│ MapReduce │ Batch processing engine   │
└───────────┴───────────────────────────┘

| Component | Role |
|------|------|
| HDFS | Distributed file system (storage) |
| YARN | Cluster resource manager (who gets CPU/memory) |
| MapReduce | Parallel batch processing model |

5.2 Apache Spark

Apache Spark is a fast, in-memory distributed processing engine. It is the successor to MapReduce for most workloads.

Key Advantage over MapReduce:

MapReduce: Read from disk → Process → Write to disk → Read → Process → Write...
           (every step goes to disk = SLOW)

Spark:     Read from disk → Process in RAM → Process in RAM → Write to disk
           (keeps data in memory = UP TO 100x FASTER)

Spark Components:

| Component | Purpose |
|------|------|
| Spark Core | Basic processing engine |
| Spark SQL | SQL queries on structured data |
| Spark Streaming | Real-time stream processing |
| MLlib | Machine learning library |
| GraphX | Graph processing |
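
A short PySpark sketch combining Spark Core with Spark SQL-style DataFrame operations (the input file and column names are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

# The file is read in parallel across the cluster; data stays in memory.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Transformations are lazy; show() is the action that triggers execution.
df.groupBy("region").sum("amount").show()

spark.stop()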

5.3 Other Key Tools

| Tool | Category | Purpose |
|------|------|------|
| Apache Hive | Query | SQL-like queries on HDFS (batch) |
| Apache Pig | Scripting | Data flow using the Pig Latin language |
| Apache HBase | NoSQL DB | Real-time read/write on HDFS |
| Apache Kafka | Messaging | High-throughput event streaming |
| Apache Storm | Stream | Real-time stream processing |
| Apache Flink | Stream | Low-latency stateful stream processing |
| Apache Sqoop | Ingestion | Import/export between RDBMS and HDFS |
| Apache Flume | Ingestion | Collect and move log data to HDFS |
| ZooKeeper | Coordination | Distributed coordination service |

Section 6: Modern Data Ecosystem and Key Players

6.1 Modern Data Ecosystem

The Modern Data Ecosystem is the complete landscape of tools, technologies, and processes that organizations use to collect, store, process, and analyze data.

┌──────────────────────────────────────────────────────────────┐
│                    Modern Data Ecosystem                     │
├──────────────┬──────────────────┬──────────────┬────────────┤
│ Data Sources │ Storage          │ Processing   │ Consume    │
├──────────────┼──────────────────┼──────────────┼────────────┤
│ Databases    │ Data Lakes       │ Spark        │ BI Tools   │
│ APIs         │ Data Warehouses  │ Kafka        │ ML Models  │
│ IoT devices  │ Cloud Storage    │ Flink        │ Dashboards │
│ Web logs     │ HDFS             │ MapReduce    │ Reports    │
│ Social media │ NoSQL            │ Airflow      │ Apps       │
└──────────────┴──────────────────┴──────────────┴────────────┘

6.2 Key Players in the Big Data Industry

Cloud Providers:

| Company | Platform | Key Services |
|------|------|------|
| Amazon (AWS) | AWS | S3, EMR, Redshift, Glue, Kinesis |
| Google | GCP | BigQuery, Dataflow, Pub/Sub, Dataproc |
| Microsoft | Azure | Azure Data Lake, Synapse, HDInsight |

Open Source / Software Vendors:

| Company/Project | Contribution |
|------|------|
| Apache Foundation | Hadoop, Spark, Kafka, Hive, Flink |
| Databricks | Managed Spark, Delta Lake |
| Cloudera | Enterprise Hadoop distribution |
| MongoDB | NoSQL document database |
| Elastic | Elasticsearch for search & analytics |
| Confluent | Managed Kafka platform |

BI and Analytics:

| Tool | Use |
|------|------|
| Tableau | Data visualization |
| Power BI | Microsoft BI platform |
| Looker | Data exploration (Google) |
| Grafana | Monitoring dashboards |

Section 7: Types of Data

7.1 Classification by Structure

Data in the Big Data world falls into three categories:

1. Structured Data

  • Has a predefined schema (fixed columns and data types).
  • Stored in rows and columns.
  • Easily searchable with SQL.
| ID  | Name    | Age | Salary  |
|-----|---------|-----|---------|
| 001 | Deepak  | 22  | 50000   |
| 002 | Ankit   | 23  | 55000   |

Examples: Bank transactions, HR records, inventory.
Storage: RDBMS, CSV files.
Volume: ~20% of all data generated.


2. Semi-Structured Data

  • Has some structure but no strict schema.
  • Uses tags, keys, or markers to separate elements.
  • More flexible than structured data.

JSON Example:

{
  "user": "Deepak",
  "age": 22,
  "skills": ["Python", "ML", "BDA"]
}

XML Example:

<student>
  <name>Deepak</name>
  <age>22</age>
</student>

Examples: JSON, XML, HTML, emails, log files, NoSQL documents.
Volume: ~5–10% of all data.


3. Unstructured Data

  • Has no predefined format or schema.
  • Cannot be stored directly in RDBMS.
  • Largest and fastest-growing category.

Examples:

  • Text: emails, social media posts, chat logs
  • Images: photos, medical scans, satellite images
  • Audio: call recordings, music
  • Video: surveillance footage, YouTube videos

Volume: ~80% of all data generated today.


7.2 Classification by Processing Time

| Type | Description | Example |
|------|------|------|
| Historical / Batch | Stored data processed in chunks | Monthly sales report |
| Real-time / Streaming | Data processed as it arrives | Live fraud alert |
| Near real-time | Small delay (seconds/minutes) | Dashboard refresh |

7.3 Classification by Source

| Type | Description | Example |
|------|------|------|
| Internal data | Generated within the organization | Employee records |
| External data | From outside sources | Social media, open datasets |
| First-party data | Collected directly from customers | Website clicks |
| Second-party data | Shared from a partner | Partner CRM data |
| Third-party data | Purchased from data aggregators | Market research firms |

Section 8: File Formats in Big Data

8.1 Why File Format Matters

Choosing the right file format affects:

  • Storage space (compression)
  • Query speed (columnar vs row-based)
  • Compatibility (which tools can read it)
  • Schema evolution (can you add fields later?)

8.2 Common Big Data File Formats

1. CSV (Comma-Separated Values)

name,age,city
Deepak,22,Jaipur
Ankit,23,Delhi
  • Simple, human-readable text format.
  • No compression (large file size).
  • No schema enforcement.
  • Best for: Small datasets, data exchange, spreadsheet imports.

2. JSON (JavaScript Object Notation)

{"name": "Deepak", "age": 22, "city": "Jaipur"}
{"name": "Ankit",  "age": 23, "city": "Delhi"}
  • Human-readable, flexible schema.
  • Supports nested structures.
  • Verbose (large file size).
  • Best for: APIs, web data, semi-structured data.

3. Parquet

  • Columnar storage format (stores data column by column, not row by row).
  • Highly compressed.
  • Optimized for analytical queries (reading specific columns).
Row-based (CSV):             Columnar (Parquet):
Row 1: Deepak,22,Jaipur      Name col: Deepak, Ankit
Row 2: Ankit,23,Delhi        Age col:  22, 23
Row 3: ...                   City col: Jaipur, Delhi
  • Best for: Analytics on specific columns (most big data workloads).
  • Used by: Spark, Hive, Hadoop, BigQuery.
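
A quick pandas sketch of the columnar payoff (needs the pyarrow package installed; the file name is illustrative):

import pandas as pd  # plus: pip install pyarrow

df = pd.DataFrame({"name": ["Deepak", "Ankit"],
                   "age": [22, 23],
                   "city": ["Jaipur", "Delhi"]})

# Written column by column, compressed, with the schema embedded.
df.to_parquet("users.parquet")

# The columnar win: read back only the column the query needs.
ages = pd.read_parquet("users.parquet", columns=["age"])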

4. Avro

  • Row-based binary format.
  • Schema stored with the data (self-describing).
  • Supports schema evolution (can add/remove fields).
  • Best for: Data serialization, Kafka messaging, Hadoop ingestion.
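
A minimal Avro sketch using the fastavro library (the record schema below is an invented example):

from fastavro import parse_schema, reader, writer  # pip install fastavro

schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "name", "type": "string"},
               {"name": "age", "type": "int"}],
})

# The schema travels inside the file header (self-describing).
with open("users.avro", "wb") as out:
    writer(out, schema, [{"name": "Deepak", "age": 22}])

with open("users.avro", "rb") as f:
    records = list(reader(f))  # schema is read back from the file itself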

5. ORC (Optimized Row Columnar)

  • Columnar format optimized for Apache Hive.
  • Better compression than Parquet in some cases.
  • Supports ACID transactions in Hive.
  • Best for: Hive workloads, data warehousing.

6. Sequence File

  • Hadoop-native binary format.
  • Stores key-value pairs.
  • Supports compression.
  • Best for: Intermediate MapReduce output.

8.3 File Format Comparison

| Format | Type | Compressed | Schema | Best For |
|------|------|------|------|------|
| CSV | Row, text | No | No | Simple exchange |
| JSON | Row, text | No | Flexible | APIs, semi-structured |
| Parquet | Columnar, binary | Yes | Yes | Analytics (Spark, Hive) |
| Avro | Row, binary | Yes | Yes (embedded) | Streaming, Kafka |
| ORC | Columnar, binary | Yes | Yes | Hive, data warehouse |
| Sequence | Row, binary | Optional | No | Hadoop MapReduce |

Section 9: Sources of Data Using Service Bindings

9.1 What are Service Bindings?

Service Bindings (also called data source connectors or service integrations) are mechanisms that allow Big Data platforms to connect to and ingest data from external services and sources without manually moving files.


9.2 Types of Service Bindings / Data Sources

1. Database Connectors

  • Connect directly to RDBMS or NoSQL databases.
  • Tools: Apache Sqoop (RDBMS ↔ HDFS), Spark JDBC connector.
MySQL Database ──► Sqoop ──► HDFS / Data Lake
Oracle DB      ──► JDBC  ──► Spark

2. Message Queue / Streaming Services

  • Real-time data ingestion from event streams.
  • Tools: Apache Kafka, Amazon Kinesis, Google Pub/Sub.
Web events / Clicks ──► Kafka Topic ──► Spark Streaming ──► Analytics
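
A minimal producer sketch with the kafka-python client (the broker address and topic name are assumptions):

import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Each event is appended to the topic; consumers read it in real time.
producer.send("click-events", {"user": "u001", "page": "/home"})
producer.flush()  # block until the broker has the event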

3. REST APIs

  • Pull data from web services using HTTP requests.
  • Response usually in JSON or XML.
Twitter API ──► HTTP GET ──► JSON data ──► Data Lake
Weather API ──► HTTP GET ──► JSON data ──► Analysis
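
A tiny ingestion sketch with the requests library (the URL and parameters are placeholders, not a real endpoint):

import requests  # pip install requests

resp = requests.get("https://api.example.com/v1/weather",
                    params={"city": "Jaipur"}, timeout=10)
resp.raise_for_status()  # fail loudly on HTTP errors

record = resp.json()     # the response body arrives as JSON
# From here, record would be appended to the data lake for later analysis.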

4. Cloud Storage Bindings

  • Direct access to cloud object storage.
  • AWS S3, Azure Blob Storage, Google Cloud Storage.
  • Spark / Hadoop can read files directly from S3 without downloading.
S3 bucket (CSV files) ──► Spark ──► Processed results ──► Redshift
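
A small boto3 sketch of reading straight from object storage (the bucket and key names are invented; AWS credentials are assumed to be configured in the environment):

import boto3  # pip install boto3

s3 = boto3.client("s3")

# List the raw files sitting in the lake bucket...
listing = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/")
for obj in listing.get("Contents", []):
    print(obj["Key"])

# ...then stream one object directly, with no local download step.
body = s3.get_object(Bucket="my-data-lake", Key="raw/sales.csv")["Body"].read()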

5. File System / FTP / SFTP

  • Batch file ingestion from file servers.
  • Tools: Apache Flume, NiFi.
FTP Server (log files) ──► Flume ──► HDFS

6. IoT and Sensor Data

  • Devices send data using MQTT or HTTP protocols.
  • Platforms: AWS IoT, Azure IoT Hub.
Smart meter ──► MQTT ──► IoT Hub ──► Kafka ──► Analytics
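
A device-side sketch with the paho-mqtt client, using the 1.x-style API (the broker hostname, topic, and payload are invented):

import paho.mqtt.client as mqtt  # pip install "paho-mqtt<2"

client = mqtt.Client()  # 1.x-style constructor
client.connect("broker.example.com", 1883)

# One meter reading published to a topic; the IoT hub side subscribes to it.
client.publish("meters/001/reading", '{"kwh": 3.2}')
client.disconnect()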

7. Web Scraping

  • Extract data from websites automatically.
  • Tools: Scrapy, BeautifulSoup, Selenium.
Website ──► Scraper ──► JSON/HTML ──► Data Lake ──► Processing
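
A short scraping sketch with requests and BeautifulSoup (the URL and tag choice are illustrative; real scrapers should respect robots.txt and site terms):

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = requests.get("https://example.com/articles", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Pull every headline into a plain list, ready to land in the data lake.
titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]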

9.3 Data Ingestion Summary

| Source Type | Protocol / Tool | Use Case |
|------|------|------|
| RDBMS | Sqoop, JDBC | Migrate existing DB to HDFS |
| Event streams | Kafka, Kinesis | Real-time logs, clicks |
| REST APIs | HTTP, Requests | Social media, weather, finance |
| Cloud storage | S3, GCS, ADLS | Bulk file access |
| Files | Flume, NiFi, FTP | Log files, batch CSVs |
| IoT devices | MQTT, IoT Hub | Sensor readings |
| Web | Scrapy, Selenium | Public web data |

Quick Revision Points

Data Repositories:

RDBMS        → Structured, SQL, ACID, vertical scale
NoSQL        → Flexible, horizontal scale, BASE
  Types: Key-Value, Document, Column-Family, Graph
Data Warehouse → Integrated, historical, OLAP
Data Mart    → Subset of DW for one department
Data Lake    → Raw data, schema-on-read, all types

ETL:

Extract → Transform → Load
  Extract:   Pull from sources
  Transform: Clean, map, aggregate
  Load:      Write to warehouse/lake

File Formats:

| Format | Type | Best For |
|------|------|------|
| CSV | Row, text | Simple exchange |
| JSON | Row, text | APIs |
| Parquet | Columnar | Analytics (Spark) |
| Avro | Row, binary | Kafka, streaming |
| ORC | Columnar | Hive |

Types of Data:

  • Structured: fixed schema, SQL (~20%)
  • Semi-structured: JSON, XML, flexible (~5–10%)
  • Unstructured: images, video, text (~80%)

Key Players:

  • Cloud: AWS, GCP, Azure
  • Open source: Apache (Hadoop, Spark, Kafka)
  • BI: Tableau, Power BI

Service Bindings:

  • DB connectors (Sqoop), Streams (Kafka), APIs (REST), Cloud (S3), IoT (MQTT), Web scraping.

Expected Exam Questions

PYQs will be added after analysis; check back soon.


These notes were compiled by Deepak Modi
Last updated: May 2026

Found an error or want to contribute?

This content is open-source and maintained by the community. Help us improve it!