BDA (Semester 8)

Unit 3: Big Data Modeling and Management

Data Storage, Quality, Operations, Ingestion, Scalability, Security, Traditional DBMS vs Big Data Systems, Data Models: Structure, Operations, Constraints, Types.

Author: Deepak Modi
Last Updated: 2026-05-10

Syllabus:

Introduction to Big Data Modeling and Management: Data Storage, Data Quality, Data Operations, Data Ingestion, Scalability and Security, Traditional DBMS and Big Data Management Systems, Real Life Applications, Data Model: Structure, Operations, Constraints, Types of Big Data Model.


🎯 PYQ Analysis for Unit 3

PYQs will be added after analysis - check back soon.


Section 1: Data Storage in Big Data

1.1 What is Data Storage?

Definition:

Data Storage in the context of Big Data refers to the systems and technologies used to store massive volumes of diverse data reliably, efficiently, and in a way that supports large-scale processing and retrieval.


1.2 Types of Data Storage

1. File-Based Storage

  • Data stored as raw files on a distributed file system.
  • No schema enforced at write time.
  • Example: HDFS, Amazon S3, Azure Data Lake Storage.
  /data/
    ├── logs/
    │     ├── 2026-01-01.log
    │     └── 2026-01-02.log
    ├── images/
    └── transactions.csv

Best for: Raw data ingestion, data lakes, batch processing.


2. Relational Storage

  • Data stored in tables with fixed schema.
  • Uses SQL for querying.
  • Example: MySQL, PostgreSQL, Amazon Redshift (analytical RDBMS).

Best for: Structured data, OLTP, traditional reporting.


3. NoSQL Storage

  • Flexible schema for semi-structured and unstructured data.
  • Types: Key-Value, Document, Column-Family, Graph.
  • Example: MongoDB, Cassandra, HBase, Redis.

Best for: Real-time applications, flexible schema data, high write throughput.


4. In-Memory Storage

  • Data stored in RAM instead of disk.
  • Extremely fast access (microseconds vs milliseconds for disk).
  • Example: Redis, Memcached, Apache Spark (RDD caching).

Best for: Caching, real-time analytics, iterative ML algorithms.


5. Object Storage

  • Data stored as objects (file + metadata + unique ID).
  • Highly scalable, no hierarchy (flat namespace).
  • Example: Amazon S3, Google Cloud Storage, Azure Blob.

Best for: Backups, media files, data lake foundation, archival.


1.3 Storage Comparison

| Type | Schema | Scale | Speed | Best For |
|------|--------|-------|-------|----------|
| File (HDFS/S3) | No | Massive | Medium | Batch, data lake |
| RDBMS | Fixed | Limited | Medium | OLTP, reporting |
| NoSQL | Flexible | High | Fast | Real-time, varied data |
| In-Memory | Flexible | Limited (RAM) | Very fast | Caching, ML |
| Object Storage | No | Unlimited | Medium | Archival, media |

Section 2: Data Quality

2.1 What is Data Quality?

Definition:

Data Quality refers to the degree to which data is accurate, complete, consistent, timely, and fit for its intended use. Poor data quality leads to wrong analysis, bad decisions, and loss of trust.

"Garbage In, Garbage Out (GIGO)" - low-quality input always produces unreliable output.


2.2 Dimensions of Data Quality

| Dimension | Definition | Example of Problem |
|-----------|------------|--------------------|
| Accuracy | Data correctly represents real-world facts | Age = 250 years |
| Completeness | No missing values | Phone number field is blank |
| Consistency | Same data looks the same across sources | "Mumbai" vs "Bombay" |
| Timeliness | Data is up-to-date | Customer address from 5 years ago |
| Validity | Data conforms to defined formats/rules | Email without "@" |
| Uniqueness | No duplicate records | Same customer stored twice |
| Integrity | Relationships between data are correct | Order with no matching customer |

2.3 Common Data Quality Problems

  1. Missing values - null or blank fields.
  2. Duplicates - same record stored more than once.
  3. Inconsistent formats - date as "01/05/2026" vs "2026-05-01".
  4. Outliers - values far outside the normal range (may be errors).
  5. Stale data - outdated information no longer valid.
  6. Referential integrity violations - foreign key points to a non-existent record.
  7. Encoding issues - special characters corrupted.
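Several of the problems above can be caught with simple programmatic checks. A minimal sketch in plain Python (the record fields and rules are illustrative, not a standard schema):

```python
import re

def check_record(record):
    """Return a list of data-quality issues found in one customer record."""
    issues = []
    # Completeness: required fields must not be missing or blank
    for field in ("name", "email", "age"):
        if record.get(field) in (None, ""):
            issues.append(f"missing:{field}")
    # Validity: email must look like an address
    email = record.get("email") or ""
    if email and not re.match(r"[^@]+@[^@]+\.[^@]+", email):
        issues.append("invalid:email")
    # Accuracy / domain: age must fall in a plausible range
    age = record.get("age")
    if age is not None and not (0 <= age <= 150):
        issues.append("out_of_range:age")
    return issues

def find_duplicates(records, key):
    """Uniqueness: report key values that appear more than once."""
    seen, dupes = set(), set()
    for r in records:
        k = r.get(key)
        if k in seen:
            dupes.add(k)
        seen.add(k)
    return dupes
```

For example, `check_record({"name": "Deepak", "email": "no-at-sign", "age": 250})` flags both an invalid email and an out-of-range age.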

2.4 Data Quality Management

Steps to ensure data quality:

1. Profiling  → Analyze data to understand its current state
2. Cleansing  → Fix or remove bad data
3. Validation → Enforce rules at entry / ingestion
4. Monitoring → Continuously check data quality over time
5. Governance → Policies on who owns and is responsible for data

Tools: Apache Griffin, Great Expectations, Talend Data Quality, Informatica.


Section 3: Data Operations

3.1 What are Data Operations?

Data Operations refers to the set of operations performed on data throughout its lifecycle, from creation to deletion. (The related term DataOps usually denotes applying DevOps-style automation to data pipelines, which is broader than the individual operations listed here.)


3.2 Core Data Operations

| Operation | Description | SQL Equivalent |
|-----------|-------------|----------------|
| Create | Insert new data | INSERT |
| Read | Query and retrieve data | SELECT |
| Update | Modify existing data | UPDATE |
| Delete | Remove data | DELETE |
| Aggregate | Summarize data (sum, avg, count) | GROUP BY |
| Filter | Select a subset of data | WHERE |
| Join | Combine data from multiple sources | JOIN |
| Sort | Order data by field | ORDER BY |
| Transform | Change data format or structure | ETL transform |

3.3 Types of Data Operations in Big Data

Batch Operations

  • Operate on large datasets at once.
  • Scheduled jobs (nightly, weekly).
  • Example: Monthly billing, end-of-day reports.

Stream Operations

  • Operate on data as it flows in real time.
  • Low latency.
  • Example: Fraud alerts, live traffic routing.

Interactive / Ad-hoc Operations

  • User-driven queries on demand.
  • Example: Analyst runs a query to explore a dataset.
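The difference between batch and stream operations can be seen with a toy aggregation; the event values are made up for illustration:

```python
# Batch: the whole dataset exists up front; process it in one pass.
def batch_total(amounts):
    return sum(amounts)

# Stream: events arrive one at a time; keep running state and
# emit an updated result after every event (low latency).
class StreamTotal:
    def __init__(self):
        self.total = 0

    def on_event(self, amount):
        self.total += amount
        return self.total  # available immediately, not at end of day

events = [100, 250, 75]
assert batch_total(events) == 425            # one answer, after all data is in

s = StreamTotal()
running = [s.on_event(e) for e in events]
assert running == [100, 350, 425]            # an answer after every event
```

Both compute the same final total; the stream version simply makes intermediate results available as data arrives.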

Section 4: Data Ingestion

4.1 What is Data Ingestion?

Definition:

Data Ingestion is the process of importing data from various sources into a storage or processing system for immediate use or later analysis.

It is the first step of any Big Data pipeline.


4.2 Types of Data Ingestion

1. Batch Ingestion

  • Data is collected over a period and loaded in chunks.
  • Scheduled at fixed intervals (hourly, daily, weekly).
Source → Collect for 24 hours → Load all at once → Storage

Best for: Historical analysis, non-time-sensitive data.
Tools: Sqoop, Spark batch jobs, Airflow.


2. Real-Time / Streaming Ingestion

  • Data is ingested and processed continuously as it arrives.
  • Very low latency (milliseconds to seconds).
Source → Kafka → Stream Processor → Storage / Dashboard

Best for: Fraud detection, live monitoring, IoT.
Tools: Apache Kafka, Amazon Kinesis, Apache Flume.


3. Micro-Batch Ingestion

  • A middle ground - data collected in very small batches (every few seconds/minutes).
  • Example: Spark Structured Streaming.

4.3 Data Ingestion Challenges

| Challenge | Description |
|-----------|-------------|
| Schema mismatch | Source and target formats don't match |
| Data volume spikes | Sudden surge overwhelms the ingestion system |
| Latency | Delay between data creation and availability |
| Data loss | Network failure during ingestion |
| Duplicate data | Same message ingested more than once |

Solutions: Kafka offset tracking, idempotent writes, schema registries, checkpointing.
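One of the listed fixes, idempotent writes, can be sketched as a sink that remembers processed message IDs so a redelivered message has no second effect. The message shape is illustrative, not any particular broker's format:

```python
class IdempotentSink:
    """Store records keyed by message ID; redeliveries are ignored."""

    def __init__(self):
        self.seen_ids = set()   # in practice: a durable store, not memory
        self.rows = []

    def ingest(self, message):
        msg_id = message["id"]
        if msg_id in self.seen_ids:
            return False        # duplicate delivery: drop it silently
        self.seen_ids.add(msg_id)
        self.rows.append(message["payload"])
        return True

sink = IdempotentSink()
sink.ingest({"id": "m1", "payload": "order#42"})
sink.ingest({"id": "m1", "payload": "order#42"})   # simulated redelivery
assert sink.rows == ["order#42"]                   # ingested exactly once
```

This is why "at-least-once delivery + idempotent writes" gives effectively-once processing.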


Section 5: Scalability and Security

5.1 Scalability in Big Data

(Refer to Unit 1, Section 4 for full detail on vertical vs horizontal scaling.)

Key Scalability Techniques in Big Data:

| Technique | Description |
|-----------|-------------|
| Partitioning | Split data across nodes by key (e.g., by date, region) |
| Sharding | Horizontal partitioning across database instances |
| Replication | Multiple copies of data for availability and read performance |
| Load Balancing | Distribute requests evenly across nodes |
| Auto-scaling | Automatically add/remove nodes based on load (cloud) |
| Caching | Store frequently accessed data in memory |
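Partitioning by key reduces to a few lines of code: hash the key and take it modulo the node count. A sketch (the key names and partition count are illustrative):

```python
import hashlib

def partition_for(key, num_partitions):
    """Route a record key to a partition with a stable hash.
    (A stable hash, unlike Python's built-in hash(), gives the same
    answer across runs and machines.)"""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# All records for one customer land on the same partition,
# so per-key reads touch a single node.
p1 = partition_for("customer:42", 4)
p2 = partition_for("customer:42", 4)
assert p1 == p2
assert 0 <= p1 < 4
```

Real systems often add consistent hashing on top so that changing `num_partitions` does not reshuffle every key.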

5.2 Security in Big Data

Big Data systems deal with sensitive personal, financial, and health data, making security critical.

Key Security Dimensions:

1. Authentication

  • Verify who is accessing the system.
  • Tools: Kerberos (Hadoop), LDAP, OAuth.

2. Authorization

  • Control what each user/role can do.
  • Tools: Apache Ranger, Apache Sentry (Hadoop).
  • Example: Analyst can READ data but not DELETE it.

3. Encryption

| Type | Description | Example |
|------|-------------|---------|
| At Rest | Data encrypted on disk | HDFS transparent encryption |
| In Transit | Data encrypted over the network | TLS/SSL for Kafka |
| End-to-End | Encrypted from source to destination | Zero-knowledge systems |

4. Auditing

  • Log all access and operations for compliance.
  • Who accessed what data, when, and from where.
  • Tools: Apache Ranger audit logs, AWS CloudTrail.

5. Data Masking

  • Replace sensitive data with realistic fake data for testing.
  • Example: Show "XXXX-XXXX-XXXX-1234" instead of the full card number.
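A card-masking helper like the example above might be sketched as follows (purely illustrative, not any particular tool's API):

```python
def mask_card(number):
    """Keep only the last 4 digits; replace the rest with X."""
    digits = number.replace("-", "").replace(" ", "")
    masked = "X" * (len(digits) - 4) + digits[-4:]
    # Re-insert separators in groups of 4 for display
    return "-".join(masked[i:i + 4] for i in range(0, len(masked), 4))

assert mask_card("4111-1111-1111-1234") == "XXXX-XXXX-XXXX-1234"
```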

6. Compliance

  • Laws and regulations govern data handling.
  • GDPR (EU) - right to access, right to delete.
  • HIPAA (US) - health data protection.
  • PCI-DSS - payment card data protection.

Section 6: Traditional DBMS vs Big Data Management Systems

6.1 Overview

Traditional DBMS was designed for an era of smaller, structured data on a single machine. Big Data Management Systems were designed for massive, diverse data across clusters of machines.


6.2 Detailed Comparison

| Feature | Traditional DBMS | Big Data Management System |
|---------|------------------|----------------------------|
| Data Volume | GB to small TB | TB to exabytes |
| Data Types | Structured only | Structured + semi-structured + unstructured |
| Schema | Fixed, schema-on-write | Flexible, schema-on-read |
| Scaling | Vertical (bigger machine) | Horizontal (more machines) |
| Processing | Single node | Distributed across a cluster |
| Query Language | SQL | SQL + MapReduce + dataflow |
| Consistency | Strong (ACID) | Eventual (BASE) |
| Fault Tolerance | Limited | Built-in (replication) |
| Cost | Expensive hardware | Commodity servers |
| Speed | Fast for small data | Fast for large-scale batch |
| Real-time | Good | Improving (Kafka, Spark) |
| Examples | MySQL, Oracle, PostgreSQL | Hadoop, HBase, Cassandra, Spark |

6.3 When to Use What?

| Use Traditional DBMS when... | Use Big Data System when... |
|------------------------------|------------------------------|
| Data is structured and small | Data is TB/PB scale |
| ACID transactions are critical | Flexibility > strict consistency |
| Complex joins are needed | High write throughput is needed |
| Mature tooling is required | Data has varied formats |
| Single-site deployment | Distributed, multi-node deployment is needed |

Section 7: Real-Life Applications

7.1 Applications of Big Data Modeling & Management

1. Healthcare:

  • Electronic Health Records (EHR) - managing patient data at hospital scale.
  • Genomics - storing and querying a roughly 3 GB genome per patient.
  • Clinical trial management - tracking outcomes across thousands of patients.

2. Finance:

  • Transaction management - millions of records per second, ACID critical.
  • Risk modeling - large-scale Monte Carlo simulations on historical data.
  • Regulatory compliance - storing 7+ years of transaction history.

3. Retail / E-commerce:

  • Product catalog management - millions of SKUs with varied attributes (document DB).
  • Customer 360 - unified view of the customer across channels.
  • Real-time inventory - Cassandra/HBase for low-latency stock updates.

4. Telecommunications:

  • Call Detail Records (CDR) - billions of records daily, column-family stores.
  • Network topology - graph databases for network maps.

5. Social Media:

  • User profiles - document stores (MongoDB).
  • Connections/followers - graph databases (Neo4j).
  • Posts and feeds - time-series column stores (Cassandra).

Section 8: Data Models

8.1 What is a Data Model?

Definition:

A Data Model is an abstract representation of how data is organized, stored, and accessed. It defines:

  • The structure of data (how it is organized).
  • The operations that can be performed on it.
  • The constraints (rules) that data must satisfy.

Purpose of a Data Model:

  • Provides a common language between developers and stakeholders.
  • Guides database design.
  • Ensures data integrity and consistency.

8.2 Three Components of a Data Model

1. Structure

  • How data is organized and stored.
  • Defines entities, attributes, and relationships.
Student (StudentID, Name, Age, CourseID)
Course  (CourseID, CourseName, Credits)

2. Operations

  • What you can do with the data.
  • CRUD operations: Create, Read, Update, Delete.
  • Queries, aggregations, transformations.
SELECT Name FROM Student WHERE Age > 20;
UPDATE Student SET Age = 23 WHERE StudentID = 001;

3. Constraints

  • Rules that data must follow to remain valid.
| Constraint Type | Example |
|-----------------|---------|
| Domain | Age must be between 0 and 150 |
| Key | StudentID must be unique (primary key) |
| Referential | CourseID in Student must exist in the Course table |
| Not Null | Name cannot be empty |
| Check | Salary must be > 0 |
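These constraint types can be enforced in application code as well as by the database. A minimal sketch (the table contents and course IDs are illustrative):

```python
courses = {"C101", "C102"}     # existing Course table keys
students = {}                  # StudentID -> (name, age, course_id)

def insert_student(student_id, name, age, course_id):
    if student_id in students:             # Key: primary key must be unique
        raise ValueError("duplicate StudentID")
    if not name:                           # Not Null: name cannot be empty
        raise ValueError("name is required")
    if not 0 <= age <= 150:                # Domain: plausible age range
        raise ValueError("age out of range")
    if course_id not in courses:           # Referential: CourseID must exist
        raise ValueError("unknown CourseID")
    students[student_id] = (name, age, course_id)

insert_student("001", "Deepak", 22, "C101")   # passes every check
try:
    insert_student("002", "Ankit", 22, "C999")  # violates referential constraint
except ValueError as e:
    print(e)    # unknown CourseID
```

In an RDBMS the same rules would live in the schema (PRIMARY KEY, NOT NULL, CHECK, FOREIGN KEY) and be enforced on every write automatically.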

8.3 Types of Big Data Models

1. Relational Model

  • Data organized in tables (relations) with rows and columns.
  • Uses SQL.
  • Strong schema, ACID properties.
Student Table:
| ID  | Name   | Age |
|-----|--------|-----|
| 001 | Deepak | 22  |

When to use: Structured data, transactional systems.


2. Key-Value Model

  • Data stored as (key, value) pairs.
  • No structure to the value - it's a blob.
  • Extremely fast lookup.
"student:001" → '{"name":"Deepak","age":22}'
"session:xyz" → "active"

When to use: Caching, session management, shopping carts.
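At its core a key-value store is a dictionary with GET/SET, often plus per-key expiry. A toy version (loosely modeled on Redis-style TTLs, but not its API):

```python
import time

class KVStore:
    """Toy key-value store with optional per-key expiry."""

    def __init__(self):
        self._data = {}   # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        expires = time.time() + ttl_seconds if ttl_seconds else None
        self._data[key] = (value, expires)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and time.time() > expires:
            del self._data[key]   # lazily expire, as many caches do
            return None
        return value

kv = KVStore()
kv.set("session:xyz", "active")
assert kv.get("session:xyz") == "active"
assert kv.get("missing") is None
```

The value is opaque to the store, which is exactly why lookups are so fast: there is nothing to parse or index beyond the key.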


3. Document Model

  • Data stored as self-describing documents (JSON, BSON, XML).
  • Each document can have different fields.
  • Nested structures supported.
{
  "_id": "001",
  "name": "Deepak",
  "courses": ["ML", "BDA"],
  "address": {"city": "Jaipur"}
}

When to use: Content management, user profiles, product catalogs.


4. Column-Family Model

  • Data stored in rows but grouped by column families.
  • Optimized for reading/writing specific columns.
  • Sparse data (rows can have different columns).
RowKey    | PersonalInfo         | Academics
----------+----------------------+-----------------
student01 | name=Deepak, age=22  | math=90
student02 | name=Ankit           | math=75, sci=80

When to use: Time-series, IoT, messaging, analytics.
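The sparse-row idea can be modeled as a dict of column families per row key; a toy in-memory analogue of a column-family table (the data matches the sketch above):

```python
# row key -> column family -> column -> value; absent columns cost nothing
table = {
    "student01": {"PersonalInfo": {"name": "Deepak", "age": 22},
                  "Academics":    {"math": 90}},
    "student02": {"PersonalInfo": {"name": "Ankit"},
                  "Academics":    {"math": 75, "sci": 80}},
}

def get_cell(row_key, family, column):
    """Read one cell by (row key, column family, column)."""
    return table.get(row_key, {}).get(family, {}).get(column)

assert get_cell("student01", "Academics", "math") == 90
assert get_cell("student01", "Academics", "sci") is None   # sparse: cell absent
```

Because each row stores only the columns it actually has, rows with wildly different shapes coexist in one table without wasted space.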


5. Graph Model

  • Data stored as nodes (entities) and edges (relationships).
  • Best when relationships between data are as important as the data itself.
(Deepak) --FRIENDS--> (Ankit)
(Deepak) --ENROLLED--> (ML Course)

When to use: Social networks, fraud detection, recommendation engines.
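The node-and-edge structure maps naturally onto an adjacency list. A toy friend-of-friend query, the kind a recommendation engine runs as a graph traversal (the names, including "Riya", are illustrative):

```python
FRIENDS = {
    "Deepak": ["Ankit"],
    "Ankit":  ["Riya"],
    "Riya":   [],
}

def friends_of_friends(person):
    """People two hops away who are not already direct friends."""
    direct = set(FRIENDS.get(person, []))
    result = set()
    for friend in direct:
        for fof in FRIENDS.get(friend, []):
            if fof != person and fof not in direct:
                result.add(fof)
    return result

assert friends_of_friends("Deepak") == {"Riya"}   # suggest Riya to Deepak
```

A graph database generalizes this: multi-hop traversals stay cheap because edges are stored as direct pointers, instead of requiring repeated joins as in a relational model.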


6. Array / Vector Model

  • Data stored in multi-dimensional arrays.
  • Used in scientific computing and geospatial data.
  • Example: Satellite imagery stored as 3D arrays (lat × lon × time).

When to use: Scientific data, geospatial, raster images.


8.4 Types of Big Data Models - Summary Table

| Model | Structure | Query Style | Example System | Best Use Case |
|-------|-----------|-------------|----------------|---------------|
| Relational | Tables | SQL | MySQL, Redshift | Transactions |
| Key-Value | Key → Value | GET/SET | Redis, DynamoDB | Caching |
| Document | JSON/BSON docs | Document query | MongoDB | Profiles, catalogs |
| Column-Family | Column groups | Row + column key | Cassandra, HBase | IoT, time-series |
| Graph | Nodes + edges | Graph traversal | Neo4j, Neptune | Social, networks |
| Array | Multi-dim array | Array slicing | SciDB, NetCDF | Scientific, GIS |

Quick Revision Points

Data Storage Types:

  • File (HDFS/S3), Relational, NoSQL, In-Memory, Object Storage.

Data Quality Dimensions:

  • Accuracy, Completeness, Consistency, Timeliness, Validity, Uniqueness, Integrity.

Data Ingestion Types:

  • Batch (Sqoop), Streaming (Kafka), Micro-batch (Spark Streaming).

Scalability Techniques:

  • Partitioning, Sharding, Replication, Load Balancing, Caching, Auto-scaling.

Security Layers:

  • Authentication, Authorization, Encryption (at rest + in transit), Auditing, Masking, Compliance.

Traditional DBMS vs Big Data:

| | DBMS | Big Data |
|---|------|----------|
| Scale | GB | TB-EB |
| Schema | Fixed | Flexible |
| Scaling | Vertical | Horizontal |
| Consistency | ACID | BASE |

Data Model Components:

  • Structure → how data is organized
  • Operations → what you can do with it
  • Constraints → rules the data must follow

Types of Big Data Models:

Relational → Key-Value → Document → Column-Family → Graph → Array


Expected Exam Questions

PYQs will be added after analysis - check back soon.


These notes were compiled by Deepak Modi
Last updated: May 2026

Found an error or want to contribute?

This content is open-source and maintained by the community. Help us improve it!