Syllabus:
Introduction to Programming Tools for Data Science: Toolkits using Python: Matplotlib, NumPy, Scikit-learn, NLTK, Visualizing Data: Bar Charts, Line Charts, Scatterplots, Working with data: Reading Files, Scraping the Web.
I. Introduction to Programming Tools for Data Science
Data Science requires various programming tools to handle data, analyze it, and create models. Python is the most widely used language in Data Science due to its simplicity, vast libraries, and strong community support.
Why Python for Data Science?
✅ Easy to Learn – Simple syntax, beginner-friendly.
✅ Rich Libraries – Provides powerful libraries like NumPy, Pandas, Matplotlib, etc.
✅ Open-Source & Free – No licensing cost, available for everyone.
✅ Scalability – Used in small scripts as well as large-scale applications.
II. Toolkits Using Python
Python provides various libraries (toolkits) that help in data analysis, visualization, and machine learning. Let’s explore some important ones:
A. Matplotlib – Data Visualization
1. What is Matplotlib?
Matplotlib is a Python library for data visualization. It creates charts, graphs, and plots that reveal trends, patterns, and relationships in data. It is widely used in Data Science, Machine Learning, and AI.
✅ Why Use Matplotlib?
- Easy to use – Simple syntax like MATLAB.
- Highly customizable – Change colors, labels, styles, and sizes.
- Works with other libraries – Compatible with NumPy, Pandas, and Scikit-learn.
- Supports multiple chart types – Line charts, bar charts, scatter plots, histograms, etc.
🔹 Installation:
Use the following command to install Matplotlib: pip install matplotlib
🔹 Importing Matplotlib: import matplotlib.pyplot as plt
pyplot is a module in Matplotlib that provides simple functions to create plots.
2. Basic Plotting with Matplotlib
Matplotlib provides various functions to create different types of plots.
(i) Line Plot – Showing Trends
A line chart is used to display data over time (e.g., stock prices, temperature changes).
import matplotlib.pyplot as plt
# Data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]
# Plot
plt.plot(x, y, marker='o', linestyle='--', color='r', label="Growth")
plt.title("Simple Line Plot")
plt.xlabel("Time")
plt.ylabel("Value")
plt.legend() # Show legend
plt.grid(True) # Add grid
plt.show()
✅ Customizations:
- marker='o' → Circular markers at data points.
- linestyle='--' → Dashed line style.
- color='r' → Red line color.
(ii) Bar Chart – Comparing Categories
A bar chart is useful for comparing different categories (e.g., sales in different regions).
# Data
categories = ['Apple', 'Banana', 'Mango']
values = [20, 15, 25]
# Plot
plt.bar(categories, values, color=['red', 'yellow', 'orange'])
plt.title("Fruit Sales")
plt.xlabel("Fruit")
plt.ylabel("Sales")
plt.show()
✅ Customizations:
- color → Sets different colors for the bars.
- xlabel, ylabel, title → Labels for better understanding.
(iii) Scatter Plot – Relationship Between Two Variables
A scatter plot is used to show correlations between two numerical values or variables (e.g., height vs weight).
# Data
x = [10, 20, 30, 40, 50]
y = [15, 25, 35, 45, 55]
# Plot
plt.scatter(x, y, color='blue', marker='*', s=100)
plt.title("Scatter Plot Example")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.show()
✅ Customizations:
- marker='*' → Star-shaped markers.
- s=100 → Size of the markers.
(iv) Histogram – Distribution of Data
A histogram shows the distribution of data values (e.g., exam scores of students).
import numpy as np
data = np.random.randn(1000) # Generate 1000 random numbers
# Plot
plt.hist(data, bins=30, color='cyan', edgecolor='black')
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram Example")
plt.show()
✅ Customizations:
- bins=30 → Number of bars in the histogram.
- edgecolor='black' → Black borders for better clarity.
3. Multiple Plots in One Figure
We can display multiple charts in one figure using subplot().
x = [1, 2, 3, 4, 5]
y1 = [5, 10, 15, 20, 25]
y2 = [10, 5, 30, 25, 15]
plt.figure(figsize=(10, 4)) # Set figure size
# First subplot (Line chart)
plt.subplot(1, 2, 1) # (rows, columns, index)
plt.plot(x, y1, color='blue')
plt.title("Line Chart")
# Second subplot (Bar chart)
plt.subplot(1, 2, 2)
plt.bar(x, y2, color='green')
plt.title("Bar Chart")
plt.show()
✅ Explanation:
- subplot(1, 2, 1) → First subplot (1 row, 2 columns, 1st plot).
- subplot(1, 2, 2) → Second subplot (1 row, 2 columns, 2nd plot).
- figsize=(10, 4) → Sets figure size (width=10, height=4).
4. Customizing Matplotlib Charts
You can style your graphs using the following methods:
✔️ Colors (color='red')
✔️ Line styles (linestyle='--')
✔️ Markers (marker='o')
✔️ Legends (plt.legend())
✔️ Gridlines (plt.grid(True))
x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 16, 32]
plt.plot(x, y, marker='o', linestyle='-', color='purple', label="Growth")
plt.xlabel("Time")
plt.ylabel("Value")
plt.title("Line Chart with Grid")
plt.legend() # Adds a label legend
plt.grid(True) # Adds a background grid
plt.show()
✅ Customization Features:
- grid(True) → Adds a grid for better readability.
- legend() → Shows labels for different lines.
- title(), xlabel(), ylabel() → Labels for clarity.
5. Saving Plots as Images
You can save your plots as images for reports and presentations.
plt.plot([1, 2, 3, 4], [10, 20, 30, 40])
plt.title("Saved Chart")
plt.savefig("chart.png") # Saves as PNG file
plt.show()
✅ Formats Available: PNG, JPG, PDF, SVG, etc.
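The output format is inferred from the file extension, and dpi controls the resolution. A minimal sketch (the Agg backend is selected here only so the script also runs without a display; the file names are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts and servers
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 20, 30, 40])
plt.title("Saved Chart")
plt.savefig("chart.pdf")           # format inferred from the .pdf extension
plt.savefig("chart.png", dpi=300)  # higher resolution, useful for print
```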
6. Interactive Plots with plt.ion()
Matplotlib also supports interactive mode, in which a graph can be updated dynamically.
plt.ion() # Turn on interactive mode
for i in range(5):
    plt.plot([1, 2, 3, 4], [i*10, i*20, i*30, i*40])
    plt.pause(1) # Pause for 1 second
plt.ioff() # Turn off interactive mode
plt.show()
B. NumPy – Numerical Computing
1. What is NumPy?
NumPy (Numerical Python) is a Python library for handling numerical data efficiently. It provides multi-dimensional arrays, mathematical functions, and linear algebra operations for working with large datasets. It is used in Data Science, Machine Learning, AI, and Scientific Computing.
✅ Why Use NumPy?
- Faster than Python lists – Uses C-based optimized functions.
- Supports multi-dimensional arrays – Works like matrices in mathematics.
- Has built-in mathematical functions – Supports mean, sum, min, max, standard deviation, etc.
- Works well with other libraries – Pandas, Matplotlib, Scikit-Learn, etc.
🔹 Installation: To install NumPy, run: pip install numpy
🔹 Importing NumPy: import numpy as np
2. Creating Arrays in NumPy
NumPy uses the ndarray (N-dimensional array) to store data. Arrays are faster and more memory-efficient than Python lists.
(i) Creating a 1D Array (Vector)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print("Array:", arr) # Output: [1 2 3 4 5]
print("Type:", type(arr)) # <class 'numpy.ndarray'>
(ii) Creating a 2D Array (Matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", matrix)
✅ Explanation:
- 1D array → [1, 2, 3, 4, 5] (like a simple list).
- 2D array → Works like a matrix.
3. Array Properties
Once we create an array, we can check its size, shape, and data type.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape:", arr.shape) # (2, 3) → 2 rows, 3 columns
print("Size:", arr.size) # 6 elements
print("Data Type:", arr.dtype) # int64 (depends on input)
4. Creating Special Arrays
(i) Array of Zeros
zeros = np.zeros((2, 3))
print(zeros)
🔹 Output: A 2×3 matrix filled with 0s.
(ii) Array of Ones
ones = np.ones((2, 4))
print(ones)
🔹 Output: A 2×4 matrix filled with 1s.
(iii) Identity Matrix (Square Matrix with Diagonal 1s)
identity = np.eye(4)
print(identity)
(iv) Random Numbers
random_numbers = np.random.rand(3, 3) # 3×3 matrix with random values
print(random_numbers)
5. Accessing Elements in NumPy Arrays
(i) Accessing Elements (Indexing & Slicing)
arr = np.array([10, 20, 30, 40, 50])
print(arr[0]) # First element → 10
print(arr[-1]) # Last element → 50
print(arr[1:4]) # Elements from index 1 to 3 → [20, 30, 40]
(ii) Accessing Elements in 2D Arrays
matrix = np.array([[10, 20, 30], [40, 50, 60]])
print(matrix[1, 2]) # Row 1, Column 2 → 60
6. Mathematical Operations in NumPy
(i) Basic Arithmetic Operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print("Addition:", arr1 + arr2) # [5 7 9]
print("Multiplication:", arr1 * arr2) # [4 10 18]
print("Power:", arr1 ** 2) # [1 4 9]
(ii) Aggregate Functions (Sum, Mean, Min, Max, Std Dev)
arr = np.array([10, 20, 30, 40, 50])
print("Sum:", np.sum(arr))
print("Mean:", np.mean(arr))
print("Min:", np.min(arr))
print("Max:", np.max(arr))
print("Standard Deviation:", np.std(arr))
(iii) Matrix Operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("Matrix Multiplication:\n", np.dot(A, B))
print("Transpose:\n", A.T)
✅ Explanation:
- np.dot(A, B) → Performs matrix multiplication.
- A.T → Transpose of the matrix.
7. Reshaping and Resizing Arrays
(i) Reshaping a 1D Array into 2D
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, 3) # Convert to 2×3 matrix
print(reshaped)
✅ Reshaping allows us to change dimensions of an array without changing data.
(ii) Flattening a 2D Array into 1D
matrix = np.array([[1, 2, 3], [4, 5, 6]])
flattened = matrix.flatten()
print(flattened) # [1 2 3 4 5 6]
8. Filtering and Conditional Selection in NumPy
NumPy allows filtering elements based on conditions.
arr = np.array([10, 20, 30, 40, 50])
print(arr[arr > 25]) # Elements greater than 25 → [30 40 50]
9. Saving and Loading Data in NumPy
(i) Saving NumPy Arrays to Files
arr = np.array([10, 20, 30, 40])
np.save("data.npy", arr) # Saves as .npy file
(ii) Loading Saved Arrays
loaded_arr = np.load("data.npy")
print(loaded_arr)
10. Useful NumPy Functions
a) arange() – Create a Range of Numbers
arr = np.arange(1, 10, 2) # Start=1, Stop=10, Step=2
print(arr) # Output: [1 3 5 7 9]
b) linspace() – Generate Evenly Spaced Numbers
arr = np.linspace(0, 100, 5) # 5 evenly spaced numbers between 0 and 100
print(arr) # Output: [ 0. 25. 50. 75. 100.]
c) reshape() – Change Shape of Array
arr = np.arange(1, 10).reshape(3, 3) # 3x3 Matrix
print(arr)
Output:
[[1 2 3]
[4 5 6]
[7 8 9]]
d) random() – Generate Random Numbers
rand_arr = np.random.rand(3, 3) # 3x3 matrix with random numbers
print(rand_arr)
11. Indexing and Slicing Arrays
NumPy allows fast indexing and slicing, similar to lists.
arr = np.array([10, 20, 30, 40, 50])
print(arr[1]) # Output: 20
print(arr[1:4]) # Output: [20 30 40]
For 2D arrays (matrices):
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix[1, 2]) # Row index 1, Column index 2 → Output: 6
12. Statistical Operations
NumPy makes statistical analysis easy.
arr = np.array([10, 20, 30, 40, 50])
print(np.mean(arr)) # Output: 30.0 (Mean)
print(np.median(arr)) # Output: 30.0 (Median)
print(np.std(arr)) # Standard Deviation
print(np.sum(arr)) # Sum of all elements
print(np.max(arr)) # Maximum value
print(np.min(arr)) # Minimum value
13. Matrix Operations (Linear Algebra)
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B)) # Matrix multiplication
print(np.linalg.inv(A)) # Inverse of matrix A
print(np.linalg.det(A)) # Determinant of A
C. Scikit-Learn – Machine Learning Toolkit
1. What is Scikit-Learn?
Scikit-Learn (sklearn) is a Python library for Machine Learning (ML). It provides simple and efficient tools for data mining and analysis. It is built on NumPy, SciPy, and Matplotlib. It is used for classification, regression, clustering, dimensionality reduction, and preprocessing.
✅ Why Use Scikit-Learn?
- Easy to use – Simple functions for ML models.
- Supports multiple ML algorithms – Regression, Classification, Clustering, etc.
- Works well with NumPy & Pandas – Efficient data handling.
- Has built-in functions for data preprocessing – Scaling, encoding, splitting, etc.
🔹 Installation: To install Scikit-Learn, run: pip install scikit-learn
🔹 Importing Scikit-Learn: import sklearn
2. Machine Learning Workflow Using Scikit-Learn
Scikit-Learn follows a 5-step ML workflow:
1️⃣ Import Dataset – Load and explore the dataset.
2️⃣ Preprocess Data – Handle missing values, scale data, encode categorical features.
3️⃣ Split Dataset – Divide data into training and testing sets.
4️⃣ Train Model – Use ML algorithms to learn patterns.
5️⃣ Evaluate Model – Measure accuracy and performance.
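The five steps above can be sketched end to end on the built-in Iris dataset (the choice of LogisticRegression as the model is just an illustration, not part of the workflow itself):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Import dataset
X, y = load_iris(return_X_y=True)

# 2. Preprocess: scale features to mean 0, variance 1
X = StandardScaler().fit_transform(X)

# 3. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# 5. Evaluate
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```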
3. Loading Datasets in Scikit-Learn
Scikit-Learn provides built-in datasets like Iris, Titanic, Wine Quality, etc.
🔹 Example: Loading the Iris Dataset
from sklearn.datasets import load_iris
iris = load_iris()
print("Feature Names:", iris.feature_names)
print("Target Names:", iris.target_names)
print("First 5 Rows of Data:\n", iris.data[:5])
✅ Explanation:
- iris.feature_names → Names of the features in the dataset.
- iris.target_names → Categories to predict.
- iris.data[:5] → First five rows of data.
4. Supervised Learning in Scikit-Learn
a) Linear Regression (Predicting Continuous Values)
Used for predicting numerical values like house prices or stock prices.
🔹 Example: Predict House Prices
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Data (Square feet vs. Price in lakhs)
X = np.array([500, 1000, 1500, 2000, 2500]).reshape(-1, 1)
y = np.array([50, 100, 150, 200, 250])
# Create and train model
model = LinearRegression()
model.fit(X, y)
# Predict price for 1800 sq.ft
predicted_price = model.predict([[1800]])
print(f"Predicted Price: {predicted_price[0]} Lakhs")
# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.xlabel("Square Feet")
plt.ylabel("Price (Lakhs)")
plt.title("Linear Regression - House Price Prediction")
plt.show()
✅ Output: Predicted Price: 180.0 Lakhs (approximately; floating-point rounding may print 179.99…)
b) Logistic Regression (Binary Classification - Yes/No Prediction)
Used for classifying data into categories (e.g., pass/fail, spam/not spam).
🔹 Example: Predict Whether a Student Will Pass or Fail
from sklearn.linear_model import LogisticRegression
# Data: [Hours Studied], 1 = Pass, 0 = Fail
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
# Train Model
model = LogisticRegression()
model.fit(X, y)
# Predict outcome for a student who studied 4.5 hours
predicted_outcome = model.predict([[4.5]])
print(f"Predicted Outcome: {'Pass' if predicted_outcome[0] == 1 else 'Fail'}")
✅ Output: Predicted Outcome: Pass
5. Unsupervised Learning in Scikit-Learn
a) K-Means Clustering (Grouping Similar Data)
Used for grouping similar data points without labels (e.g., customer segmentation).
🔹 Example: Clustering Customers Based on Spending Patterns
from sklearn.cluster import KMeans
# Data: [Annual Income, Spending Score]
data = np.array([[25, 30], [30, 35], [35, 40], [45, 60], [50, 65], [55, 70]])
# Create Model
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
# Get cluster labels
labels = kmeans.labels_
print("Cluster Labels:", labels)
✅ Output: Cluster Labels: [0 0 0 1 1 1] (cluster numbering is arbitrary, so you may equally see [1 1 1 0 0 0])
6. Data Preprocessing in Scikit-Learn
Before training a model, we need to clean and prepare the data.
(i) Handling Missing Values
from sklearn.impute import SimpleImputer
import numpy as np
data = np.array([[10, 20, np.nan], [30, np.nan, 50], [np.nan, 40, 60]])
imputer = SimpleImputer(strategy='mean') # Replace NaN with column mean
cleaned_data = imputer.fit_transform(data)
print("Cleaned Data:\n", cleaned_data)
✅ Replaces missing values (NaN) with column mean.
Cleaned Data:
[[10. 20. 55.]
[30. 30. 50.]
[20. 40. 60.]]
(ii) Scaling Data (Standardization & Normalization)
Scaling makes sure all features have the same scale.
from sklearn.preprocessing import StandardScaler
data = np.array([[100, 10], [200, 20], [300, 30]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Scaled Data:\n", scaled_data)
✅ StandardScaler() converts values to mean = 0 and variance = 1.
Scaled Data:
[[-1.22474487 -1.22474487]
[ 0. 0. ]
[ 1.22474487 1.22474487]]
(iii) Encoding Categorical Data
If the dataset has text values, we need to convert them into numbers.
from sklearn.preprocessing import LabelEncoder
labels = ['cat', 'dog', 'fish', 'dog', 'cat']
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
print("Encoded Labels:", encoded_labels)
✅ Converts text categories ('cat', 'dog') into numbers (0, 1, 2, etc.).
Encoded Labels: [0 1 2 1 0]
7. Splitting Data into Training and Testing Sets
We split the dataset into two parts:
- Training Set (80%) → Used to train the model.
- Testing Set (20%) → Used to evaluate the model’s performance.
from sklearn.model_selection import train_test_split
X = iris.data # Features
y = iris.target # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)
✅ Splits data into 80% training and 20% testing.
8. Model Evaluation in Scikit-Learn
After training the model, we measure its accuracy and performance.
(i) Accuracy Score (For Classification Models)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train) # Train a classifier on the Iris split from Section 7
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
✅ Shows how many predictions were correct (0 to 1 scale).
(ii) Mean Squared Error (For Regression Models)
from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
✅ Measures the difference between actual and predicted values.
9. Saving & Loading Trained Models
We can save the trained model for future use.
import joblib
joblib.dump(model, "saved_model.pkl") # Save model
loaded_model = joblib.load("saved_model.pkl") # Load model
print("Prediction using Loaded Model:", loaded_model.predict([[6]]))
✅ Saves and loads the trained ML model.
D. NLTK – Natural Language Toolkit for Text Processing
1. What is NLTK?
NLTK (Natural Language Toolkit) is a Python library for Natural Language Processing (NLP). It helps in analyzing, processing, and understanding human language (text data). It is widely used in text processing tasks like chatbots, search engines, and AI assistants.
✅ Why Use NLTK?
- Tokenization – Breaking text into words/sentences.
- Stopword Removal – Removing unnecessary words (e.g., "the", "is").
- Stemming & Lemmatization – Converting words to their base form.
- POS Tagging – Identifying parts of speech (noun, verb, adjective, etc.).
- Sentiment Analysis – Determining positive/negative sentiment in text.
🔹 Installation: To install NLTK, run: pip install nltk
🔹 Importing NLTK:
import nltk
nltk.download('all') # Download all necessary datasets (optional)
This downloads corpora, stopwords, and tokenizers used for NLP tasks.
2. Tokenization (Splitting Text into Words/Sentences)
Tokenization means breaking text into words or sentences.
a) Word Tokenization (Splitting into Words)
import nltk
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is amazing!"
tokens = word_tokenize(text)
print(tokens)
# Output: ['Natural', 'Language', 'Processing', 'is', 'amazing', '!']
b) Sentence Tokenization (Splitting into Sentences)
from nltk.tokenize import sent_tokenize
text = "I love Python. NLP is very interesting!"
sentences = sent_tokenize(text)
print(sentences)
# Output: ['I love Python.', 'NLP is very interesting!']
3. Removing Stopwords (Unimportant Words like "is", "the", "and")
Stopwords are common words like "is", "the", "and", which don’t add much meaning to the sentence.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "This is a simple example of stopword removal."
words = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)
# Output: ['simple', 'example', 'stopword', 'removal', '.']
4. Stemming and Lemmatization
(i) Stemming - Reducing Words to Root Form
Stemming converts words to their root form, removing suffixes.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "flies", "easily", "studying"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
# Output: ['run', 'fli', 'easili', 'studi']
🔹 Issues with Stemming:
- "flies" → "fli" (incorrect)
- "easily" → "easili" (not meaningful)
(ii) Lemmatization – Getting Meaningful Root Words
Lemmatization converts words to their meaningful base form using a vocabulary (WordNet). Unlike stemming, it returns real words, so it generally gives better results.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "easily", "studying"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
# Output: ['running', 'fly', 'easily', 'studying']
🔹 Lemmatization produces proper words (e.g., "flies" → "fly").
5. Part-of-Speech (POS) Tagging
NLTK can tag each word with its part of speech (noun, verb, adjective, etc.).
from nltk import pos_tag
from nltk.tokenize import word_tokenize
text = "John is playing football."
words = word_tokenize(text)
pos_tags = pos_tag(words)
print(pos_tags)
# Output: [('John', 'NNP'), ('is', 'VBZ'), ('playing', 'VBG'), ('football', 'NN'), ('.', '.')]
🔹 POS Tags Explanation:
- NNP → Proper Noun
- VBZ → Verb (is)
- VBG → Verb (playing)
- NN → Noun (football)
6. Named Entity Recognition (NER) – Identifying Proper Names
Finds names of people, places, dates, etc.
from nltk import ne_chunk
text = "Elon Musk founded SpaceX in the United States."
words = word_tokenize(text)
pos_tags = pos_tag(words)
ner_tree = ne_chunk(pos_tags)
print(ner_tree)
This identifies "Elon Musk" as a person, "SpaceX" as an organization, and "United States" as a location.
7. Sentiment Analysis Using NLTK – Checking Positive or Negative Sentiment
Determines if text is positive, negative, or neutral.
Scores:
- pos → Positive sentiment
- neg → Negative sentiment
- neu → Neutral sentiment
- compound → Overall sentiment score
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
text = "I love machine learning. It is so exciting!"
score = sia.polarity_scores(text)
print(score)
Sample output (exact scores may vary slightly by NLTK version):
{'neg': 0.0, 'neu': 0.2, 'pos': 0.8, 'compound': 0.75}
Here the positive score is high (0.8), meaning the text is positive.
8. Text Preprocessing Pipeline (Complete Example)
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
text = "NLTK is a great tool for text processing! It helps in tokenization, stemming, and sentiment analysis."
# Step 1: Tokenization
words = word_tokenize(text)
# Step 2: Removing Stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
# Step 3: Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
# Step 4: POS Tagging
pos_tags = pos_tag(lemmatized_words)
print("Lemmatized Words:", lemmatized_words)
print("POS Tags:", pos_tags)
III. Visualizing Data (Charts & Graphs)
Visualization is a key step in data analysis. It helps us understand patterns, trends, and relationships in data. Python provides various libraries for data visualization, and Matplotlib is one of the most popular ones.
A. Bar Charts – Comparing Categories
1. What is a Bar Chart?
A bar chart (or bar graph) represents categorical data with rectangular bars. The height of each bar represents the frequency, value, or quantity of its category.
- X-axis (Horizontal) → Categories (e.g., Products, Countries, Days).
- Y-axis (Vertical) → Numerical values (e.g., Sales, Population, Temperature).
✅ Uses of Bar Charts:
- Comparing different categories easily
- Showing trends or patterns in data
- Useful for business analytics, survey results, and market research
2. Creating a Simple Bar Chart Using Matplotlib
Basic Bar Chart Example
import matplotlib.pyplot as plt
# Data
categories = ['Apple', 'Banana', 'Mango', 'Orange', 'Grapes']
values = [40, 70, 30, 55, 90]
# Create Bar Chart
plt.bar(categories, values, color='skyblue')
# Labels and Title
plt.xlabel("Fruits")
plt.ylabel("Quantity Sold")
plt.title("Fruit Sales Data")
# Show the Graph
plt.show()
3. Customizing Bar Charts
(i) Changing Bar Colors
plt.bar(categories, values, color=['red', 'yellow', 'green', 'orange', 'purple'])
✅ Each bar gets a different color.
(ii) Adding Edge Color and Width
plt.bar(categories, values, color='skyblue', edgecolor='black', linewidth=2)
✅ Adds a black outline to bars.
(iii) Rotating X-Axis Labels
plt.xticks(rotation=45) # Rotates labels 45 degrees
4. Horizontal Bar Chart
Instead of a vertical bar chart, we can use a horizontal bar chart using barh().
import matplotlib.pyplot as plt
# Data
categories = ['Apple', 'Banana', 'Mango', 'Orange', 'Grapes']
values = [40, 70, 30, 55, 90]
# Create Horizontal Bar Chart
plt.barh(categories, values, color='lightgreen')
plt.xlabel("Quantity Sold")
plt.ylabel("Fruits")
plt.title("Fruit Sales Data (Horizontal)")
plt.show()
✅ Useful when category labels are long.
5. Grouped Bar Chart (Comparing Multiple Categories)
A grouped bar chart compares different categories side by side.
import numpy as np
# Data
fruit_sales = ['Apple', 'Banana', 'Mango', 'Orange', 'Grapes']
values_2023 = [50, 60, 30, 45, 80]
values_2024 = [40, 70, 50, 55, 90]
# Positioning bars side by side
x = np.arange(len(fruit_sales))
plt.bar(x - 0.2, values_2023, width=0.4, label='2023', color='blue')
plt.bar(x + 0.2, values_2024, width=0.4, label='2024', color='orange')
# Labels and title
plt.xticks(x, fruit_sales)
plt.xlabel("Fruits")
plt.ylabel("Quantity Sold")
plt.title("Fruit Sales Comparison")
plt.legend()
plt.show()
✅ Compares sales for two different years.
6. Stacked Bar Chart (Showing Parts of a Whole)
import numpy as np
# Data
fruit_sales = ['Apple', 'Banana', 'Mango', 'Orange', 'Grapes']
values_2023 = [50, 60, 30, 45, 80]
values_2024 = [40, 70, 50, 55, 90]
# Stacked Bar Chart
plt.bar(fruit_sales, values_2023, color='blue', label='2023')
plt.bar(fruit_sales, values_2024, bottom=values_2023, color='orange', label='2024')
plt.xlabel("Fruits")
plt.ylabel("Total Sales")
plt.title("Stacked Bar Chart Example")
plt.legend()
plt.show()
✅ Displays total sales with breakdowns.
B. Line Charts – Trends Over Time
1. What is a Line Chart?
A line chart (or line graph) is used to show trends over time by connecting data points with a continuous line. It is useful for tracking growth, decline, and patterns in data.
- X-axis (Horizontal): Represents time (days, months, years).
- Y-axis (Vertical): Represents the measured value (temperature, sales, stock prices).
✅ Uses of Line Charts:
- Stock market trends 📈
- Temperature changes 🌡️
- Website traffic over time 📊
- Sales growth 💰
2. Creating a Line Chart Using Matplotlib
Basic Line Chart Example
import matplotlib.pyplot as plt
# Data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [10, 25, 40, 30, 50, 60]
# Create Line Chart
plt.plot(months, sales, marker='o', linestyle='-', color='blue', label="Sales Growth")
# Labels and Title
plt.xlabel("Months")
plt.ylabel("Sales (in $1000s)")
plt.title("Monthly Sales Trend")
plt.legend()
# Show the Graph
plt.show()
✅ Output: A simple line graph showing sales growth over six months.
3. Customizing Line Charts
(i) Changing Line Style, Color, and Markers
plt.plot(months, sales, marker='s', linestyle='dashed', color='red')
Uses square markers and a dashed red line.
(ii) Adding Grid Lines
plt.grid(True) # Enables grid for better readability
(iii) Changing Line Width
plt.plot(months, sales, linewidth=3) # Makes the line thicker
4. Multiple Line Charts (Comparing Two Trends)
# Data
sales_2023 = [10, 25, 40, 30, 50, 60]
sales_2024 = [15, 30, 35, 40, 55, 70]
# Create Multiple Line Chart
plt.plot(months, sales_2023, marker='o', linestyle='-', color='blue', label="2023 Sales")
plt.plot(months, sales_2024, marker='s', linestyle='--', color='green', label="2024 Sales")
# Labels and Title
plt.xlabel("Months")
plt.ylabel("Sales (in $1000s)")
plt.title("Sales Comparison: 2023 vs 2024")
plt.legend()
plt.grid(True)
# Show the Graph
plt.show()
✅ Compares sales for two different years using two lines.
5. Filling Area Under the Line (Shaded Area Chart)
plt.fill_between(months, sales, color='skyblue', alpha=0.3) # Fills area under the line
plt.plot(months, sales, marker='o', color='blue')
plt.xlabel("Months")
plt.ylabel("Sales")
plt.title("Sales Trend with Area Highlight")
plt.show()
✅ Highlights the area under the curve for better visibility.
6. Line Chart with Annotations (Highlighting Key Points)
plt.plot(months, sales, marker='o', color='blue')
plt.annotate("Highest Sales", xy=("May", 50), xytext=("Mar", 55), arrowprops=dict(arrowstyle="->"))
plt.xlabel("Months")
plt.ylabel("Sales")
plt.title("Sales Trend with Annotation")
plt.show()
✅ Adds an annotation pointing to the highest sales month.
C. Scatterplots – Relationship Between Two Variables
1. What is a Scatterplot?
A scatterplot is a graph that uses dots to represent the relationship between two numerical variables. Each point represents an observation (a pair of values). It helps in identifying patterns, trends, correlations, and outliers in data.
- X-axis: Independent variable (e.g., Study Hours).
- Y-axis: Dependent variable (e.g., Exam Score).
✅ Uses of Scatterplots:
- Finding relationships between variables (e.g., study time vs. exam score 📚)
- Detecting clusters in data (e.g., customer segmentation 🏪)
- Identifying outliers (e.g., unusual temperature readings 🌡️)
2. Creating a Scatterplot Using Matplotlib
Basic Scatterplot Example
import matplotlib.pyplot as plt
# Data (Hours Studied vs Exam Scores)
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exam_scores = [50, 55, 65, 70, 75, 80, 85, 88, 90, 95]
# Create Scatterplot
plt.scatter(hours_studied, exam_scores, color='blue', marker='o', label="Students")
# Labels and Title
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score (%)")
plt.title("Study Hours vs Exam Score")
plt.legend()
# Show the Graph
plt.show()
✅ Output: A scatterplot showing the relationship between study hours and exam scores. More study hours → Higher scores (positive correlation)
3. Types of Correlations in Scatterplots
(i) Positive Correlation (↑, ↑)
✔ As one variable increases, the other also increases.
✔ Example: More hours studied → Higher exam score.
(ii) Negative Correlation (↑, ↓)
✔ As one variable increases, the other decreases.
✔ Example: More exercise → Lower weight.
(iii) No Correlation (Random)
✔ No clear relationship between variables.
✔ Example: Shoe size vs. Intelligence.
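The three patterns above can be generated with synthetic data (the values are invented for illustration and the random seed is fixed for reproducibility):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y_pos = 2 * x + rng.normal(0, 2, 50)    # positive correlation: y rises with x
y_neg = -2 * x + rng.normal(0, 2, 50)   # negative correlation: y falls as x rises
y_none = rng.normal(0, 2, 50)           # no correlation: pure noise

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, y, title in zip(axes, [y_pos, y_neg, y_none],
                        ["Positive", "Negative", "No Correlation"]):
    ax.scatter(x, y)
    ax.set_title(title)
plt.show()
```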
4. Customizing Scatterplots
(i) Changing Marker Style, Size, and Color
plt.scatter(hours_studied, exam_scores, color='red', marker='s', s=100)
✅ Uses red square markers with a larger size.
(ii) Adding Grid Lines
plt.grid(True) # Enables grid for better readability
(iii) Adding Annotations (Highlighting Key Points)
plt.scatter(8, 88, color='green', s=200, label="Top Student")
plt.annotate("Top Score", xy=(8, 88), xytext=(5, 85), arrowprops=dict(arrowstyle="->"))
✅ Highlights the top-scoring student.
5. Scatterplot with Multiple Data Categories
# Data for Two Groups (Men vs Women)
men_hours = [1, 2, 3, 4, 5, 6, 7]
men_scores = [50, 55, 60, 65, 70, 75, 80]
women_hours = [2, 3, 4, 5, 6, 7, 8]
women_scores = [55, 60, 65, 70, 75, 80, 85]
# Create Scatterplot
plt.scatter(men_hours, men_scores, color='blue', marker='o', label="Men")
plt.scatter(women_hours, women_scores, color='red', marker='x', label="Women")
# Labels and Title
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score (%)")
plt.title("Study Hours vs Exam Score (Men vs Women)")
plt.legend()
# Show the Graph
plt.show()
✅ Compares study patterns of men and women using different markers and colors.
6. Scatterplot with Trend Line (Best-Fit Line)
Sometimes, we want to see the trend in our data by adding a best-fit line.
import numpy as np
# Fit a Trend Line
m, b = np.polyfit(hours_studied, exam_scores, 1) # Find slope & intercept
plt.scatter(hours_studied, exam_scores, color='blue', label="Students")
plt.plot(hours_studied, np.array(hours_studied) * m + b, color='red', linestyle="dashed", label="Trend Line")
# Labels and Title
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score (%)")
plt.title("Scatterplot with Trend Line")
plt.legend()
plt.show()
✅ Shows the general trend in the data using a red dashed trend line.
IV. Working with Data
In Data Science, data collection is a crucial step. Data can come from files (CSV, JSON, Excel) or web pages (scraping data from websites).
A. Reading Files (CSV, Excel, Text, JSON)
In Data Science, data is often stored in files like CSV, Excel, JSON, or text files. Python provides powerful tools such as pandas, the built-in json module, and openpyxl to read and manipulate these files easily.
- CSV - When data is in tabular format (rows and columns).
- Excel - When data is structured with multiple sheets.
- JSON - When data is in key-value pairs (API responses, configurations).
- Text Files - When data is stored as plain text.
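Each tabular format above maps to a matching pandas reader. A minimal self-contained sketch that writes a tiny table and reads it back (file names like demo.csv are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Rahul", "Priya"], "Age": [25, 28]})

# CSV: tabular rows and columns
df.to_csv("demo.csv", index=False)
print(pd.read_csv("demo.csv"))

# JSON: records as key-value pairs
df.to_json("demo.json", orient="records")
print(pd.read_json("demo.json"))

# Excel works the same way via df.to_excel / pd.read_excel (requires openpyxl)

# Plain text has no table structure, so use built-in file handling
with open("demo.txt", "w") as f:
    f.write("plain text data")
with open("demo.txt") as f:
    print(f.read())
```

Each of these readers is covered in detail in the subsections that follow.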
1. Reading CSV Files
CSV (Comma-Separated Values) is one of the most commonly used file formats in data science. CSV files store data in rows and columns, separated by commas (,).
(i) Reading a CSV File Using Pandas
import pandas as pd
# Read CSV file into a DataFrame
data = pd.read_csv("data.csv")
# Display first 5 rows
print(data.head())
✅ Example CSV File (data.csv)
| Name | Age | Salary |
|---|---|---|
| Rahul | 25 | 50000 |
| Priya | 28 | 60000 |
| Amit | 30 | 70000 |
(ii) Reading a CSV File Without a Header
If the file does not have column names, specify header=None:
data = pd.read_csv("data.csv", header=None)
print(data.head())
✅ Used when the file does not contain column names.
(iii) Select Specific Columns While Reading
data = pd.read_csv("data.csv", usecols=["Name", "Salary"])
print(data.head())
✅ Loads only selected columns.
(iv) Handling Missing Data
data = pd.read_csv("data.csv", na_values=["", "NA", "null"])
print(data.isnull().sum()) # Check missing values
✅ Treats empty cells, "NA", and "null" as missing values (NaN, Not a Number). Empty cells are read as NaN by default; na_values adds extra markers to treat as missing.
(v) Specifying a Different Delimiter (e.g., ; Instead of ,)
data = pd.read_csv("data.csv", delimiter=";")
✅ Use when values are separated by semicolons instead of commas.
2. Reading Excel Files
Excel files (.xlsx) are widely used for data storage and analysis. Excel files store structured data in sheets.
(i) Install OpenPyXL (for Excel support)
pip install openpyxl
(ii) Read an Excel File
data = pd.read_excel("data.xlsx")
print(data.head())
(iii) Read a Specific Sheet from an Excel File
data = pd.read_excel("data.xlsx", sheet_name="SalesData")
print(data.head())
✅ Useful when an Excel file has multiple sheets.
(iv) Save DataFrame to an Excel File
data.to_excel("new_data.xlsx", index=False)
3. Reading JSON Files
JSON (JavaScript Object Notation) is used to store structured data, especially for web applications and APIs. JSON stores data in key-value pairs like Python dictionaries.
(i) Example JSON File (data.json)
{
"name": "Rahul",
"age": 25,
"salary": 50000
}
(ii) Read JSON in Python
import json
with open("data.json", "r") as file:
    data = json.load(file)
print(data)
(iii) Convert JSON to Pandas DataFrame
df = pd.DataFrame([data])
print(df)
4. Reading Text Files
Sometimes, data is stored in plain text files (.txt), which contain unstructured, free-form text.
(i) Read a Text File Line by Line
with open("data.txt", "r") as file:
    for line in file:
        print(line.strip())  # Removes leading/trailing whitespace and the newline
(ii) Read the Whole File at Once
with open("data.txt", "r") as file:
    content = file.read()
print(content)
B. Web Scraping – Extracting Data from Websites
Web Scraping is the process of automatically extracting data from websites. It is useful when we need real-time data from sources like news websites, stock markets, weather updates, and e-commerce sites.
✅ Why Use Web Scraping?
- Collect large amounts of data for market research and analysis 📊
- Extract stock prices, sports scores, weather updates 🌦️
- Automate data collection from websites 🤖
- Analyze news articles, job postings, or reviews 📰
Python Libraries for Web Scraping:
🔹 requests → Fetches web pages.
🔹 BeautifulSoup → Extracts data from HTML.
🔹 Selenium → Automates browsing (for dynamic websites).
1. Web Scraping Using BeautifulSoup
(i) Install Required Libraries
pip install beautifulsoup4 requests
(ii) Import Libraries
import requests
from bs4 import BeautifulSoup
2. Scraping a Website's HTML Content
Let's scrape a simple website and extract information.
url = "https://example.com" # Replace with any website
response = requests.get(url) # Get page content
soup = BeautifulSoup(response.text, "html.parser") # Parse HTML
print(soup.title.text) # Extract page title
✅ Extracts the title of the webpage.
3. Extracting Headings from a Website
headings = soup.find_all("h2") # Find all h2 headings
for h in headings:
    print(h.text)
✅ Outputs all <h2> headings from the webpage.
4. Extracting Links from a Website
links = soup.find_all("a")  # Find all <a> (anchor) tags
for link in links:
    print(link.get("href"))  # Extract the URL from each link
✅ Finds and prints all links on the webpage.
5. Scraping Data from Tables
Useful for extracting stock market data, sports scores, and job listings.
table = soup.find("table") # Find table
rows = table.find_all("tr") # Find all rows
for row in rows:
    columns = row.find_all("td")  # Find all columns in each row
    data = [col.text for col in columns]
    print(data)
✅ Extracts table data row by row.
6. Scraping E-commerce Websites (Example: Product Names & Prices)
url = "https://example-ecommerce.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
products = soup.find_all("div", class_="product") # Find all product containers
for product in products:
    name = product.find("h2").text  # Extract product name
    price = product.find("span", class_="price").text  # Extract price
    print(f"Product: {name} - Price: {price}")
✅ Extracts product names and prices from an e-commerce page.
7. Handling JavaScript-Rendered Websites Using Selenium
Some websites load data dynamically using JavaScript (e.g., infinite-scroll feeds, pages behind a login), so requests + BeautifulSoup see only the initial HTML, not the rendered content. Selenium drives a real browser and can interact with such websites.
(i) Install Selenium & WebDriver
pip install selenium
(ii) Use Selenium to Open a Website
from selenium import webdriver
driver = webdriver.Chrome() # Open Chrome browser
driver.get("https://example.com") # Open website
print(driver.title) # Print page title
driver.quit() # Close browser
✅ Useful for scraping dynamically loaded data.
8. Saving Scraped Data to CSV File
Once we scrape data, we can store it in a CSV file for analysis.
import pandas as pd
data = {
    "Product": ["Laptop", "Smartphone"],
    "Price": ["₹50,000", "₹30,000"]
}
df = pd.DataFrame(data)
df.to_csv("scraped_data.csv", index=False)
✅ Saves scraped product data in a CSV file.
9. Ethics of Web Scraping 🛑
🚨 Important Guidelines:
- Always check the website's robots.txt file before scraping.
- Respect server limits: add delays between requests and never flood a site.
- Do not scrape login-protected, personal, or sensitive data.
- Prefer an official API when one is available; APIs are more reliable and polite than scraping.
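The first two guidelines can be followed in code. A minimal sketch using the standard library's urllib.robotparser and time.sleep; the robots.txt content is given inline here for illustration (normally you would fetch it from the site's /robots.txt URL), and the example.com URLs are placeholders:

```python
import time
from urllib.robotparser import RobotFileParser

# A sample robots.txt (inline for illustration; real scrapers fetch it
# from https://the-site/robots.txt before making any other request)
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask whether our scraper is allowed to fetch a given URL
print(rp.can_fetch("MyScraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyScraper", "https://example.com/private/page"))  # False

# Respect server limits: pause between requests instead of hammering the site
for url in ["https://example.com/a", "https://example.com/b"]:
    if rp.can_fetch("MyScraper", url):
        # ... requests.get(url) and parsing would go here ...
        time.sleep(1)  # polite delay between requests
```

Checking robots.txt first and sleeping between requests keeps a scraper within the rules most sites publish.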