Syllabus:
Introduction to Programming Tools for Data Science: Toolkits using Python: Matplotlib, NumPy, Scikit-learn, NLTK, Visualizing Data: Bar Charts, Line Charts, Scatterplots, Working with data: Reading Files, Scraping the Web.
I. Introduction to Programming Tools for Data Science
Data Science requires various programming tools to handle data, analyze it, and create models. Python is the most widely used language in Data Science due to its simplicity, vast libraries, and strong community support.
Why Python for Data Science?
✅ Easy to Learn – Simple syntax, beginner-friendly.
✅ Rich Libraries – Provides powerful libraries like NumPy, Pandas, Matplotlib, etc.
✅ Open-Source & Free – No licensing cost, available for everyone.
✅ Scalability – Used in small scripts as well as large-scale applications.
II. Toolkits Using Python
Python provides various libraries (toolkits) that help in data analysis, visualization, and machine learning. Let’s explore some important ones:
A. Matplotlib – Data Visualization
1. What is Matplotlib?
Matplotlib is a Python library for data visualization. It creates charts, graphs, and plots that reveal trends, patterns, and relationships in data. It is widely used in Data Science, Machine Learning, and AI.
✅ Why Use Matplotlib?
- Easy to use – Simple syntax like MATLAB.
- Highly customizable – Change colors, labels, styles, and sizes.
- Works with other libraries – Compatible with NumPy, Pandas, and Scikit-learn.
- Supports multiple chart types – Line charts, bar charts, scatter plots, histograms, etc.
🔹 Installation:
Use the following command to install Matplotlib: pip install matplotlib
🔹 Importing Matplotlib: import matplotlib.pyplot as plt
pyplot is a module in Matplotlib that provides simple functions to create plots.
2. Basic Plotting with Matplotlib
Matplotlib provides various functions to create different types of plots.
(i) Line Plot – Showing Trends
A line chart is used to display data over time (e.g., stock prices, temperature changes).
import matplotlib.pyplot as plt
# Data
x = [1, 2, 3, 4, 5]
y = [10, 20, 25, 30, 40]
# Plot
plt.plot(x, y, marker='o', linestyle='--', color='r', label="Growth")
plt.title("Simple Line Plot")
plt.xlabel("Time")
plt.ylabel("Value")
plt.legend() # Show legend
plt.grid(True) # Add grid
plt.show()
✅ Customizations:
- marker='o' → Circular markers at data points.
- linestyle='--' → Dashed line style.
- color='r' → Red line color.
(ii) Bar Chart – Comparing Categories
A bar chart is useful for comparing different categories (e.g., sales in different regions).
# Data
categories = ['Apple', 'Banana', 'Mango']
values = [20, 15, 25]
# Plot
plt.bar(categories, values, color=['red', 'yellow', 'orange'])
plt.title("Fruit Sales")
plt.xlabel("Fruit")
plt.ylabel("Sales")
plt.show()
✅ Customizations:
- color → Sets different colors for the bars.
- xlabel, ylabel, title → Labels for better understanding.
(iii) Scatter Plot – Relationship Between Two Variables
A scatter plot is used to show correlations between two numerical values or variables (e.g., height vs weight).
# Data
x = [10, 20, 30, 40, 50]
y = [15, 25, 35, 45, 55]
# Plot
plt.scatter(x, y, color='blue', marker='*', s=100)
plt.title("Scatter Plot Example")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.show()
✅ Customizations:
- marker='*' → Star-shaped markers.
- s=100 → Size of the markers.
(iv) Histogram – Distribution of Data
A histogram shows the distribution of data values (e.g., exam scores of students).
import numpy as np
data = np.random.randn(1000) # Generate 1000 random numbers
# Plot
plt.hist(data, bins=30, color='cyan', edgecolor='black')
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram Example")
plt.show()
✅ Customizations:
- bins=30 → Number of bars in the histogram.
- edgecolor='black' → Black borders for better clarity.
3. Multiple Plots in One Figure
We can display multiple charts in one figure using subplot().
x = [1, 2, 3, 4, 5]
y1 = [5, 10, 15, 20, 25]
y2 = [10, 5, 30, 25, 15]
plt.figure(figsize=(10, 4)) # Set figure size
# First subplot (Line chart)
plt.subplot(1, 2, 1) # (rows, columns, index)
plt.plot(x, y1, color='blue')
plt.title("Line Chart")
# Second subplot (Bar chart)
plt.subplot(1, 2, 2)
plt.bar(x, y2, color='green')
plt.title("Bar Chart")
plt.show()
✅ Explanation:
- subplot(1, 2, 1) → First subplot (1 row, 2 columns, 1st plot).
- subplot(1, 2, 2) → Second subplot (1 row, 2 columns, 2nd plot).
- figsize=(10, 4) → Sets figure size (width=10, height=4).
4. Customizing Matplotlib Charts
You can style your graphs using the following methods:
✔️ Colors (color='red')
✔️ Line styles (linestyle='--')
✔️ Markers (marker='o')
✔️ Legends (plt.legend())
✔️ Gridlines (plt.grid(True))
x = [1, 2, 3, 4, 5]
y = [2, 4, 8, 16, 32]
plt.plot(x, y, marker='o', linestyle='-', color='purple', label="Growth")
plt.xlabel("Time")
plt.ylabel("Value")
plt.title("Line Chart with Grid")
plt.legend() # Adds a label legend
plt.grid(True) # Adds a background grid
plt.show()
✅ Customization Features:
- grid(True) → Adds a grid for better readability.
- legend() → Shows labels for different lines.
- title(), xlabel(), ylabel() → Labels for clarity.
5. Saving Plots as Images
You can save your plots as images for reports and presentations.
plt.plot([1, 2, 3, 4], [10, 20, 30, 40])
plt.title("Saved Chart")
plt.savefig("chart.png") # Saves as PNG file
plt.show()
✅ Formats Available: PNG, JPG, PDF, SVG, etc.
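The output format is inferred from the file extension, and dpi controls the resolution. A minimal sketch (the Agg backend is selected here only so the script also runs without a display; the file names are illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts and servers
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [10, 20, 30, 40])
plt.title("Saved Chart")
plt.savefig("chart.pdf")           # format inferred from the .pdf extension
plt.savefig("chart.png", dpi=300)  # higher resolution, useful for print
```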
6. Interactive Plots with plt.ion()
Matplotlib also supports interactive mode, in which a graph can be updated dynamically.
plt.ion() # Turn on interactive mode
for i in range(5):
    plt.plot([1, 2, 3, 4], [i*10, i*20, i*30, i*40])
    plt.pause(1) # Pause for 1 second
plt.ioff() # Turn off interactive mode
plt.show()
B. NumPy – Numerical Computing
1. What is NumPy?
NumPy (Numerical Python) is a Python library for handling numerical data efficiently. It provides multi-dimensional arrays, mathematical functions, and linear algebra operations for working with large datasets. It is used in Data Science, Machine Learning, AI, and Scientific Computing.
✅ Why Use NumPy?
- Faster than Python lists – Uses C-based optimized functions.
- Supports multi-dimensional arrays – Works like matrices in mathematics.
- Has built-in mathematical functions – Supports mean, sum, min, max, standard deviation, etc.
- Works well with other libraries – Pandas, Matplotlib, Scikit-Learn, etc.
🔹 Installation: To install NumPy, run: pip install numpy
🔹 Importing NumPy: import numpy as np
2. Creating Arrays in NumPy
NumPy uses the ndarray (N-dimensional array) to store data. Arrays are faster and more memory-efficient than Python lists.
(i) Creating a 1D Array (Vector)
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
print("Array:", arr) # Output: [1 2 3 4 5]
print("Type:", type(arr)) # <class 'numpy.ndarray'>
(ii) Creating a 2D Array (Matrix)
matrix = np.array([[1, 2, 3], [4, 5, 6]])
print("2D Array:\n", matrix)
✅ Explanation:
- 1D array → [1, 2, 3, 4, 5] (like a simple list).
- 2D array → Works like a matrix.
3. Array Properties
Once we create an array, we can check its size, shape, and data type.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape:", arr.shape) # (2, 3) → 2 rows, 3 columns
print("Size:", arr.size) # 6 elements
print("Data Type:", arr.dtype) # int64 (depends on input)
4. Creating Special Arrays
(i) Array of Zeros
zeros = np.zeros((2, 3))
print(zeros)
🔹 Output: A 2×3 matrix filled with 0s.
(ii) Array of Ones
ones = np.ones((2, 4))
print(ones)
🔹 Output: A 2×4 matrix filled with 1s.
(iii) Identity Matrix (Square Matrix with Diagonal 1s)
identity = np.eye(4)
print(identity)
(iv) Random Numbers
random_numbers = np.random.rand(3, 3) # 3×3 matrix with random values
print(random_numbers)
5. Accessing Elements in NumPy Arrays
(i) Accessing Elements (Indexing & Slicing)
arr = np.array([10, 20, 30, 40, 50])
print(arr[0]) # First element → 10
print(arr[-1]) # Last element → 50
print(arr[1:4]) # Elements from index 1 to 3 → [20, 30, 40]
(ii) Accessing Elements in 2D Arrays
matrix = np.array([[10, 20, 30], [40, 50, 60]])
print(matrix[1, 2]) # Row 1, Column 2 → 60
6. Mathematical Operations in NumPy
(i) Basic Arithmetic Operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print("Addition:", arr1 + arr2) # [5 7 9]
print("Multiplication:", arr1 * arr2) # [4 10 18]
print("Power:", arr1 ** 2) # [1 4 9]
(ii) Aggregate Functions (Sum, Mean, Min, Max, Std Dev)
arr = np.array([10, 20, 30, 40, 50])
print("Sum:", np.sum(arr))
print("Mean:", np.mean(arr))
print("Min:", np.min(arr))
print("Max:", np.max(arr))
print("Standard Deviation:", np.std(arr))
(iii) Matrix Operations
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("Matrix Multiplication:\n", np.dot(A, B))
print("Transpose:\n", A.T)
✅ Explanation:
- np.dot(A, B) → Performs matrix multiplication.
- A.T → Transpose of the matrix.
7. Reshaping and Resizing Arrays
(i) Reshaping a 1D Array into 2D
arr = np.array([1, 2, 3, 4, 5, 6])
reshaped = arr.reshape(2, 3) # Convert to 2×3 matrix
print(reshaped)
✅ Reshaping allows us to change dimensions of an array without changing data.
(ii) Flattening a 2D Array into 1D
matrix = np.array([[1, 2, 3], [4, 5, 6]])
flattened = matrix.flatten()
print(flattened) # [1 2 3 4 5 6]
8. Filtering and Conditional Selection in NumPy
NumPy allows filtering elements based on conditions.
arr = np.array([10, 20, 30, 40, 50])
print(arr[arr > 25]) # Elements greater than 25 → [30 40 50]
9. Saving and Loading Data in NumPy
(i) Saving NumPy Arrays to Files
arr = np.array([10, 20, 30, 40])
np.save("data.npy", arr) # Saves as .npy file
(ii) Loading Saved Arrays
loaded_arr = np.load("data.npy")
print(loaded_arr)
10. Useful NumPy Functions
a) arange() – Create a Range of Numbers
arr = np.arange(1, 10, 2) # Start=1, Stop=10, Step=2
print(arr) # Output: [1 3 5 7 9]
b) linspace() – Generate Evenly Spaced Numbers
arr = np.linspace(0, 100, 5) # 5 evenly spaced numbers between 0 and 100
print(arr) # Output: [ 0. 25. 50. 75. 100.]
c) reshape() – Change Shape of Array
arr = np.arange(1, 10).reshape(3, 3) # 3x3 Matrix
print(arr)
Output:
[[1 2 3]
[4 5 6]
[7 8 9]]
d) random() – Generate Random Numbers
rand_arr = np.random.rand(3, 3) # 3x3 matrix with random numbers
print(rand_arr)
11. Indexing and Slicing Arrays
NumPy allows fast indexing and slicing, similar to lists.
arr = np.array([10, 20, 30, 40, 50])
print(arr[1]) # Output: 20
print(arr[1:4]) # Output: [20 30 40]
For 2D arrays (matrices):
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(matrix[1, 2]) # Row index 1, Column index 2 → Output: 6
12. Statistical Operations
NumPy makes statistical analysis easy.
arr = np.array([10, 20, 30, 40, 50])
print(np.mean(arr)) # Output: 30.0 (Mean)
print(np.median(arr)) # Output: 30.0 (Median)
print(np.std(arr)) # Standard Deviation
print(np.sum(arr)) # Sum of all elements
print(np.max(arr)) # Maximum value
print(np.min(arr)) # Minimum value
13. Matrix Operations (Linear Algebra)
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print(np.dot(A, B)) # Matrix multiplication
print(np.linalg.inv(A)) # Inverse of matrix A
print(np.linalg.det(A)) # Determinant of A
C. Scikit-Learn – Machine Learning Toolkit
1. What is Scikit-Learn?
Scikit-Learn (sklearn) is a Python library for Machine Learning (ML). It provides simple and efficient tools for data mining and analysis. It is built on NumPy, SciPy, and Matplotlib. It is used for classification, regression, clustering, dimensionality reduction, and preprocessing.
✅ Why Use Scikit-Learn?
- Easy to use – Simple functions for ML models.
- Supports multiple ML algorithms – Regression, Classification, Clustering, etc.
- Works well with NumPy & Pandas – Efficient data handling.
- Has built-in functions for data preprocessing – Scaling, encoding, splitting, etc.
🔹 Installation: To install Scikit-Learn, run: pip install scikit-learn
🔹 Importing Scikit-Learn: import sklearn
2. Machine Learning Workflow Using Scikit-Learn
Scikit-Learn follows a 5-step ML workflow:
1️⃣ Import Dataset – Load and explore the dataset.
2️⃣ Preprocess Data – Handle missing values, scale data, encode categorical features.
3️⃣ Split Dataset – Divide data into training and testing sets.
4️⃣ Train Model – Use ML algorithms to learn patterns.
5️⃣ Evaluate Model – Measure accuracy and performance.
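The five steps above can be sketched end to end on the built-in Iris dataset (the choice of LogisticRegression as the model is just an illustration, not part of the workflow itself):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 1. Import dataset
X, y = load_iris(return_X_y=True)

# 2. Preprocess: scale features to mean 0, variance 1
X = StandardScaler().fit_transform(X)

# 3. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 4. Train model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# 5. Evaluate
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```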
3. Loading Datasets in Scikit-Learn
Scikit-Learn provides built-in datasets like Iris, Titanic, Wine Quality, etc.
🔹 Example: Loading the Iris Dataset
from sklearn.datasets import load_iris
iris = load_iris()
print("Feature Names:", iris.feature_names)
print("Target Names:", iris.target_names)
print("First 5 Rows of Data:\n", iris.data[:5])
✅ Explanation:
- iris.feature_names → Names of the features in the dataset.
- iris.target_names → Categories to predict.
- iris.data[:5] → First five rows of data.
4. Supervised Learning in Scikit-Learn
a) Linear Regression (Predicting Continuous Values)
Used for predicting numerical values like house prices or stock prices.
🔹 Example: Predict House Prices
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
# Data (Square feet vs. Price in lakhs)
X = np.array([500, 1000, 1500, 2000, 2500]).reshape(-1, 1)
y = np.array([50, 100, 150, 200, 250])
# Create and train model
model = LinearRegression()
model.fit(X, y)
# Predict price for 1800 sq.ft
predicted_price = model.predict([[1800]])
print(f"Predicted Price: {predicted_price[0]} Lakhs")
# Plot
plt.scatter(X, y, color='blue')
plt.plot(X, model.predict(X), color='red')
plt.xlabel("Square Feet")
plt.ylabel("Price (Lakhs)")
plt.title("Linear Regression - House Price Prediction")
plt.show()
✅ Output: Predicted Price: 180.0 Lakhs (approximately; floating-point rounding may print 179.99…)
b) Logistic Regression (Binary Classification - Yes/No Prediction)
Used for classifying data into categories (e.g., pass/fail, spam/not spam).
🔹 Example: Predict Whether a Student Will Pass or Fail
from sklearn.linear_model import LogisticRegression
# Data: [Hours Studied], 1 = Pass, 0 = Fail
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
# Train Model
model = LogisticRegression()
model.fit(X, y)
# Predict outcome for a student who studied 4.5 hours
predicted_outcome = model.predict([[4.5]])
print(f"Predicted Outcome: {'Pass' if predicted_outcome[0] == 1 else 'Fail'}")
✅ Output: Predicted Outcome: Pass
5. Unsupervised Learning in Scikit-Learn
a) K-Means Clustering (Grouping Similar Data)
Used for grouping similar data points without labels (e.g., customer segmentation).
🔹 Example: Clustering Customers Based on Spending Patterns
from sklearn.cluster import KMeans
# Data: [Annual Income, Spending Score]
data = np.array([[25, 30], [30, 35], [35, 40], [45, 60], [50, 65], [55, 70]])
# Create Model
kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
# Get cluster labels
labels = kmeans.labels_
print("Cluster Labels:", labels)
✅ Output: Cluster Labels: [0 0 0 1 1 1] (cluster numbering is arbitrary, so you may equally see [1 1 1 0 0 0])
6. Data Preprocessing in Scikit-Learn
Before training a model, we need to clean and prepare the data.
(i) Handling Missing Values
from sklearn.impute import SimpleImputer
import numpy as np
data = np.array([[10, 20, np.nan], [30, np.nan, 50], [np.nan, 40, 60]])
imputer = SimpleImputer(strategy='mean') # Replace NaN with column mean
cleaned_data = imputer.fit_transform(data)
print("Cleaned Data:\n", cleaned_data)
✅ Replaces missing values (NaN) with column mean.
Cleaned Data:
[[10. 20. 55.]
[30. 30. 50.]
[20. 40. 60.]]
(ii) Scaling Data (Standardization & Normalization)
Scaling makes sure all features have the same scale.
from sklearn.preprocessing import StandardScaler
data = np.array([[100, 10], [200, 20], [300, 30]])
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
print("Scaled Data:\n", scaled_data)
✅ StandardScaler() converts values to mean = 0 and variance = 1.
Scaled Data:
[[-1.22474487 -1.22474487]
[ 0. 0. ]
[ 1.22474487 1.22474487]]
(iii) Encoding Categorical Data
If the dataset has text values, we need to convert them into numbers.
from sklearn.preprocessing import LabelEncoder
labels = ['cat', 'dog', 'fish', 'dog', 'cat']
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
print("Encoded Labels:", encoded_labels)
✅ Converts text categories ('cat', 'dog') into numbers (0, 1, 2, etc.).
Encoded Labels: [0 1 2 1 0]
7. Splitting Data into Training and Testing Sets
We split the dataset into two parts:
- Training Set (80%) → Used to train the model.
- Testing Set (20%) → Used to evaluate the model’s performance.
from sklearn.model_selection import train_test_split
X = iris.data # Features
y = iris.target # Labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)
✅ Splits data into 80% training and 20% testing.
8. Model Evaluation in Scikit-Learn
After training the model, we measure its accuracy and performance.
(i) Accuracy Score (For Classification Models)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train) # Train a classifier on the Iris split from Section 7
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
✅ Shows how many predictions were correct (0 to 1 scale).
(ii) Mean Squared Error (For Regression Models)
from sklearn.metrics import mean_squared_error
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
✅ Measures the difference between actual and predicted values.
9. Saving & Loading Trained Models
We can save the trained model for future use.
import joblib
joblib.dump(model, "saved_model.pkl") # Save model
loaded_model = joblib.load("saved_model.pkl") # Load model
print("Prediction using Loaded Model:", loaded_model.predict([[6]]))
✅ Saves and loads the trained ML model.
D. NLTK – Natural Language Toolkit for Text Processing
1. What is NLTK?
NLTK (Natural Language Toolkit) is a Python library for Natural Language Processing (NLP). It helps in analyzing, processing, and understanding human language (text data). It is widely used in text processing tasks like chatbots, search engines, and AI assistants.
✅ Why Use NLTK?
- Tokenization – Breaking text into words/sentences.
- Stopword Removal – Removing unnecessary words (e.g., "the", "is").
- Stemming & Lemmatization – Converting words to their base form.
- POS Tagging – Identifying parts of speech (noun, verb, adjective, etc.).
- Sentiment Analysis – Determining positive/negative sentiment in text.
🔹 Installation: To install NLTK, run: pip install nltk
🔹 Importing NLTK:
import nltk
nltk.download('all') # Download all necessary datasets (optional)
This downloads corpora, stopwords, and tokenizers used for NLP tasks.
2. Tokenization (Splitting Text into Words/Sentences)
Tokenization means breaking text into words or sentences.
a) Word Tokenization (Splitting into Words)
import nltk
from nltk.tokenize import word_tokenize
text = "Natural Language Processing is amazing!"
tokens = word_tokenize(text)
print(tokens)
# Output: ['Natural', 'Language', 'Processing', 'is', 'amazing', '!']
b) Sentence Tokenization (Splitting into Sentences)
from nltk.tokenize import sent_tokenize
text = "I love Python. NLP is very interesting!"
sentences = sent_tokenize(text)
print(sentences)
# Output: ['I love Python.', 'NLP is very interesting!']
3. Removing Stopwords (Unimportant Words like "is", "the", "and")
Stopwords are common words like "is", "the", "and", which don’t add much meaning to the sentence.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
text = "This is a simple example of stopword removal."
words = word_tokenize(text)
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print(filtered_words)
# Output: ['simple', 'example', 'stopword', 'removal', '.']
4. Stemming and Lemmatization
(i) Stemming - Reducing Words to Root Form
Stemming converts words to their root form, removing suffixes.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ["running", "flies", "easily", "studying"]
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
# Output: ['run', 'fli', 'easili', 'studi']
🔹 Issues with Stemming:
- "flies" → "fli" (incorrect)
- "easily" → "easili" (not meaningful)
(ii) Lemmatization – Getting Meaningful Root Words
Lemmatization converts words to their meaningful base form using a vocabulary (WordNet). Unlike stemming, it returns real words, so it generally gives better results.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
words = ["running", "flies", "easily", "studying"]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
# Output: ['running', 'fly', 'easily', 'studying']
🔹 Lemmatization produces proper words (e.g., "flies" → "fly").
5. Part-of-Speech (POS) Tagging
NLTK can tag each word with its part of speech (noun, verb, adjective, etc.).
from nltk import pos_tag
from nltk.tokenize import word_tokenize
text = "John is playing football."
words = word_tokenize(text)
pos_tags = pos_tag(words)
print(pos_tags)
# Output: [('John', 'NNP'), ('is', 'VBZ'), ('playing', 'VBG'), ('football', 'NN'), ('.', '.')]
🔹 POS Tags Explanation:
- NNP → Proper Noun
- VBZ → Verb (is)
- VBG → Verb (playing)
- NN → Noun (football)
6. Named Entity Recognition (NER) – Identifying Proper Names
Finds names of people, places, dates, etc.
from nltk import ne_chunk
text = "Elon Musk founded SpaceX in the United States."
words = word_tokenize(text)
pos_tags = pos_tag(words)
ner_tree = ne_chunk(pos_tags)
print(ner_tree)
This identifies "Elon Musk" as a person, "SpaceX" as an organization, and "United States" as a location.
7. Sentiment Analysis Using NLTK – Checking Positive or Negative Sentiment
Determines if text is positive, negative, or neutral.
Scores:
- pos → Positive sentiment
- neg → Negative sentiment
- neu → Neutral sentiment
- compound → Overall sentiment score
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
text = "I love machine learning. It is so exciting!"
score = sia.polarity_scores(text)
print(score)
Sample output (exact scores may vary slightly by NLTK version):
{'neg': 0.0, 'neu': 0.2, 'pos': 0.8, 'compound': 0.75}
Here the positive score is high (0.8), meaning the text is positive.
8. Text Preprocessing Pipeline (Complete Example)
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
text = "NLTK is a great tool for text processing! It helps in tokenization, stemming, and sentiment analysis."
# Step 1: Tokenization
words = word_tokenize(text)
# Step 2: Removing Stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
# Step 3: Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
# Step 4: POS Tagging
pos_tags = pos_tag(lemmatized_words)
print("Lemmatized Words:", lemmatized_words)
print("POS Tags:", pos_tags)
III. Visualizing Data (Charts & Graphs)
Visualization is a key step in data analysis. It helps us understand patterns, trends, and relationships in data. Python provides various libraries for data visualization, and Matplotlib is one of the most popular ones.
A. Bar Charts – Comparing Categories
1. What is a Bar Chart?
A bar chart (or bar graph) represents categorical data with rectangular bars. The height of each bar represents the frequency, value, or quantity of its category.
- X-axis (Horizontal) → Categories (e.g., Products, Countries, Days).
- Y-axis (Vertical) → Numerical values (e.g., Sales, Population, Temperature).
✅ Uses of Bar Charts:
- Comparing different categories easily
- Showing trends or patterns in data
- Useful for business analytics, survey results, and market research
2. Creating a Simple Bar Chart Using Matplotlib
Basic Bar Chart Example
import matplotlib.pyplot as plt
# Data
categories = ['Apple', 'Banana', 'Mango', 'Orange', 'Grapes']
values = [40, 70, 30, 55, 90]
# Create Bar Chart
plt.bar(categories, values, color='skyblue')
# Labels and Title
plt.xlabel("Fruits")
plt.ylabel("Quantity Sold")
plt.title("Fruit Sales Data")
# Show the Graph
plt.show()
3. Customizing Bar Charts
(i) Changing Bar Colors
plt.bar(categories, values, color=['red', 'yellow', 'green', 'orange', 'purple'])
✅ Each bar gets a different color.
(ii) Adding Edge Color and Width
plt.bar(categories, values, color='skyblue', edgecolor='black', linewidth=2)
✅ Adds a black outline to bars.
(iii) Rotating X-Axis Labels
plt.xticks(rotation=45) # Rotates labels 45 degrees
4. Horizontal Bar Chart
Instead of a vertical bar chart, we can use a horizontal bar chart using barh().
import matplotlib.pyplot as plt
# Data
categories = ['Apple', 'Banana', 'Mango', 'Orange', 'Grapes']
values = [40, 70, 30, 55, 90]
# Create Horizontal Bar Chart
plt.barh(categories, values, color='lightgreen')
plt.xlabel("Quantity Sold")
plt.ylabel("Fruits")
plt.title("Fruit Sales Data (Horizontal)")
plt.show()
✅ Useful when category labels are long.
5. Grouped Bar Chart (Comparing Multiple Categories)
A grouped bar chart compares different categories side by side.
import numpy as np
# Data
fruit_sales = ['Apple', 'Banana', 'Mango', 'Orange', 'Grapes']
values_2023 = [50, 60, 30, 45, 80]
values_2024 = [40, 70, 50, 55, 90]
# Positioning bars side by side
x = np.arange(len(fruit_sales))
plt.bar(x - 0.2, values_2023, width=0.4, label='2023', color='blue')
plt.bar(x + 0.2, values_2024, width=0.4, label='2024', color='orange')
# Labels and title
plt.xticks(x, fruit_sales)
plt.xlabel("Fruits")
plt.ylabel("Quantity Sold")
plt.title("Fruit Sales Comparison")
plt.legend()
plt.show()
✅ Compares sales for two different years.
6. Stacked Bar Chart (Showing Parts of a Whole)
import numpy as np
# Data
fruit_sales = ['Apple', 'Banana', 'Mango', 'Orange', 'Grapes']
values_2023 = [50, 60, 30, 45, 80]
values_2024 = [40, 70, 50, 55, 90]
# Stacked Bar Chart
plt.bar(fruit_sales, values_2023, color='blue', label='2023')
plt.bar(fruit_sales, values_2024, bottom=values_2023, color='orange', label='2024')
plt.xlabel("Fruits")
plt.ylabel("Total Sales")
plt.title("Stacked Bar Chart Example")
plt.legend()
plt.show()
✅ Displays total sales with breakdowns.
B. Line Charts – Trends Over Time
1. What is a Line Chart?
A line chart (or line graph) is used to show trends over time by connecting data points with a continuous line. It is useful for tracking growth, decline, and patterns in data.
- X-axis (Horizontal): Represents time (days, months, years).
- Y-axis (Vertical): Represents the measured value (temperature, sales, stock prices).
✅ Uses of Line Charts:
- Stock market trends 📈
- Temperature changes 🌡️
- Website traffic over time 📊
- Sales growth 💰
2. Creating a Line Chart Using Matplotlib
Basic Line Chart Example
import matplotlib.pyplot as plt
# Data
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
sales = [10, 25, 40, 30, 50, 60]
# Create Line Chart
plt.plot(months, sales, marker='o', linestyle='-', color='blue', label="Sales Growth")
# Labels and Title
plt.xlabel("Months")
plt.ylabel("Sales (in $1000s)")
plt.title("Monthly Sales Trend")
plt.legend()
# Show the Graph
plt.show()
✅ Output: A simple line graph showing sales growth over six months.
3. Customizing Line Charts
(i) Changing Line Style, Color, and Markers
plt.plot(months, sales, marker='s', linestyle='dashed', color='red')
Uses square markers and a dashed red line.
(ii) Adding Grid Lines
plt.grid(True) # Enables grid for better readability
(iii) Changing Line Width
plt.plot(months, sales, linewidth=3) # Makes the line thicker
4. Multiple Line Charts (Comparing Two Trends)
# Data
sales_2023 = [10, 25, 40, 30, 50, 60]
sales_2024 = [15, 30, 35, 40, 55, 70]
# Create Multiple Line Chart
plt.plot(months, sales_2023, marker='o', linestyle='-', color='blue', label="2023 Sales")
plt.plot(months, sales_2024, marker='s', linestyle='--', color='green', label="2024 Sales")
# Labels and Title
plt.xlabel("Months")
plt.ylabel("Sales (in $1000s)")
plt.title("Sales Comparison: 2023 vs 2024")
plt.legend()
plt.grid(True)
# Show the Graph
plt.show()
✅ Compares sales for two different years using two lines.
5. Filling Area Under the Line (Shaded Area Chart)
plt.fill_between(months, sales, color='skyblue', alpha=0.3) # Fills area under the line
plt.plot(months, sales, marker='o', color='blue')
plt.xlabel("Months")
plt.ylabel("Sales")
plt.title("Sales Trend with Area Highlight")
plt.show()
✅ Highlights the area under the curve for better visibility.
6. Line Chart with Annotations (Highlighting Key Points)
plt.plot(months, sales, marker='o', color='blue')
plt.annotate("Highest Sales", xy=("May", 50), xytext=("Mar", 55), arrowprops=dict(arrowstyle="->"))
plt.xlabel("Months")
plt.ylabel("Sales")
plt.title("Sales Trend with Annotation")
plt.show()
✅ Adds an annotation pointing to the highest sales month.
C. Scatterplots – Relationship Between Two Variables
1. What is a Scatterplot?
A scatterplot is a graph that uses dots to represent the relationship between two numerical variables. Each point represents an observation (a pair of values). It helps in identifying patterns, trends, correlations, and outliers in data.
- X-axis: Independent variable (e.g., Study Hours).
- Y-axis: Dependent variable (e.g., Exam Score).
✅ Uses of Scatterplots:
- Finding relationships between variables (e.g., study time vs. exam score 📚)
- Detecting clusters in data (e.g., customer segmentation 🏪)
- Identifying outliers (e.g., unusual temperature readings 🌡️)
2. Creating a Scatterplot Using Matplotlib
Basic Scatterplot Example
import matplotlib.pyplot as plt
# Data (Hours Studied vs Exam Scores)
hours_studied = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
exam_scores = [50, 55, 65, 70, 75, 80, 85, 88, 90, 95]
# Create Scatterplot
plt.scatter(hours_studied, exam_scores, color='blue', marker='o', label="Students")
# Labels and Title
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score (%)")
plt.title("Study Hours vs Exam Score")
plt.legend()
# Show the Graph
plt.show()
✅ Output: A scatterplot showing the relationship between study hours and exam scores. More study hours → Higher scores (positive correlation)
3. Types of Correlations in Scatterplots
(i) Positive Correlation (↑, ↑)
✔ As one variable increases, the other also increases.
✔ Example: More hours studied → Higher exam score.
(ii) Negative Correlation (↑, ↓)
✔ As one variable increases, the other decreases.
✔ Example: More exercise → Lower weight.
(iii) No Correlation (Random)
✔ No clear relationship between variables.
✔ Example: Shoe size vs. Intelligence.
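The three patterns above can be generated with synthetic data (the values are invented for illustration and the random seed is fixed for reproducibility):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y_pos = 2 * x + rng.normal(0, 2, 50)    # positive correlation: y rises with x
y_neg = -2 * x + rng.normal(0, 2, 50)   # negative correlation: y falls as x rises
y_none = rng.normal(0, 2, 50)           # no correlation: pure noise

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, y, title in zip(axes, [y_pos, y_neg, y_none],
                        ["Positive", "Negative", "No Correlation"]):
    ax.scatter(x, y)
    ax.set_title(title)
plt.show()
```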
4. Customizing Scatterplots
(i) Changing Marker Style, Size, and Color
plt.scatter(hours_studied, exam_scores, color='red', marker='s', s=100)
✅ Uses red square markers with a larger size.
(ii) Adding Grid Lines
plt.grid(True) # Enables grid for better readability
(iii) Adding Annotations (Highlighting Key Points)
plt.scatter(8, 88, color='green', s=200, label="Top Student")
plt.annotate("Top Score", xy=(8, 88), xytext=(5, 85), arrowprops=dict(arrowstyle="->"))
✅ Highlights the top-scoring student.
5. Scatterplot with Multiple Data Categories
# Data for Two Groups (Men vs Women)
men_hours = [1, 2, 3, 4, 5, 6, 7]
men_scores = [50, 55, 60, 65, 70, 75, 80]
women_hours = [2, 3, 4, 5, 6, 7, 8]
women_scores = [55, 60, 65, 70, 75, 80, 85]
# Create Scatterplot
plt.scatter(men_hours, men_scores, color='blue', marker='o', label="Men")
plt.scatter(women_hours, women_scores, color='red', marker='x', label="Women")
# Labels and Title
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score (%)")
plt.title("Study Hours vs Exam Score (Men vs Women)")
plt.legend()
# Show the Graph
plt.show()
✅ Compares study patterns of men and women using different markers and colors.
6. Scatterplot with Trend Line (Best-Fit Line)
Sometimes, we want to see the trend in our data by adding a best-fit line.
import numpy as np
# Fit a Trend Line
m, b = np.polyfit(hours_studied, exam_scores, 1) # Find slope & intercept
plt.scatter(hours_studied, exam_scores, color='blue', label="Students")
plt.plot(hours_studied, np.array(hours_studied) * m + b, color='red', linestyle="dashed", label="Trend Line")
# Labels and Title
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score (%)")
plt.title("Scatterplot with Trend Line")
plt.legend()
plt.show()
✅ Shows the general trend in the data using a red dashed trend line.
IV. Working with Data
In Data Science, data collection is a crucial step. Data can come from files (CSV, JSON, Excel) or web pages (scraping data from websites).
A. Reading Files (CSV, Excel, Text, JSON)
In Data Science, data is often stored in files like CSV, Excel, JSON, or text files. Python provides powerful tools such as pandas, the built-in json module, and openpyxl to read and manipulate these files easily.
- CSV - When data is in tabular format (rows and columns).
- Excel - When data is structured with multiple sheets.
- JSON - When data is in key-value pairs (API responses, configurations).
- Text Files - When data is stored as plain text.
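Each tabular format above maps to a matching pandas reader. A minimal self-contained sketch that writes a tiny table and reads it back (file names like demo.csv are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Rahul", "Priya"], "Age": [25, 28]})

# CSV: tabular rows and columns
df.to_csv("demo.csv", index=False)
print(pd.read_csv("demo.csv"))

# JSON: records as key-value pairs
df.to_json("demo.json", orient="records")
print(pd.read_json("demo.json"))

# Excel works the same way via df.to_excel / pd.read_excel (requires openpyxl)

# Plain text has no table structure, so use built-in file handling
with open("demo.txt", "w") as f:
    f.write("plain text data")
with open("demo.txt") as f:
    print(f.read())
```

Each of these readers is covered in detail in the subsections that follow.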
1. Reading CSV Files
CSV (Comma-Separated Values) is one of the most commonly used file formats in data science. CSV files store data in rows and columns, separated by commas (,).
(i) Reading a CSV File Using Pandas
import pandas as pd
# Read CSV file into a DataFrame
data = pd.read_csv("data.csv")
# Display first 5 rows
print(data.head())
✅ Example CSV File (data.csv)
| Name | Age | Salary |
|---|---|---|
| Rahul | 25 | 50000 |
| Priya | 28 | 60000 |
| Amit | 30 | 70000 |
(ii) Reading a CSV File Without a Header
If the file does not have column names, specify header=None:
data = pd.read_csv("data.csv", header=None)
print(data.head())
✅ Used when the file does not contain column names.
(iii) Select Specific Columns While Reading
data = pd.read_csv("data.csv", usecols=["Name", "Salary"])
print(data.head())
✅ Loads only selected columns.
(iv) Handling Missing Data
data = pd.read_csv("data.csv", na_values=["", "NA", "null"])
print(data.isnull().sum()) # Check missing values
✅ Treats empty cells, "NA", and "null" as missing values (NaN, Not a Number). Empty cells are read as NaN by default; na_values adds extra markers to treat as missing.
(v) Specifying a Different Delimiter (e.g., ; Instead of ,)
data = pd.read_csv("data.csv", delimiter=";")
✅ Use when values are separated by semicolons instead of commas.
2. Reading Excel Files
Excel files (.xlsx) are widely used for data storage and analysis. Excel files store structured data in sheets.
(i) Install OpenPyXL (for Excel support)
pip install openpyxl
(ii) Read an Excel File
data = pd.read_excel("data.xlsx")
print(data.head())
(iii) Read a Specific Sheet from an Excel File
data = pd.read_excel("data.xlsx", sheet_name="SalesData")
print(data.head())
✅ Useful when an Excel file has multiple sheets.
(iv) Save DataFrame to an Excel File
data.to_excel("new_data.xlsx", index=False)
3. Reading JSON Files
JSON (JavaScript Object Notation) is used to store structured data, especially for web applications and APIs. JSON stores data in key-value pairs like Python dictionaries.
(i) Example JSON File (data.json)
{
"name": "Rahul",
"age": 25,
"salary": 50000
}
(ii) Read JSON in Python
import json
with open("data.json", "r") as file:
    data = json.load(file)
print(data)
(iii) Convert JSON to Pandas DataFrame
df = pd.DataFrame([data])
print(df)
4. Reading Text Files
Sometimes, data is stored in plain text files (.txt), which contain unstructured, free-form text.
(i) Read a Text File Line by Line
with open("data.txt", "r") as file:
    for line in file:
        print(line.strip())  # Removes leading/trailing whitespace and the newline
(ii) Read the Whole File at Once
with open("data.txt", "r") as file:
    content = file.read()
print(content)
B. Web Scraping – Extracting Data from Websites
Web Scraping is the process of automatically extracting data from websites. It is useful when we need real-time data from sources like news websites, stock markets, weather updates, and e-commerce sites.
✅ Why Use Web Scraping?
- Collect large amounts of data for market research and analysis 📊
- Extract stock prices, sports scores, weather updates 🌦️
- Automate data collection from websites 🤖
- Analyze news articles, job postings, or reviews 📰
Python Libraries for Web Scraping:
🔹 requests → Fetches web pages.
🔹 BeautifulSoup → Extracts data from HTML.
🔹 Selenium → Automates browsing (for dynamic websites).
1. Web Scraping Using BeautifulSoup
(i) Install Required Libraries
pip install beautifulsoup4 requests
(ii) Import Libraries
import requests
from bs4 import BeautifulSoup
2. Scraping a Website's HTML Content
Let's scrape a simple website and extract information.
url = "https://example.com" # Replace with any website
response = requests.get(url) # Get page content
soup = BeautifulSoup(response.text, "html.parser") # Parse HTML
print(soup.title.text) # Extract page title
✅ Extracts the title of the webpage.
3. Extracting Headings from a Website
headings = soup.find_all("h2") # Find all h2 headings
for h in headings:
    print(h.text)
✅ Outputs all <h2> headings from the webpage.
4. Extracting Links from a Website
links = soup.find_all("a")  # Find all <a> (anchor) tags
for link in links:
    print(link.get("href"))  # Extract the URL from each link
✅ Finds and prints all links on the webpage.
5. Scraping Data from Tables
Useful for extracting stock market data, sports scores, and job listings.
table = soup.find("table") # Find table
rows = table.find_all("tr") # Find all rows
for row in rows:
    columns = row.find_all("td")  # Find all columns in each row
    data = [col.text for col in columns]
    print(data)
✅ Extracts table data row by row.
6. Scraping E-commerce Websites (Example: Product Names & Prices)
url = "https://example-ecommerce.com"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
products = soup.find_all("div", class_="product") # Find all product containers
for product in products:
    name = product.find("h2").text  # Extract product name
    price = product.find("span", class_="price").text  # Extract price
    print(f"Product: {name} - Price: {price}")
✅ Extracts product names and prices from an e-commerce page.
7. Handling JavaScript-Rendered Websites Using Selenium
Some websites load data dynamically using JavaScript (e.g., infinite-scroll feeds, pages behind a login), so requests + BeautifulSoup see only the initial HTML, not the rendered content. Selenium drives a real browser and can interact with such websites.
(i) Install Selenium & WebDriver
pip install selenium
(ii) Use Selenium to Open a Website
from selenium import webdriver
driver = webdriver.Chrome() # Open Chrome browser
driver.get("https://example.com") # Open website
print(driver.title) # Print page title
driver.quit() # Close browser
✅ Useful for scraping dynamically loaded data.
8. Saving Scraped Data to CSV File
Once we scrape data, we can store it in a CSV file for analysis.
import pandas as pd
data = {
    "Product": ["Laptop", "Smartphone"],
    "Price": ["₹50,000", "₹30,000"]
}
df = pd.DataFrame(data)
df.to_csv("scraped_data.csv", index=False)
✅ Saves scraped product data in a CSV file.
9. Ethics of Web Scraping 🛑
🚨 Important Guidelines:
- Always check the website's robots.txt file before scraping.
- Respect server limits: add delays between requests and never flood a site.
- Do not scrape login-protected, personal, or sensitive data.
- Prefer an official API when one is available; APIs are more reliable and polite than scraping.
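The first two guidelines can be followed in code. A minimal sketch using the standard library's urllib.robotparser and time.sleep; the robots.txt content is given inline here for illustration (normally you would fetch it from the site's /robots.txt URL), and the example.com URLs are placeholders:

```python
import time
from urllib.robotparser import RobotFileParser

# A sample robots.txt (inline for illustration; real scrapers fetch it
# from https://the-site/robots.txt before making any other request)
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Ask whether our scraper is allowed to fetch a given URL
print(rp.can_fetch("MyScraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyScraper", "https://example.com/private/page"))  # False

# Respect server limits: pause between requests instead of hammering the site
for url in ["https://example.com/a", "https://example.com/b"]:
    if rp.can_fetch("MyScraper", url):
        # ... requests.get(url) and parsing would go here ...
        time.sleep(1)  # polite delay between requests
```

Checking robots.txt first and sleeping between requests keeps a scraper within the rules most sites publish.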