Lesson 3: Data Preprocessing

Data preprocessing is a crucial step in machine learning. It involves cleaning and transforming raw data into a format suitable for modeling.

Data Cleaning

Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in the data, such as handling missing values and removing duplicates.

Examples:

  • Correcting formatting: date fields stored in inconsistent formats (e.g., “2022-01-01” and “01/01/2022”)
  • Fixing typos: “New Yrok” -> “New York”
  • Removing duplicates: multiple rows containing the same customer information
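For instance, a hypothetical City column with a known typo can be standardized with a small replacement mapping (the column name and mapping below are made up for illustration):

import pandas as pd

# Hypothetical column containing a typo
df = pd.DataFrame({'City': ['New York', 'New Yrok', 'Chicago', 'New York']})

# Map known typos to their corrected values
df['City'] = df['City'].replace({'New Yrok': 'New York'})

print(df['City'].unique())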

While cleaning your data, focus on the following areas:

Missing Values:

  • Deletion: Remove rows or columns with missing values (suitable for small datasets or when missing values are minimal).
  • Imputation: Replace missing values with statistical measures (mean, median, mode) or predictive models.

Outliers:

  • Identification: Detect outliers using statistical methods (z-score, IQR) or visualization.
  • Handling: Remove, cap, or treat outliers as separate categories based on domain knowledge and impact on analysis.

Inconsistent Data:

  • Correct inconsistencies in data formats, units, or labels.
  • Standardize data to ensure uniformity.

Duplicates:

  • Identify and remove duplicate records to avoid redundancy.

A quick pandas example of dropping and filling missing values:

import pandas as pd

# Handling missing values
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', None], 'Age': [25, 30, 35, None]})
print(df)

# Drop missing values
df_cleaned = df.dropna()
print(df_cleaned)

# Fill missing values
df_filled = df.fillna('Unknown')
print(df_filled)

A slightly larger example imputes the numeric columns and standardizes the text labels:

import pandas as pd
import numpy as np

# Sample data with missing values and inconsistencies
data = {'Age': [25, np.nan, 30, 45, 28],
        'Income': [50000, 60000, np.nan, 75000, 48000],
        'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)

# Handling missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())              # Imputation with mean
df['Income'] = df['Income'].fillna(df['Income'].median())  # Imputation with median

# Handling inconsistencies (e.g., standardizing city names)
df['City'] = df['City'].str.title()  # Capitalize first letter

Here’s how you can translate these concepts into Python code:

Missing Values

Deletion

import pandas as pd

# Load your dataset
df = pd.read_csv('your_data.csv')

# Drop rows with missing values
df.dropna(inplace=True)

# Alternatively, drop columns with missing values
df.dropna(axis=1, inplace=True)

Imputation

from sklearn.impute import SimpleImputer

# Create an imputer object
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent'

# Fit and transform the numeric columns (returns a NumPy array)
df_imputed = imputer.fit_transform(df.select_dtypes(include='number'))
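The imputation bullet above also mentions predictive models; one simple option along those lines is scikit-learn's KNNImputer, sketched here on a small hypothetical numeric frame (the column names are made up for the example):

import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical numeric data with gaps
df_num = pd.DataFrame({'Age': [25, None, 30, 45, 28],
                       'Income': [50000, 60000, None, 75000, 48000]})

# Each missing value is estimated from the k most similar rows (here k=2)
knn_imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn_imputer.fit_transform(df_num), columns=df_num.columns)

print(df_knn)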

Outliers

Identification

import numpy as np
from scipy import stats

# Calculate z-scores
z_scores = stats.zscore(df['your_column'])

# Identify outliers (e.g., |z-score| > 3)
outliers = df[np.abs(z_scores) > 3]

# Visualize outliers using a boxplot
import matplotlib.pyplot as plt
plt.boxplot(df['your_column'])
plt.show()
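The identification bullet earlier also lists the IQR rule; a minimal sketch, still assuming a numeric column named 'your_column':

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1 = df['your_column'].quantile(0.25)
q3 = df['your_column'].quantile(0.75)
iqr = q3 - q1
iqr_outliers = df[(df['your_column'] < q1 - 1.5 * iqr) | (df['your_column'] > q3 + 1.5 * iqr)]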

Handling

# Remove outliers
df_no_outliers = df.drop(outliers.index)

# Cap outliers at the 1st and 99th percentiles (column-wise)
df_capped = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)

# Treat outliers as a separate indicator column
df['outlier'] = np.where(np.abs(z_scores) > 3, 1, 0)

Inconsistent Data

Correct inconsistencies

# Correct inconsistent data formats
df['date_column'] = pd.to_datetime(df['date_column'])

# Correct inconsistent units or labels
df['unit_column'] = df['unit_column'].str.replace('unit1', 'unit2')

Standardize data

from sklearn.preprocessing import StandardScaler

# Create a scaler object
scaler = StandardScaler()

# Fit and transform the numeric columns (returns a NumPy array)
df_scaled = scaler.fit_transform(df.select_dtypes(include='number'))

Duplicates

Identify duplicates

# Identify duplicate rows
duplicates = df[df.duplicated()]

# Identify duplicate columns
duplicate_cols = df.T[df.T.duplicated()]

Remove duplicates

# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()

# Remove duplicate columns
df_no_duplicate_cols = df.T.drop_duplicates().T

Note that this is not an exhaustive list of methods, and you may need to use additional techniques depending on your specific dataset and problem.

Data Normalization and Standardization

Data Normalization

Data normalization involves scaling numeric data to a common range, usually between 0 and 1, to prevent features with large ranges from dominating the model.

Example:

  • Min-Max Scaling: (x - min) / (max - min)
    • Original data: [1, 2, 3, 4, 5]
    • Normalized data: [0, 0.25, 0.5, 0.75, 1]
  • Z-Score Normalization: (x - mean) / std
    • Original data: [1, 2, 3, 4, 5]
    • Normalized data: [-1.41, -0.71, 0, 0.71, 1.41]

Data Standardization

Data standardization involves transforming data into a standard format, such as converting categorical variables into numerical variables.

Example:

  • One-Hot Encoding: converting categorical variables into binary vectors
    • Original data: [“red”, “blue”, “green”]
    • Standardized data: [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
  • Label Encoding: converting categorical variables into numerical variables
    • Original data: [“red”, “blue”, “green”]
    • Standardized data: [0, 1, 2]
The snippet below applies both scalers to a small numeric array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Create data
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])

# Normalize data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)

# Standardize data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)

# Scale the cleaned 'Age' and 'Income' columns from the earlier example to [0, 1]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])

Handling Categorical Data

Handling categorical data is a crucial step in preparing data for AI and ML models. Here are some common techniques used to handle categorical data:

  1. One-Hot Encoding (OHE): Convert categorical variables into binary vectors, where each category is represented by a 1 or 0.

Example:

Color    One-Hot Encoding
Red      [1, 0, 0]
Green    [0, 1, 0]
Blue     [0, 0, 1]

  2. Label Encoding: Convert categorical variables into numerical variables, where each category is assigned a unique integer.

Example:

Color    Label Encoding
Red      0
Green    1
Blue     2

  3. Ordinal Encoding: Convert categorical variables into numerical variables, where the order of the categories matters.

Example:

Size     Ordinal Encoding
Small    0
Medium   1
Large    2

  4. Binary Encoding: Convert categorical variables into binary strings, where each category is represented by a unique binary sequence.

Example:

Color    Binary Encoding
Red      00
Green    01
Blue     10

  5. Hashing: Convert categorical variables into numerical variables using a hash function.

Example:

Color    Hashing
Red      123456
Green    789012
Blue     345678

  6. Embeddings: Convert categorical variables into dense vectors, where each category is represented by a unique vector.

Example:

Color    Embedding
Red      [0.1, 0.2, 0.3]
Green    [0.4, 0.5, 0.6]
Blue     [0.7, 0.8, 0.9]

  7. Frequency Encoding: Convert categorical variables into numerical variables, where each category is represented by its frequency.

Example:

Color    Frequency Encoding
Red      0.4
Green    0.3
Blue     0.3

Each technique has its advantages and disadvantages, and the choice of technique depends on the specific problem, data, and model being used.

# Create data
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green'], 'Value': [1, 2, 3, 2]})
print(df)

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)

Here are the Python code examples for each of the categorical data handling techniques I mentioned earlier:

1. One-Hot Encoding (OHE)

import pandas as pd

# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# One-hot encode the 'Color' column (dtype=int gives 0/1 columns instead of True/False)
df_onehot = pd.get_dummies(df, columns=['Color'], dtype=int)

print(df_onehot)

Output:

   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0

2. Label Encoding

from sklearn.preprocessing import LabelEncoder

# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# Label encode the 'Color' column (classes get integers in alphabetical order: Blue=0, Green=1, Red=2)
le = LabelEncoder()
df['Color'] = le.fit_transform(df['Color'])

print(df)

Output:

   Color
0      2
1      1
2      0
3      2
4      1

3. Ordinal Encoding

from sklearn.preprocessing import OrdinalEncoder

# Create a sample dataframe
df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Small', 'Medium']})

# Ordinal encode the 'Size' column, passing the intended category order explicitly
oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df[['Size']] = oe.fit_transform(df[['Size']])

print(df)

Output:

   Size
0   0.0
1   1.0
2   2.0
3   0.0
4   1.0

4. Binary Encoding

from category_encoders import BinaryEncoder

# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# Binary encode the 'Color' column (each category becomes a set of binary digit columns)
be = BinaryEncoder(cols=['Color'])
df_binary = be.fit_transform(df)

print(df_binary)

Output:

   Color_0  Color_1
0        0        1
1        1        0
2        1        1
3        0        1
4        1        0

5. Hashing


from sklearn.feature_extraction import FeatureHasher

# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# Hash the 'Color' column into a fixed number of feature columns
fh = FeatureHasher(n_features=4, input_type='string')
hashed = fh.transform(df['Color'].apply(lambda c: [c]))

print(hashed.toarray())

Output: a 5 x 4 array in which each row has a single non-zero entry (+1 or -1) in the column chosen by hashing the color name; identical colors always map to the same column.

6. Embeddings

from sklearn.decomposition import PCA

# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# One-hot encode, then project to dense low-dimensional vectors with PCA
# (in practice, embeddings are usually learned by a model, e.g. a neural-network embedding layer)
onehot = pd.get_dummies(df['Color'])
pca = PCA(n_components=2)
embeddings = pca.fit_transform(onehot)

print(embeddings)

Output: a 5 x 2 array of dense vectors; rows with the same color share the same vector.

7. Frequency Encoding

# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})

# Frequency encode the 'Color' column (each category is replaced by its share of the rows)
df['Color'] = df['Color'].map(df['Color'].value_counts(normalize=True))

print(df)

Output:

   Color
0    0.4
1    0.4
2    0.2
3    0.4
4    0.4

Note that these are just simple examples, and you may need to modify the code to suit your specific use case.

Feature Engineering

Feature engineering involves creating new features from existing ones to improve model performance.

Example:

  • Extracting year from date: “2022-01-01” -> 2022
  • Creating interaction terms: x1 * x2
  • Creating polynomial terms: x^2

These are just a few examples of data preprocessing components. The specific techniques used will depend on the dataset and the problem being solved.

# Create data
df = pd.DataFrame({'Length': [1.5, 2.3, 3.1, 4.7], 'Width': [0.5, 0.8, 1.2, 1.8]})
print(df)

# Create new feature: Area
df['Area'] = df['Length'] * df['Width']
print(df)
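The other examples listed above can be sketched the same way; the date and column names below are hypothetical:

import pandas as pd

# Hypothetical data with a date column and two numeric features
df = pd.DataFrame({'OrderDate': ['2022-01-01', '2023-06-15'],
                   'x1': [2.0, 3.0],
                   'x2': [5.0, 7.0]})

# Extract the year from the date
df['Year'] = pd.to_datetime(df['OrderDate']).dt.year

# Interaction and polynomial terms
df['x1_x2'] = df['x1'] * df['x2']
df['x1_squared'] = df['x1'] ** 2

print(df)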
