Data preprocessing is a crucial step in machine learning. It involves cleaning and transforming raw data into a format suitable for modeling.
Data Cleaning
Data cleaning involves identifying and correcting errors, inconsistencies, and inaccuracies in the data, for example by handling missing values and removing duplicates.
Examples:
- Correcting formatting: date fields in inconsistent formats (e.g., “2022-01-01” and “01/01/2022”).
- Handling typos: “New Yrok” -> “New York”.
- Removing duplicates: multiple rows with the same customer information.
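The formatting and typo fixes above can be sketched in plain Python. This is a minimal illustration, not a general solution: the list of date formats and the typo lookup table are assumptions made for the example, and “01/01/2022” is assumed to be month-first.

```python
from datetime import datetime

# Hypothetical: the date formats this dataset is assumed to contain
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y"]

# Hypothetical lookup table of known misspellings
TYPO_FIXES = {"New Yrok": "New York"}


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    return None


def fix_typo(city):
    """Replace a known misspelling; leave other values unchanged."""
    return TYPO_FIXES.get(city, city)


dates = [normalize_date(d) for d in ["2022-01-01", "01/01/2022"]]
cities = [fix_typo(c) for c in ["New Yrok", "New York"]]
print(dates)
print(cities)
```

Returning None for unparseable dates keeps bad values visible so they can be handled as missing values later, instead of being silently guessed.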
The main issues to focus on while cleaning your data are:
Missing Values:
- Deletion: Remove rows or columns with missing values (suitable for small datasets or when missing values are minimal).
- Imputation: Replace missing values with statistical measures (mean, median, mode) or predictive models.
Outliers:
- Identification: Detect outliers using statistical methods (z-score, IQR) or visualization.
- Handling: Remove, cap, or treat outliers as separate categories based on domain knowledge and impact on analysis.
Inconsistent Data:
- Correct inconsistencies in data formats, units, or labels.
- Standardize data to ensure uniformity.
Duplicates:
- Identify and remove duplicate records to avoid redundancy.
import pandas as pd
# Handling missing values
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', None], 'Age': [25, 30, 35, None]})
print(df)
# Drop rows with missing values
df_cleaned = df.dropna()
print(df_cleaned)
# Fill missing values: a placeholder for text, the column mean for numbers
df_filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})
print(df_filled)
import pandas as pd
import numpy as np
# Sample data with missing values and inconsistencies
data = {'Age': [25, np.nan, 30, 45, 28],
'Income': [50000, 60000, np.nan, 75000, 48000],
'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)
# Handling missing values
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Imputation with mean
df['Income'] = df['Income'].fillna(df['Income'].median()) # Imputation with median
# Handling inconsistencies (e.g., standardizing city names)
df['City'] = df['City'].str.title() # Capitalize the first letter of each word
Here’s how you can translate these concepts into Python code:
Missing Values
Deletion
import pandas as pd
# Load your dataset
df = pd.read_csv('your_data.csv')
# Option 1: drop rows with missing values
df_rows_dropped = df.dropna()
# Option 2: drop columns with missing values
df_cols_dropped = df.dropna(axis=1)
Imputation
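from sklearn.impute import SimpleImputer
# Create an imputer object
imputer = SimpleImputer(strategy='mean') # or 'median', 'most_frequent'
# Fit and transform the numeric columns only ('mean' and 'median' do not work on text columns)
numeric_cols = df.select_dtypes(include='number').columns
df_imputed = df.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df[numeric_cols])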
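The list above also mentions imputation with predictive models. One simple option is scikit-learn's KNNImputer, which fills each gap from the most similar rows. The sketch below uses made-up sample data, and n_neighbors=2 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Illustrative data: one missing Age and one missing Income
df_knn = pd.DataFrame({
    "Age": [25, np.nan, 30, 45, 28],
    "Income": [50000, 60000, np.nan, 75000, 48000],
})

# Each missing value is replaced by the average of that column
# over the 2 nearest rows (distance is computed on the observed columns)
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df_knn), columns=df_knn.columns)
print(imputed)
```

Because the distance is computed on raw values, columns with large ranges (here Income) dominate the neighbor search, so scaling the features first is often worthwhile.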
Outliers
Identification
import numpy as np
from scipy import stats
# Calculate z-scores
z_scores = stats.zscore(df['your_column'])
# Identify outliers (absolute z-score greater than 3)
outliers = df[np.abs(z_scores) > 3]
# Visualize outliers using a boxplot
import matplotlib.pyplot as plt
plt.boxplot(df['your_column'])
plt.show()
Handling
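# Remove outliers
df_no_outliers = df.drop(outliers.index)
# Cap outliers at the 1st and 99th percentiles (numeric columns only)
numeric_cols = df.select_dtypes(include='number').columns
df_capped = df.copy()
df_capped[numeric_cols] = df[numeric_cols].clip(lower=df[numeric_cols].quantile(0.01), upper=df[numeric_cols].quantile(0.99), axis=1)
# Treat outliers as a separate category by flagging them (1 = outlier, 0 = not)
df['outlier'] = np.where(np.abs(z_scores) > 3, 1, 0)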
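The IQR method mentioned earlier is an alternative to z-scores that does not assume roughly normal data. A minimal sketch on made-up values follows. The 1.5 multiplier is the conventional rule of thumb, not a requirement.

```python
import pandas as pd

# Illustrative values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Interquartile range: the spread of the middle 50% of the data
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

# Capping replaces outliers with the nearest fence instead of removing them
capped = values.clip(lower=lower, upper=upper)

print(values[is_outlier])
print(capped)
```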
Inconsistent Data
Correct inconsistencies
# Correct inconsistent data formats
df['date_column'] = pd.to_datetime(df['date_column'])
# Correct inconsistent units or labels
df['unit_column'] = df['unit_column'].str.replace('unit1', 'unit2')
Standardize data
from sklearn.preprocessing import StandardScaler
# Create a scaler object
scaler = StandardScaler()
# Fit and transform the numeric columns (StandardScaler cannot handle text columns)
df_scaled = scaler.fit_transform(df.select_dtypes(include='number'))
Duplicates
Identify duplicates
# Identify duplicate rows
duplicates = df[df.duplicated()]
# Identify duplicate columns
duplicate_cols = df.T[df.T.duplicated()]
Remove duplicates
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
# Remove duplicate columns
df_no_duplicate_cols = df.T.drop_duplicates().T
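In practice, duplicate customer records often differ in some incidental column, so an exact row comparison misses them. The sketch below uses hypothetical column names and made-up data to show how drop_duplicates can compare only the columns that identify a record.

```python
import pandas as pd

# Illustrative data: customer 2 appears twice with a different signup_source
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["Alice", "Bob", "Bob", "Charlie"],
    "signup_source": ["web", "web", "mobile", "web"],
})

# Exact-match deduplication keeps both rows for customer 2
exact = customers.drop_duplicates()

# Deduplicating on the key column keeps the first row per customer
by_key = customers.drop_duplicates(subset=["customer_id"], keep="first")

print(len(exact), len(by_key))
```

Which row to keep (keep="first", keep="last") is a domain decision, for example preferring the most recent record.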
Note that this is not an exhaustive list of methods, and you may need to use additional techniques depending on your specific dataset and problem.
Data Normalization and Standardization
Data Normalization
Data normalization involves scaling numeric data to a common range, usually between 0 and 1, to prevent features with large ranges from dominating the model.
Example:
- Min-Max Scaling: (x - min) / (max - min)
  - Original data: [1, 2, 3, 4, 5]
  - Normalized data: [0, 0.25, 0.5, 0.75, 1]
- Z-Score Normalization: (x - mean) / std, using the population standard deviation
  - Original data: [1, 2, 3, 4, 5]
  - Normalized data: [-1.41, -0.71, 0, 0.71, 1.41]
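The two formulas above can be checked directly with NumPy. This short sketch reproduces the numbers in the example. Note that x.std() uses the population standard deviation by default (ddof=0); with the sample standard deviation (ddof=1) the extremes would be about ±1.26 instead.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)

# Min-max scaling: maps the smallest value to 0 and the largest to 1
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: subtract the mean, divide by the population standard deviation
z_score = (x - x.mean()) / x.std()

print(min_max)
print(np.round(z_score, 2))
```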
Data Standardization
In this article, data standardization means transforming data into a standard format, such as converting categorical variables into numerical variables. (In scikit-learn, “standardization” usually refers to z-score scaling, which is what StandardScaler does.)
Example:
- One-Hot Encoding: converting categorical variables into binary vectors
  - Original data: [“red”, “blue”, “green”]
  - Standardized data: [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
- Label Encoding: converting categorical variables into integer codes
  - Original data: [“red”, “blue”, “green”]
  - Standardized data: [0, 1, 2]
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Create data
data = np.array([[1, 2], [2, 4], [3, 6], [4, 8], [5, 10]])
# Normalize data
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(data)
print(normalized_data)
# Standardize data
scaler = StandardScaler()
standardized_data = scaler.fit_transform(data)
print(standardized_data)
import pandas as pd
import numpy as np
# Sample data with missing values and inconsistencies
data = {'Age': [25, np.nan, 30, 45, 28],
'Income': [50000, 60000, np.nan, 75000, 48000],
'City': ['New York', 'Los Angeles', 'New York', 'Chicago', 'Los Angeles']}
df = pd.DataFrame(data)
# Handling missing values
df['Age'] = df['Age'].fillna(df['Age'].mean()) # Imputation with mean
df['Income'] = df['Income'].fillna(df['Income'].median()) # Imputation with median
# Handling inconsistencies (e.g., standardizing city names)
df['City'] = df['City'].str.title() # Capitalize the first letter of each word
# Scale the numeric columns to the [0, 1] range
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['Age', 'Income']] = scaler.fit_transform(df[['Age', 'Income']])
Handling Categorical Data
Handling categorical data is a crucial step in preparing data for AI and ML models. Here are some common techniques used to handle categorical data:
- One-Hot Encoding (OHE): Convert categorical variables into binary vectors, where each category is represented by a 1 or 0.
Example:
Color | One-Hot Encoding |
---|---|
Red | [1, 0, 0] |
Green | [0, 1, 0] |
Blue | [0, 0, 1] |
- Label Encoding: Convert categorical variables into numerical variables, where each category is assigned a unique integer.
Example:
Color | Label Encoding |
---|---|
Red | 0 |
Green | 1 |
Blue | 2 |
- Ordinal Encoding: Convert categorical variables into numerical variables, where the order of the categories matters.
Example:
Size | Ordinal Encoding |
---|---|
Small | 0 |
Medium | 1 |
Large | 2 |
- Binary Encoding: Convert categorical variables into binary strings, where each category is represented by a unique binary sequence.
Example:
Color | Binary Encoding |
---|---|
Red | 00 |
Green | 01 |
Blue | 10 |
- Hashing: Convert categorical variables into numerical variables using a hash function.
Example:
Color | Hashing |
---|---|
Red | 123456 |
Green | 789012 |
Blue | 345678 |
- Embeddings: Convert categorical variables into dense vectors, where each category is represented by a unique vector.
Example:
Color | Embedding |
---|---|
Red | [0.1, 0.2, 0.3] |
Green | [0.4, 0.5, 0.6] |
Blue | [0.7, 0.8, 0.9] |
- Frequency Encoding: Convert categorical variables into numerical variables, where each category is represented by its frequency.
Example:
Color | Frequency Encoding |
---|---|
Red | 0.4 |
Green | 0.3 |
Blue | 0.3 |
Each technique has its advantages and disadvantages, and the choice of technique depends on the specific problem, data, and model being used.
# Create data
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green'], 'Value': [1, 2, 3, 2]})
print(df)
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['Color'])
print(df_encoded)
Here are Python code examples for each of the categorical data handling techniques described above:
1. One-Hot Encoding (OHE)
import pandas as pd
# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# One-hot encode the 'Color' column (dtype=int gives 0/1 instead of True/False)
df_onehot = pd.get_dummies(df, columns=['Color'], dtype=int)
print(df_onehot)
Output (columns are sorted alphabetically):
   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0
2. Label Encoding
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# Label encode the 'Color' column
# LabelEncoder assigns integers in alphabetical order: Blue=0, Green=1, Red=2
le = LabelEncoder()
df['Color'] = le.fit_transform(df['Color'])
print(df)
Output:
   Color
0      2
1      1
2      0
3      2
4      1
3. Ordinal Encoding
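import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
# Create a sample dataframe
df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Small', 'Medium']})
# Ordinal encode the 'Size' column, passing the category order explicitly
# (the default order is alphabetical: Large, Medium, Small)
oe = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
# OrdinalEncoder expects 2D input, hence the double brackets
df['Size'] = oe.fit_transform(df[['Size']]).ravel()
print(df)
Output:
   Size
0   0.0
1   1.0
2   2.0
3   0.0
4   1.0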
4. Binary Encoding
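import pandas as pd
from category_encoders import BinaryEncoder
# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# Binary encode the 'Color' column
be = BinaryEncoder(cols=['Color'])
# fit_transform returns a new DataFrame with one column per binary digit
df_binary = be.fit_transform(df)
print(df_binary)
Output: a DataFrame in which the 'Color' column is replaced by columns of 0s and 1s, one per binary digit (two digits are enough for three categories). Rows with the same color share the same bit pattern; which pattern each color receives is determined by the encoder.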
5. Hashing
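import pandas as pd
from sklearn.feature_extraction import FeatureHasher
# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# Hash the 'Color' column into a fixed number of features
# input_type='string' expects each sample to be an iterable of strings
fh = FeatureHasher(n_features=8, input_type='string')
hashed = fh.transform([[color] for color in df['Color']]).toarray()
print(hashed)
Output: a 5 x 8 array in which each row has a single non-zero entry (+1 or -1) at the column chosen by the hash function. Identical colors produce identical rows.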
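To make the idea behind hashing concrete, here is a minimal sketch of the "hashing trick" using only the standard library. The helper name hash_bucket and the choice of 8 buckets are assumptions for illustration. MD5 is used only because it is deterministic across runs, unlike Python's built-in hash() for strings.

```python
import hashlib


def hash_bucket(category, n_buckets=8):
    """Map a category string to a bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


colors = ["Red", "Green", "Blue", "Red", "Green"]
buckets = [hash_bucket(c) for c in colors]
print(buckets)
```

The same category always lands in the same bucket, and no vocabulary needs to be stored. The trade-off is that two different categories can collide in one bucket, which becomes more likely as the number of buckets shrinks.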
6. Embeddings
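import pandas as pd
from sklearn.decomposition import PCA
# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# PCA needs numeric input, so one-hot encode the column first
one_hot = pd.get_dummies(df['Color'], dtype=float)
# Project the one-hot vectors onto 2 dense dimensions
# (a simple stand-in: embeddings are usually learned while training a model)
pca = PCA(n_components=2)
embedded = pca.fit_transform(one_hot)
print(embedded)
Output: a 5 x 2 array of floating-point values. Rows for the same color are identical.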
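Conceptually, an embedding is a lookup table with one dense vector per category. The sketch below shows only the lookup step, under the assumption of a randomly initialized table. In a real model the table values would be learned during training, and the names index_of, embedding_table, and embed are hypothetical.

```python
import numpy as np

categories = ["Red", "Green", "Blue"]

# Map each category to a row index in the embedding table
index_of = {category: i for i, category in enumerate(categories)}

# Randomly initialized table: one 3-dimensional vector per category
# (a trained model would adjust these values during learning)
rng = np.random.default_rng(seed=0)
embedding_table = rng.normal(size=(len(categories), 3))


def embed(values):
    """Look up the dense vector for each category in values."""
    indices = [index_of[v] for v in values]
    return embedding_table[indices]


vectors = embed(["Red", "Green", "Blue", "Red", "Green"])
print(vectors.shape)
```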
7. Frequency Encoding
import pandas as pd
# Create a sample dataframe
df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Red', 'Green']})
# Frequency encode the 'Color' column: map each category to its relative frequency
df['Color'] = df['Color'].map(df['Color'].value_counts(normalize=True))
print(df)
Output:
   Color
0    0.4
1    0.4
2    0.2
3    0.4
4    0.4
Note that these are just simple examples, and you may need to modify the code to suit your specific use case.
Feature Engineering
Feature engineering involves creating new features from existing ones to improve model performance.
Example:
- Extracting year from date: “2022-01-01” -> 2022
- Creating interaction terms: x1 * x2
- Creating polynomial terms: x^2
These are just a few examples of data preprocessing components. The specific techniques used will depend on the dataset and the problem being solved.
# Create data
df = pd.DataFrame({'Length': [1.5, 2.3, 3.1, 4.7], 'Width': [0.5, 0.8, 1.2, 1.8]})
print(df)
# Create new feature: Area
df['Area'] = df['Length'] * df['Width']
print(df)
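The other feature engineering examples listed above (extracting the year from a date, interaction terms, and polynomial terms) can be sketched the same way. The column names and values here are made up for illustration.

```python
import pandas as pd

# Illustrative data
df_feat = pd.DataFrame({
    "date": ["2022-01-01", "2023-06-15"],
    "x1": [2.0, 3.0],
    "x2": [4.0, 5.0],
})

# Extract the year from a date string
df_feat["year"] = pd.to_datetime(df_feat["date"]).dt.year

# Interaction term: product of two features
df_feat["x1_times_x2"] = df_feat["x1"] * df_feat["x2"]

# Polynomial term: square of a feature
df_feat["x1_squared"] = df_feat["x1"] ** 2

print(df_feat)
```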