This tutorial covers a wide range of pandas operations and advanced concepts with examples that are practical and useful in real-world scenarios. The key topics include:
- Creating DataFrames, Series from various sources.
- Checking and changing data types.
- Looping through DataFrames efficiently.
- Handling missing data.
- Advanced data manipulation techniques like filtering, sorting, grouping, and merging.
- Window functions.
- Error handling and control mechanisms.
In pandas, there are several core data structures designed for handling different types of data, enabling efficient and flexible data manipulation. These data structures include:
- Series
- DataFrame
- Index
- MultiIndex
- Panel (Deprecated)
Each of these structures serves different purposes, from storing one-dimensional data to complex, multi-dimensional datasets.
1. Series
A Series is a one-dimensional array with labels, capable of holding any data type (integers, strings, floats, Python objects, etc.). It can be thought of as a column in a DataFrame or a labeled list. Each item in a Series has a unique index.
Characteristics of Series:
- One-dimensional.
- Typically stores homogeneous data (a single dtype), though an object-dtype Series can hold mixed types.
- Has an index for each element, which can be customized.
Creating a Series:
import pandas as pd
# Creating a Series from a list
data = [10, 20, 30, 40]
s = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(s)
Output:
a 10
b 20
c 30
d 40
dtype: int64
2. DataFrame
A DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns). It is the most commonly used pandas structure for handling datasets, where each column is a Series, and columns can have different data types.
Characteristics of DataFrame:
- Two-dimensional (rows and columns).
- Stores heterogeneous data types (each column can be of a different type).
- Has labeled axes (row index and column labels).
- Flexible and highly customizable.
Creating a DataFrame:
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 27, 22],
'Salary': [70000, 80000, 60000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary
0 Alice 24 70000
1 Bob 27 80000
2 Charlie 22 60000
3. Index
An Index is an immutable data structure that labels the rows and columns of a pandas Series or DataFrame. It is a key part of pandas’ design, allowing for efficient data alignment and retrieval.
Characteristics of Index:
- It is immutable (cannot be changed after creation).
- Allows for efficient label-based data retrieval.
- Supports custom and multi-level indexing.
Creating an Index:
# Creating a custom Index for a DataFrame
index = pd.Index(['a', 'b', 'c'])
data = {'Value': [10, 20, 30]}
df = pd.DataFrame(data, index=index)
print(df)
Output:
Value
a 10
b 20
c 30
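To illustrate the immutability noted above, here is a minimal sketch: assigning to a position of an Index raises a TypeError.
import pandas as pd
index = pd.Index(['a', 'b', 'c'])
try:
    index[0] = 'z'  # Index objects do not support item assignment
except TypeError as e:
    print(e)  # e.g. "Index does not support mutable operations"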
4. MultiIndex
A MultiIndex is an advanced index structure that allows for multiple levels or hierarchical indexing. This enables you to create a DataFrame with more complex indexing, which is useful when working with data that has multiple dimensions (e.g., time series data with both date and time).
Characteristics of MultiIndex:
- Supports hierarchical (multi-level) indexing.
- Facilitates working with higher-dimensional data in a 2D DataFrame.
- Provides powerful data slicing and subsetting capabilities.
Creating a MultiIndex DataFrame:
arrays = [
['A', 'A', 'B', 'B'],
['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Upper', 'Lower'))
data = {'Value': [1, 2, 3, 4]}
df = pd.DataFrame(data, index=index)
print(df)
Output:
Value
Upper Lower
A one 1
two 2
B one 3
two 4
5. Panel (Deprecated)
Panel was a three-dimensional data structure in pandas that allowed for storing 3D data, making it possible to have multiple DataFrames within one structure. However, it was deprecated in pandas 0.25.0 and removed in pandas 1.0.0; the community now recommends using MultiIndex DataFrames or xarray for multidimensional data.
Alternative:
- For handling three or more dimensions, use a MultiIndex DataFrame or the xarray library, which is specifically designed for multidimensional arrays (a brief sketch follows below).
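As a brief sketch of the replacement pattern (the quarter/store data below is hypothetical), two DataFrames can be stacked into a single MultiIndex DataFrame with pd.concat():
import pandas as pd
# Two hypothetical "items" that would once have been Panel slices
q1 = pd.DataFrame({'Sales': [100, 200]}, index=['StoreA', 'StoreB'])
q2 = pd.DataFrame({'Sales': [110, 210]}, index=['StoreA', 'StoreB'])
# Stack them into one DataFrame with a two-level (hierarchical) row index
panel_like = pd.concat({'Q1': q1, 'Q2': q2}, names=['Quarter', 'Store'])
print(panel_like)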
Summary Table
Data Structure | Dimensions | Description | Usage |
---|---|---|---|
Series | 1D | One-dimensional array with labeled indices. | Storing single columns or lists with labels. |
DataFrame | 2D | Two-dimensional, table-like structure with labeled rows and columns. | Primary data structure for handling tabular data. |
Index | 1D | Immutable labels for rows and columns in Series and DataFrames. | Used for indexing and aligning data. |
MultiIndex | 2D (multi-level) | Hierarchical indexing allowing multiple levels of row/column labels. | Organizing complex data, such as time series with multiple dimensions. |
Panel | 3D (Deprecated) | Three-dimensional data structure (removed from pandas). | Use MultiIndex DataFrames or xarray instead for multidimensional data. |
Use Cases for Each Data Structure
- Series:
- Ideal for handling individual columns in isolation.
- Useful when working with single-column data or performing calculations on one dimension.
- DataFrame:
- Perfect for most data analysis tasks with two-dimensional data.
- Used for data cleaning, manipulation, transformation, and visualization.
- Index:
- Helps to uniquely identify each row or column in Series and DataFrames.
- Useful for efficient data selection and alignment.
- MultiIndex:
- Essential for handling multi-dimensional data within a 2D structure.
- Allows for advanced slicing and subsetting in complex datasets.
Creating DataFrames from various sources
Pandas provides various ways to create DataFrames from different data sources. In this section, we’ll go over the most common and practical methods for creating DataFrames using data from:
- Python data structures (lists, dictionaries, tuples, arrays)
- CSV files
- Excel files
- JSON files
- SQL databases
- Web data (HTML tables)
- Other sources (e.g., clipboard, API)
1. Creating DataFrames from Python Data Structures
1.1. From a Dictionary
A dictionary in Python is a very common way to create a DataFrame, where keys represent column names and values represent data.
import pandas as pd
# Example 1: Dictionary of lists
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
fruits = pd.DataFrame({'Apple': [30], 'Banana': [21]})
Why doesn't fruits = pd.DataFrame({'Apple': 30, 'Banana': 21}) work?
The reason fruits = pd.DataFrame({'Apple': 30, 'Banana': 21}) doesn't work as expected is due to how pandas' DataFrame constructor handles dictionary inputs.
Short answer:
When passing a dictionary to pd.DataFrame(), pandas expects the dictionary values to be lists or arrays, not scalars.
Long answer:
When you pass a dictionary to pd.DataFrame(), pandas treats each dictionary key as a column name and each corresponding value as a column of data. Pandas expects the column data to be:
- Lists: [value1, value2, ..., valueN]
- Arrays: np.array([value1, value2, ..., valueN])
- Series: pd.Series([value1, value2, ..., valueN])
However, in your example, {'Apple': 30, 'Banana': 21}, the dictionary values are scalars (integers), not lists or arrays. Pandas doesn't know how to handle scalar values as column data.
What happens when you pass scalars:
When you pass scalars, pandas will attempt to create a DataFrame with a single row, but it will raise a ValueError because it expects an index (row label) for that single row.
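A quick demonstration of that error (the exact message may vary slightly between pandas versions):
import pandas as pd
try:
    fruits = pd.DataFrame({'Apple': 30, 'Banana': 21})
except ValueError as e:
    print(e)  # "If using all scalar values, you must pass an index"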
Why {'Apple': [30], 'Banana': [21]} works:
By wrapping the scalar values in lists ([30] and [21]), you're providing pandas with the expected input format. Pandas creates a DataFrame with a single row and two columns, where each column contains a single value.
Alternative solution:
If you prefer not to wrap scalars in lists, you can use the pd.DataFrame() constructor with the index parameter to specify the row label:
fruits = pd.DataFrame({'Apple': 30, 'Banana': 21}, index=[0])
This creates a DataFrame with a single row labeled 0.
Best practice:
For consistency and clarity, it’s recommended to always pass lists or arrays as dictionary values when creating DataFrames:
fruits = pd.DataFrame({'Apple': [30], 'Banana': [21]})
1.2. From a List of Lists
You can create a DataFrame by passing a list of lists and specifying column names.
# Example 2: List of lists
data = [
['Alice', 25, 50000],
['Bob', 30, 60000],
['Charlie', 35, 70000]
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)
1.3. From a List of Dictionaries
Each dictionary in the list represents a row, where keys are column names.
# Example 3: List of dictionaries
data = [
{'Name': 'Alice', 'Age': 25, 'Salary': 50000},
{'Name': 'Bob', 'Age': 30, 'Salary': 60000},
{'Name': 'Charlie', 'Age': 35, 'Salary': 70000}
]
df = pd.DataFrame(data)
print(df)
1.4. From Tuples
A DataFrame can also be created from a list of tuples.
# Example 4: List of tuples
data = [
('Alice', 25, 50000),
('Bob', 30, 60000),
('Charlie', 35, 70000)
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)
data = [('Alice', 25, 50000), ('Bob', 30, 60000), ('Charlie', 35, 70000)]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
So how is a tuple represented in pandas in the above example?
data = [
('Alice', 25, 50000),
('Bob', 30, 60000),
('Charlie', 35, 70000)
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)
Each tuple (('Alice', 25, 50000), etc.) represents a row in the DataFrame. The tuple elements are not treated as scalar column data in this context; instead, they are the values of one row.
Pandas treats each tuple as a single row, where:
- The first element of the tuple ('Alice') corresponds to the 'Name' column.
- The second element (25) corresponds to the 'Age' column.
- The third element (50000) corresponds to the 'Salary' column.
By passing a list of tuples, you’re providing Pandas with a clear indication of the row structure.
Under the hood, pandas iterates over the outer list and uses each tuple's elements, in positional order, as the values of one row, aligned with the supplied column names.
Equivalent representation:
You can achieve the same result using lists instead of tuples:
data = [
['Alice', 25, 50000],
['Bob', 30, 60000],
['Charlie', 35, 70000]
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)
Or, using dictionaries:
data = [
{'Name': 'Alice', 'Age': 25, 'Salary': 50000},
{'Name': 'Bob', 'Age': 30, 'Salary': 60000},
{'Name': 'Charlie', 'Age': 35, 'Salary': 70000}
]
df = pd.DataFrame(data)
print(df)
All three representations (tuples, lists, and dictionaries) convey the same information to Pandas: a collection of rows with corresponding column values.
1.5. From a NumPy Array
You can create a DataFrame using a NumPy array.
import numpy as np
# Example 5: NumPy array
data = np.array([[25, 50000], [30, 60000], [35, 70000]])
df = pd.DataFrame(data, columns=['Age', 'Salary'])
print(df)
2. Creating DataFrames from Files
2.1. From a CSV File
The most common way to create a DataFrame is by loading data from a CSV file.
# Example 6: CSV file
df = pd.read_csv('path/to/file.csv')
print(df.head())
- Additional options (combined in the sketch below):
  - Use sep=";" for custom delimiters.
  - Use index_col=0 to set a specific column as the index.
2.2. From an Excel File
You can read data from Excel files using pd.read_excel().
# Example 7: Excel file
df = pd.read_excel('path/to/file.xlsx', sheet_name='Sheet1')
print(df.head())
2.3. From a JSON File
Data can also be read from JSON files.
# Example 8: JSON file
df = pd.read_json('path/to/file.json')
print(df.head())
3. Creating DataFrames from SQL Databases
3.1. From SQL Databases
You can fetch data from a SQL database using pd.read_sql() or pd.read_sql_query().
import sqlite3
# Example 9: SQL database (SQLite)
conn = sqlite3.connect('my_database.db')
df = pd.read_sql('SELECT * FROM my_table', conn)
print(df.head())
- You can also connect to databases like MySQL, PostgreSQL, etc., using appropriate connectors like mysql-connector or psycopg2 (see the sketch below).
4. Creating DataFrames from Web Data
4.1. From HTML Tables
You can directly extract tables from a webpage using pd.read_html(). This returns a list of DataFrames, one for each table found.
# Example 10: Web data (HTML tables)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
dfs = pd.read_html(url)
print(dfs[0].head()) # Display the first table
5. Creating DataFrames from the Clipboard
5.1. From Clipboard
You can copy data (e.g., from a CSV or Excel table) to your clipboard and read it into a DataFrame using pd.read_clipboard().
# Example 11: Clipboard
df = pd.read_clipboard()
print(df)
6. Creating DataFrames Programmatically
6.1. Empty DataFrame
An empty DataFrame can be created by simply calling the DataFrame constructor.
# Example 12: Empty DataFrame
df = pd.DataFrame()
print(df)
6.2. With Initial Columns
You can create an empty DataFrame with predefined columns.
# Example 13: DataFrame with predefined columns
df = pd.DataFrame(columns=['Name', 'Age', 'Salary'])
print(df)
6.3. Appending Rows to a DataFrame
You can programmatically add rows to a DataFrame. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat() is the recommended approach.
# Example 14: Appending rows with pd.concat()
df = pd.DataFrame(columns=['Name', 'Age', 'Salary'])
new_rows = pd.DataFrame([
    {'Name': 'Alice', 'Age': 25, 'Salary': 50000},
    {'Name': 'Bob', 'Age': 30, 'Salary': 60000}
])
df = pd.concat([df, new_rows], ignore_index=True)
print(df)
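As an alternative, here is a minimal sketch for adding single rows in place via .loc, assuming a default RangeIndex starting at 0:
import pandas as pd
# Appending rows one at a time with .loc (assumes a default RangeIndex)
df = pd.DataFrame(columns=['Name', 'Age', 'Salary'])
df.loc[len(df)] = ['Alice', 25, 50000]
df.loc[len(df)] = ['Bob', 30, 60000]
print(df)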
7. Creating DataFrames from APIs
7.1. From an API Response (JSON)
You can create a DataFrame from JSON responses from APIs.
import requests
# Example 15: API Response (JSON)
url = 'https://api.example.com/data'
response = requests.get(url)
data = response.json()
# Convert JSON data to DataFrame
df = pd.DataFrame(data)
print(df.head())
8. Creating DataFrames with MultiIndex
8.1. MultiIndex DataFrame
You can create a DataFrame with hierarchical indexing (MultiIndex).
# Example 16: MultiIndex
arrays = [['North', 'North', 'South', 'South'], ['City1', 'City2', 'City3', 'City4']]
index = pd.MultiIndex.from_arrays(arrays, names=('Region', 'City'))
data = {'Population': [100000, 200000, 150000, 130000], 'GDP': [300, 400, 500, 600]}
df = pd.DataFrame(data, index=index)
print(df)
There are many ways to create pandas DataFrames from various sources and data structures, including:
- Python data structures: lists, dictionaries, tuples, arrays.
- Files: CSV, Excel, JSON.
- SQL databases: SQLite, MySQL, PostgreSQL.
- Web data: HTML tables.
- Clipboard: Copy-pasting directly into a DataFrame.
- APIs: Fetching JSON data from web APIs
Viewing, accessing and checking DataFrames in Pandas
In pandas, viewing, accessing, checking data types, and generating descriptive statistics are fundamental steps when working with DataFrames. These operations help in understanding the structure, content, and type of the data before performing any data analysis or manipulation.
Let’s explore these concepts with detailed explanations, functions, and examples.
1. Viewing Data in pandas DataFrames
Before performing any transformations or analysis, it’s important to view the data to understand its structure and content.
1.1. Viewing the First or Last Rows
- df.head(n): Returns the first n rows (default: 5 rows).
- df.tail(n): Returns the last n rows (default: 5 rows).
import pandas as pd
# Example DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 70000, 80000, 55000]}
df = pd.DataFrame(data)
# View the first 3 rows
print(df.head(3))
# View the last 2 rows
print(df.tail(2))
1.2. Viewing Specific Columns
You can select and view specific columns of the DataFrame.
- df['column_name']: Access a single column.
- df[['col1', 'col2']]: Access multiple columns.
# View the 'Name' column
print(df['Name'])
# View 'Name' and 'Salary' columns
print(df[['Name', 'Salary']])
1.3. Viewing DataFrame Shape
- df.shape: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
# Get the shape of the DataFrame
print(df.shape) # Output: (5, 3) -> 5 rows, 3 columns
1.4. Viewing DataFrame Columns
- df.columns: Returns an Index object containing the column names.
# View column names
print(df.columns)
1.5. Viewing Data Types of Columns
- df.dtypes: Returns the data types of each column in the DataFrame.
# View data types of each column
print(df.dtypes)
2. Accessing Data in pandas DataFrames
There are various ways to access specific rows and columns in pandas.
2.1. Accessing Rows by Index
You can access rows using iloc[] (integer-location) or loc[] (label-location).
- df.iloc[]: Access rows by their integer position.
- df.loc[]: Access rows and columns by label.
# Access the first row (index 0)
print(df.iloc[0])
# Access row where index is 2
print(df.loc[2])
2.2. Accessing Rows and Columns
- df.loc[row, column]: Access a specific element using labels.
- df.iloc[row, column]: Access a specific element using integer positions.
# Access the 'Salary' of the second row
print(df.loc[1, 'Salary'])
# Access the 'Name' of the third row
print(df.iloc[2, 0])
2.3. Accessing Rows Based on Conditions
You can filter rows based on conditions (boolean indexing).
# Access rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
3. Checking and Changing Data Types of Columns
Understanding the data types of columns is crucial when performing data analysis, as different operations are valid for different data types.
3.1. Checking Data Types
- df.dtypes: Returns the data type of each column.
# Check the data types of all columns
print(df.dtypes)
3.2. Changing Data Types
You can change the data type of columns using astype().
# Convert 'Age' column to float
df['Age'] = df['Age'].astype(float)
print(df.dtypes)
# Convert 'Salary' column to string
df['Salary'] = df['Salary'].astype(str)
print(df.dtypes)
3.3. Converting to Datetime
You can convert columns to datetime using pd.to_datetime(). The example below assumes the DataFrame has a 'Hire_Date' column containing date strings.
# Convert the 'Hire_Date' column to datetime format
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
print(df.dtypes)
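Since the running example DataFrame has no date column, here is a small self-contained sketch with hypothetical hire dates:
import pandas as pd
# Hypothetical hire dates as strings
df_dates = pd.DataFrame({'Hire_Date': ['2021-01-15', '2022-06-01', '2023-03-20']})
df_dates['Hire_Date'] = pd.to_datetime(df_dates['Hire_Date'])
print(df_dates.dtypes)  # Hire_Date becomes datetime64[ns]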
4. Descriptive Statistics in pandas
Descriptive statistics allow you to quickly summarize the central tendency, dispersion, and shape of a dataset’s distribution.
4.1. Generating Descriptive Statistics
- df.describe(): Provides summary statistics (count, mean, std, min, max, etc.) for numerical columns.
- df.describe(include='all'): Includes both numerical and categorical columns in the summary.
# Summary statistics for numerical columns
print(df.describe())
# Summary statistics including categorical columns
print(df.describe(include='all'))
4.2. Summary for Specific Columns
You can generate descriptive statistics for specific columns.
# Summary statistics for 'Salary' column
print(df['Salary'].describe())
4.3. Counting Unique Values
- df['column'].nunique(): Counts the number of unique values in a column.
- df['column'].unique(): Lists the unique values in a column.
# Count unique values in the 'Age' column
print(df['Age'].nunique())
# Get unique values in the 'Name' column
print(df['Name'].unique())
4.4. Counting Frequency of Values
- df['column'].value_counts(): Returns a Series containing counts of unique values.
# Get frequency counts of unique values in 'Age' column
print(df['Age'].value_counts())
4.5. Correlation Between Columns
- df.corr(): Computes pairwise correlation of numerical columns. In pandas 2.0+, pass numeric_only=True when the DataFrame also contains non-numeric columns.
# Compute correlation matrix for numeric columns
print(df.corr(numeric_only=True))
5. Functions for Displaying Data
In pandas, there are several functions for displaying data in a structured and readable format.
5.1. df.head() and df.tail()
These functions display the first or last n rows of the DataFrame.
# Display first 5 rows
print(df.head())
# Display last 3 rows
print(df.tail(3))
5.2. df.sample()
Randomly sample rows from the DataFrame.
# Randomly sample 3 rows
print(df.sample(3))
5.3. df.to_string()
Converts the DataFrame to a string, useful for displaying the entire DataFrame in environments that truncate the output.
# Display entire DataFrame as a string
print(df.to_string())
5.4. Viewing Large DataFrames
- pd.set_option(): Configure pandas options to display more rows or columns in large DataFrames.
# Set pandas option to display more rows
pd.set_option('display.max_rows', 100)
# Set pandas option to display more columns
pd.set_option('display.max_columns', 20)
# Display the DataFrame with updated options
print(df)
5.5. Transposing Data
- df.T: Transposes the DataFrame (swaps rows and columns).
# Transpose the DataFrame
print(df.T)
Summary
Here’s a quick summary of key operations for viewing, accessing, checking data types, and descriptive statistics in pandas DataFrames:
Viewing and Accessing Data:
- head(), tail(): Viewing first/last rows.
- shape: DataFrame dimensions.
- columns, dtypes: Viewing columns and data types.
- iloc[], loc[]: Accessing rows and columns.
- Boolean indexing: Filtering data based on conditions.
Checking and Changing Data Types:
- dtypes: Check column data types.
- astype(): Change column data types.
- to_datetime(): Convert to datetime format.
Descriptive Statistics:
- describe(): Summary statistics.
- nunique(), unique(), value_counts(): Working with unique values and frequency counts.
- corr(): Correlation between numerical columns.
These operations help in quickly exploring, inspecting, and understanding data before moving on to deeper analysis or transformations.
Access, Filter, and Retrieve specific rows and columns from a pandas DataFrame: Selection in pandas
Selection in pandas refers to the various ways in which you can access, filter, and retrieve specific rows and columns from a pandas DataFrame or Series. pandas offers a variety of methods for selecting data, which include basic indexing, label-based indexing, conditional selection, and more advanced techniques.
Let’s cover all the important selection techniques with explanations and examples.
1. Selecting Columns in pandas
1.1. Selecting a Single Column
You can select a single column from a DataFrame by using df['column_name'] or df.column_name.
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
# Selecting a single column (two methods)
print(df['Name']) # Method 1
print(df.Name) # Method 2
1.2. Selecting Multiple Columns
To select multiple columns, you pass a list of column names to df[].
# Select 'Name' and 'Salary' columns
print(df[['Name', 'Salary']])
2. Selecting Rows in pandas
2.1. Selecting Rows by Index Position with iloc[]
iloc[] is used to select rows and columns by their integer position.
- df.iloc[row_index, column_index]: Access a specific row and column by position.
# Select the first row (index 0)
print(df.iloc[0])
# Select the first two rows
print(df.iloc[:2])
# Select the value from the second row and 'Salary' column
print(df.iloc[1, 2])
2.2. Selecting Rows by Label with loc[]
loc[] is used to select rows and columns by labels (index labels or column names).
- df.loc[row_label, column_label]: Access a specific row and column by label.
# Select the row with index 1 (Bob)
print(df.loc[1])
# Select the 'Salary' of Bob (index 1)
print(df.loc[1, 'Salary'])
# Select the 'Name' and 'Salary' of the row where index is 2
print(df.loc[2, ['Name', 'Salary']])
3. Conditional Selection (Boolean Indexing)
You can use boolean conditions to filter rows based on column values.
3.1. Filtering Rows Based on Conditions
# Select rows where Age is greater than 30
print(df[df['Age'] > 30])
# Select rows where Salary is less than or equal to 60000
print(df[df['Salary'] <= 60000])
3.2. Combining Multiple Conditions
You can combine multiple conditions using the & (AND) or | (OR) operators, and you must enclose each condition in parentheses.
# Select rows where Age > 30 and Salary > 60000
print(df[(df['Age'] > 30) & (df['Salary'] > 60000)])
# Select rows where Age < 40 or Salary < 60000
print(df[(df['Age'] < 40) | (df['Salary'] < 60000)])
3.3. Filtering Rows Based on String Conditions
You can filter rows based on string values using str.contains(). Matching is case-sensitive by default; pass case=False for case-insensitive matching.
# Select rows where 'Name' contains the letter 'a'
print(df[df['Name'].str.contains('a')])
4. Selecting by Index
You can set an index and then select data based on the index.
4.1. Setting a New Index
# Set 'Name' as the new index
df.set_index('Name', inplace=True)
# Now select row by 'Name' (which is the new index)
print(df.loc['Alice'])
4.2. Resetting the Index
You can reset the index back to the default (integer-based) index using reset_index().
# Reset the index back to default
df.reset_index(inplace=True)
print(df)
5. Selecting Data with at[] and iat[]
5.1. Using at[] for Scalar Access by Label
at[] is used to access a single value by label (similar to loc[] but optimized for single-value access).
# Get the 'Salary' of the row with index 1 (Bob)
print(df.at[1, 'Salary'])
5.2. Using iat[] for Scalar Access by Position
iat[] is used to access a single value by position (similar to iloc[] but optimized for single-value access).
# Get the 'Salary' of the second row (index 1)
print(df.iat[1, 2])
6. Selecting Rows and Columns by Data Type
6.1. Selecting Columns by Data Type
You can select columns based on their data type using select_dtypes().
# Select only numeric columns
numeric_df = df.select_dtypes(include='number')
print(numeric_df)
# Select only object (string) columns
string_df = df.select_dtypes(include='object')
print(string_df)
7. Selecting Specific Data Based on a List of Labels
You can filter rows or columns based on a list of labels.
7.1. Selecting Rows Based on a List of Indexes
# Select rows with index 0 and 2
print(df.loc[[0, 2]])
7.2. Selecting Columns Based on a List of Column Names
# Select 'Name' and 'Age' columns
print(df[['Name', 'Age']])
8. Slicing Rows and Columns
You can slice rows and columns using iloc[] and loc[] to select a range of rows or columns.
8.1. Slicing Rows with iloc[]
# Select the first 3 rows
print(df.iloc[:3])
# Select rows from index 1 to 3 (exclusive)
print(df.iloc[1:3])
8.2. Slicing Columns with iloc[]
# Select all rows and the first two columns
print(df.iloc[:, :2])
# Select rows 0-2 and columns 1-2
print(df.iloc[0:3, 1:3])
8.3. Slicing Rows and Columns with loc[]
# Select rows 0 to 2 and columns 'Name' and 'Age'
print(df.loc[0:2, ['Name', 'Age']])
9. Selecting Data by Index Ranges
You can select data based on index ranges using loc[]. Unlike iloc[], loc[] slices include both endpoints.
# Select rows from index 1 to 3 (inclusive)
print(df.loc[1:3])
10. Advanced Selections with query()
The query() method allows you to select rows based on complex conditions using SQL-like syntax.
# Select rows where Age > 30 and Salary < 80000
filtered_df = df.query('Age > 30 and Salary < 80000')
print(filtered_df)
11. Masking Data
You can use mask() to replace data that meets a condition with a substitute value (NaN by default).
# Mask Salary values greater than 60000 with NaN
df_masked = df.mask(df['Salary'] > 60000)
print(df_masked)
12. Selecting Missing Data (NaN values)
You can select rows where specific columns have missing data (NaN values).
# Select rows where 'Salary' has NaN values
missing_salary = df[df['Salary'].isna()]
print(missing_salary)
In pandas, selection can be done in various ways depending on what you want to access:
- Columns: Select one or more columns using df['col'] or df[['col1', 'col2']].
- Rows: Select rows using iloc[] (by position) or loc[] (by label).
- Conditional selection: Filter rows based on conditions using boolean indexing (df[df['col'] > value]).
- Indexing: Use set_index(), loc[], and reset_index() for selection based on index labels.
- Slicing: Slice rows and columns with iloc[] or loc[].
- Advanced queries: Use query() for SQL-like row filtering.
These selection techniques are fundamental in data wrangling and allow you to manipulate and analyze your data efficiently.
Merging, joining, concatenating, comparing, Sorting and Advanced Filtering DataFrames
In pandas, merging, joining, concatenating, and comparing DataFrames are common tasks used to combine or compare datasets. Each operation is tailored for specific use cases, and understanding the difference between them is important for efficient data manipulation.
Let’s explore each operation, covering both basic and advanced use cases, along with detailed examples.
1. Merging DataFrames (pd.merge())
merge() is used for combining two DataFrames based on one or more common keys. It's similar to SQL joins (INNER, LEFT, RIGHT, FULL OUTER).
1.1. Inner Join (Default)
An inner join returns only the rows where the values in the joining columns match in both DataFrames.
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Salary': [70000, 80000, 90000, 100000]
})
# Merge with inner join (default)
df_inner = pd.merge(df1, df2, on='ID')
print(df_inner)
Output:
 ID Name Salary
0 3 Charlie 70000
1 4 David 80000
1.2. Left Join
A left join returns all rows from the left DataFrame and only matching rows from the right DataFrame. Non-matching rows from the right DataFrame are filled with NaN.
# Left join
df_left = pd.merge(df1, df2, on='ID', how='left')
print(df_left)
Output:
 ID Name Salary
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 70000.0
3 4 David 80000.0
1.3. Right Join
A right join returns all rows from the right DataFrame and only matching rows from the left DataFrame.
# Right join
df_right = pd.merge(df1, df2, on='ID', how='right')
print(df_right)
Output:
 ID Name Salary
0 3 Charlie 70000.0
1 4 David 80000.0
2 5 NaN 90000.0
3 6 NaN 100000.0
1.4. Full Outer Join
A full outer join returns all rows when there is a match in either the left or right DataFrame. Missing values are filled with NaN
.
# Full outer join
df_outer = pd.merge(df1, df2, on='ID', how='outer')
print(df_outer)
Output:
 ID Name Salary
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 70000.0
3 4 David 80000.0
4 5 NaN 90000.0
5 6 NaN 100000.0
1.5. Merging on Multiple Columns
You can merge DataFrames on multiple columns by passing a list to the on parameter.
# Sample DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Dept': ['HR', 'Finance', 'IT', 'HR']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Dept': ['IT', 'HR', 'Finance', 'IT'],
'Salary': [70000, 80000, 90000, 100000]
})
# Merge on multiple columns
df_multi = pd.merge(df1, df2, on=['ID', 'Dept'], how='inner')
print(df_multi)
2. Joining DataFrames (df.join())
The join() method is similar to merge(), but it combines DataFrames based on their index rather than a column key, which is convenient when two DataFrames share meaningful index labels.
2.1. Left Join by Default
By default, join() performs a left join using the index of the DataFrame.
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Salary': [50000, 60000, 70000]
}, index=[1, 2, 3])
df2 = pd.DataFrame({
'Dept': ['HR', 'Finance', 'IT']
}, index=[1, 2, 4])
# Left join using join()
df_joined = df1.join(df2)
print(df_joined)
Output:
 Name Salary Dept
1 Alice 50000 HR
2 Bob 60000 Finance
3 Charlie 70000 NaN
2.2. Specifying Different Types of Joins in join()
You can specify the type of join using the how parameter (e.g., 'left', 'right', 'inner', 'outer').
# Outer join using join()
df_outer_join = df1.join(df2, how='outer')
print(df_outer_join)
3. Concatenating DataFrames (pd.concat())
The concat() function is used to concatenate DataFrames along a particular axis, either vertically (by rows) or horizontally (by columns).
3.1. Concatenating Vertically (Default)
By default, concat() stacks DataFrames vertically (row-wise).
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Salary': [70000, 80000]})
# Concatenate vertically (row-wise)
df_concat = pd.concat([df1, df2])
print(df_concat)
Output:
 Name Salary
0 Alice 50000
1 Bob 60000
0 Charlie 70000
1 David 80000
3.2. Concatenating Horizontally (Column-wise)
You can concatenate DataFrames horizontally (by columns) using axis=1.
# Concatenate horizontally (column-wise)
df_concat_cols = pd.concat([df1, df2], axis=1)
print(df_concat_cols)
Output:
 Name Salary Name Salary
0 Alice 50000 Charlie 70000
1 Bob 60000 David 80000
3.3. Resetting Index After Concatenation
The index is preserved when concatenating, but you can reset it using ignore_index=True.
# Concatenate vertically and reset index
df_concat_reset = pd.concat([df1, df2], ignore_index=True)
print(df_concat_reset)
Output:
 Name Salary
0 Alice 50000
1 Bob 60000
2 Charlie 70000
3 David 80000
4. Comparing DataFrames (pd.DataFrame.compare())
compare() is used to compare two DataFrames and highlight the differences between them.
4.1. Comparing Two DataFrames
# Sample DataFrames
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Salary': [50000, 60000, 70000]
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Salary': [50000, 65000, 70000]
})
# Compare the two DataFrames
df_compare = df1.compare(df2)
print(df_compare)
Output:
 Salary
self other
1 60000 65000
4.2. Including All Columns in the Comparison
You can keep equal values visible in the output (instead of showing them as NaN) using the keep_equal=True parameter; add keep_shape=True to also retain rows and columns without differences.
# Compare and keep equal values visible
df_compare_all = df1.compare(df2, keep_equal=True)
print(df_compare_all)
5. Sorting in pandas
You can sort a DataFrame by one or more columns in ascending or descending order using sort_values().
5.1. Sorting by a Single Column
- sort_values(by='column_name'): Sort the DataFrame by a single column in ascending order (default).
# Sample DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000]
})
# Sort by 'Age' in ascending order
df_sorted = df.sort_values(by='Age')
print(df_sorted)
Output:
 Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
3 David 40 80000
5.2. Sorting by Multiple Columns
You can sort by multiple columns, specifying whether each column should be sorted in ascending or descending order.
# Sort by 'Age' in ascending order and 'Salary' in descending order
df_sorted_multi = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(df_sorted_multi)
Output (identical to the single-column sort here, because all Age values are unique, so the Salary tie-breaker never applies):
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
3 David 40 80000
5.3. Sorting by Index
- sort_index(): Sort the DataFrame by its index.
# Sort by index
df_sorted_index = df.sort_index()
print(df_sorted_index)
6. Advanced Filtering in pandas
Advanced filtering allows you to retrieve specific subsets of data based on complex conditions, such as combining multiple conditions, filtering by partial string matches, and more.
6.1. Filtering Rows Based on Multiple Conditions
You can filter DataFrames based on multiple conditions by combining them with & (AND) or | (OR). Each condition must be enclosed in parentheses.
# Filter rows where Age > 30 and Salary > 60000
df_filtered = df[(df['Age'] > 30) & (df['Salary'] > 60000)]
print(df_filtered)
Output:
Name Age Salary
2 Charlie 35 70000
3 David 40 80000
6.2. Filtering Rows Based on String Conditions
You can filter rows based on string conditions, such as checking whether a column contains a certain substring using str.contains().
# Filter rows where 'Name' contains the letter 'a' (case-insensitive, so 'Alice' matches too)
df_filtered_str = df[df['Name'].str.contains('a', case=False)]
print(df_filtered_str)
Output:
Name Age Salary
0 Alice 25 50000
2 Charlie 35 70000
3 David 40 80000
6.3. Filtering Rows Based on Ranges
You can filter rows where column values fall within a certain range using between().
# Filter rows where 'Salary' is between 55000 and 75000
df_filtered_range = df[df['Salary'].between(55000, 75000)]
print(df_filtered_range)
Output:
 Name Age Salary
1 Bob 30 60000
2 Charlie 35 70000
6.4. Using query() for Advanced Filtering
query() allows you to use SQL-like queries to filter your DataFrame based on complex conditions.
# Use query to filter rows where Age > 30 and Salary < 80000
df_filtered_query = df.query('Age > 30 and Salary < 80000')
print(df_filtered_query)
Output:
 Name Age Salary
2 Charlie 35 70000
Combining Sorting, Filtering, and Merging
Now, let’s see how sorting and filtering can be combined with merge, join, concatenation, and comparison operations.
6.5. Filtering and Sorting with Merge
After merging two DataFrames, you can filter and sort the result.
# Sample DataFrames for merging
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Salary': [70000, 80000, 90000, 100000]
})
# Merge DataFrames (inner join)
df_merged = pd.merge(df1, df2, on='ID', how='inner')
# Filter rows where Salary > 75000 and sort by 'Salary'
df_merged_filtered_sorted = df_merged[df_merged['Salary'] > 75000].sort_values(by='Salary', ascending=False)
print(df_merged_filtered_sorted)
Output:
 ID Name Salary
1 4 David 80000
6.6. Filtering After Concatenation
You can filter rows after concatenating multiple DataFrames.
# Sample DataFrames for concatenation
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Salary': [70000, 80000]})
# Concatenate DataFrames
df_concat = pd.concat([df1, df2])
# Filter rows where 'Salary' > 60000
df_concat_filtered = df_concat[df_concat['Salary'] > 60000]
print(df_concat_filtered)
Output:
 Name Salary
2 Charlie 70000
3 David 80000
Full Example: Sorting, Filtering, Merging, and Concatenation
Here’s a complete example that demonstrates how to use sorting, filtering, and merging in a real-world scenario:
# Sample DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Dept': ['HR', 'IT', 'Finance', 'Marketing'],
'Salary': [70000, 80000, 90000, 100000]
})
# Merge the DataFrames
df_merged = pd.merge(df1, df2, on='ID', how='outer')
# Filter rows where Salary > 60000 and Dept is 'HR' or 'IT'
df_filtered = df_merged[(df_merged['Salary'] > 60000) & (df_merged['Dept'].isin(['HR', 'IT']))]
# Sort the result by 'Salary' in descending order
df_sorted = df_filtered.sort_values(by='Salary', ascending=False)
print(df_sorted)
Output:
 ID Name Dept Salary
3 4 David IT 80000
2 3 Charlie HR 70000
- Sorting (sort_values()): Sort by one or more columns, or by index using sort_index().
- Advanced Filtering: Filter rows based on complex conditions using boolean indexing, str.contains(), between(), and query().
. - Combining Sorting and Filtering with Merge/Join: You can merge/join DataFrames and then apply sorting and filtering to the combined dataset.
- Concatenation: After concatenation, apply filtering to refine the data.
These operations allow you to efficiently manipulate data using pandas, combining data from multiple sources and selecting, sorting, and filtering relevant information.
Pivoting, melting, and transposing operations in pandas
In this section, we will dive deeper into the pivoting, melting, and transposing operations in pandas. These are essential for reshaping data in DataFrames, enabling us to structure data for analysis, reporting, or visualization purposes.
1. Pivoting in pandas
Pivoting refers to reshaping data by turning unique values from one column into multiple columns (wide format). It allows transforming long-form data (where a single row represents a data point) into wide-form data (where multiple columns represent different aspects of the same data point).
1.1. pivot() Function
The pivot() function reshapes data from long to wide format by specifying an index, columns, and values.
Syntax:
df.pivot(index='row_label', columns='column_label', values='value_column')
- index: The column(s) to use as the new DataFrame's index.
- columns: The column(s) to spread across the new DataFrame as headers.
- values: The values to fill the DataFrame with, based on the pivot.
1.2. Example of Pivoting
import pandas as pd
# Sample data in long format
df = pd.DataFrame({
'Name': ['Alice', 'Alice', 'Bob', 'Bob'],
'Year': [2020, 2021, 2020, 2021],
'Sales': [1000, 1100, 2000, 2100]
})
# Pivot the DataFrame to show Sales for each Year as columns
pivot_df = df.pivot(index='Name', columns='Year', values='Sales')
print(pivot_df)
Output:
Year 2020 2021
Name
Alice 1000 1100
Bob 2000 2100
1.3. Use Cases for Pivoting
- Summarizing Data: Pivoting is useful when you want to summarize data across multiple categories and present the data in a more readable, wide format.
- Visualizing Data: It is often used before creating plots or charts, especially when visualizing time-series data for multiple entities.
- Data Aggregation: Pivoting is helpful when you want to compare multiple measures across different dimensions.
1.4. Handling Duplicates in Pivoting
If there are duplicate entries for the same index-column combination, pivot() will raise an error, as demonstrated in the sketch below. To aggregate duplicates instead, use pivot_table().
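A minimal sketch of the duplicate problem (hypothetical data): Alice has two 2020 entries, so pivot() cannot decide which value belongs in the (Alice, 2020) cell.
import pandas as pd
df_dup = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob'],
    'Year': [2020, 2020, 2021],
    'Sales': [1000, 1200, 2000]
})
try:
    df_dup.pivot(index='Name', columns='Year', values='Sales')
except ValueError as e:
    print(e)  # e.g. "Index contains duplicate entries, cannot reshape"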
2. Pivot Table in pandas
pivot_table() works similarly to pivot(), but it allows aggregation when there are duplicate values. It can be seen as a more flexible version of pivot().
2.1. Syntax for pivot_table()
df.pivot_table(index='row_label', columns='column_label', values='value_column', aggfunc='mean')
- aggfunc: The aggregation function to apply (e.g., mean, sum, count, max). Default is mean.
2.2. Example of pivot_table()
df = pd.DataFrame({
'Name': ['Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
'Year': [2020, 2021, 2020, 2021, 2020],
'Sales': [1000, 1100, 2000, 2100, 1800]
})
# Pivot table with sum aggregation for duplicate entries
pivot_table_df = df.pivot_table(index='Name', columns='Year', values='Sales', aggfunc='sum')
print(pivot_table_df)
Output:
Year 2020 2021
Name
Alice 1000 1100
Bob 3800 2100
2.3. Use Cases for pivot_table()
- Handling Duplicates: It is useful when you want to handle duplicate entries by applying an aggregation function (e.g., summing sales across multiple entries).
- Data Summarization: When performing group-by like operations and you want to summarize data into a matrix.
- Flexible Aggregation: You can specify various aggregation functions to summarize data (e.g., sum, average, count).
3. Melting in pandas
Melting is the opposite of pivoting. It reshapes wide-form data into long-form data by “unpivoting” columns back into rows. This is useful when you want to bring multiple columns into a single column, often for plotting or further analysis.
3.1. melt() Function
The melt() function reshapes a DataFrame by transforming columns into rows, returning long-form data.
Syntax:
df.melt(id_vars=['fixed_column'], value_vars=['columns_to_unpivot'], var_name='variable_name', value_name='value_name')
- id_vars: Columns that should remain fixed (not melted).
- value_vars: Columns to unpivot (convert from wide to long).
- var_name: The name of the new "variable" column.
- value_name: The name of the new "value" column.
3.2. Example of Melting
df = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'2020': [1000, 2000],
'2021': [1100, 2100]
})
# Melt the DataFrame from wide to long format
melted_df = df.melt(id_vars=['Name'], value_vars=['2020', '2021'], var_name='Year', value_name='Sales')
print(melted_df)
Output:
 Name Year Sales
0 Alice 2020 1000
1 Bob 2020 2000
2 Alice 2021 1100
3 Bob 2021 2100
3.3. Use Cases for Melting
- Preparing Data for Plotting: Many plotting libraries prefer data in long format, where each row is a single data point.
- Unpivoting Wide Data: When you have data spread across columns and you want to analyze or process it row-wise.
- Handling Multiple Time Periods: When dealing with time series data, you often have columns for each year or month. Melting can help consolidate these columns into a single “period” column.
4. Transposing in pandas
Transposing is the process of swapping rows and columns in a DataFrame, where the index becomes the column headers and vice versa.
4.1. transpose() or T
The transpose() function or the T attribute is used to interchange rows and columns.
Syntax:
df.transpose() # or simply df.T
4.2. Example of Transposing
df = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30],
'Salary': [50000, 60000]
})
# Transpose the DataFrame
transposed_df = df.transpose()
print(transposed_df)
Output:
 0 1
Name Alice Bob
Age 25 30
Salary 50000 60000
4.3. Use Cases for Transposing
- Data Presentation: Sometimes transposing data can make it easier to present or visualize. For example, showing each record’s details in columns rather than rows.
- Analyzing Wide Data: If your data has more columns than rows, transposing can help make the dataset easier to inspect.
- Pivot-Like Operations: In scenarios where you want to transform row-wise data into column-wise data quickly.
Summary of Pivoting, Melting, and Transposing
Operation | Function | Description | Use Case |
---|---|---|---|
Pivoting | pivot() | Reshapes data from long to wide format by turning unique values from a column into new column headers. | Summarizing data, comparing across categories, preparing for visualization. |
Pivot Table | pivot_table() | Similar to pivot() , but allows aggregation for duplicate values and provides more flexibility. | Summarizing data, aggregating data with functions (sum, mean, etc.). |
Melting | melt() | Unpivots data from wide to long format, turning columns into rows. | Preparing data for plotting, handling time series, unpivoting wide-form data. |
Transposing | transpose() | Swaps rows and columns, turning the DataFrame’s index into columns and vice versa. | Analyzing data from different perspectives, data presentation, pivot-like operations. |
Conclusion
- Pivoting reshapes long data into wide data, which is useful for summarization and comparison across categories.
- Pivot tables extend pivoting by allowing aggregation when duplicates are present.
- Melting converts wide data into long data, preparing it for time-series analysis or plotting.
- Transposing swaps rows and columns to change the perspective of your data, helping with analysis or presentation.
These reshaping operations are fundamental to data wrangling and make pandas a versatile library for handling complex data structures.
Doubts / Questions / Use Cases
====================================================
====================================================
How do you find the third highest salaried employee in the Sales department in Python?
To find the third highest salaried employee in the Sales department in Python using pandas, you can follow these steps:
Steps:
- Filter the employees who belong to the Sales department.
- Sort the employees by their salary in descending order.
- Extract the third employee based on the salary ranking.
Example Code:
import pandas as pd
# Sample data for employees
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
'Department': ['Sales', 'Sales', 'Marketing', 'Sales', 'HR', 'Sales'],
'Salary': [50000, 60000, 70000, 80000, 90000, 55000]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Step 1: Filter employees in the Sales department
sales_df = df[df['Department'] == 'Sales']
# Step 2: Sort employees by Salary in descending order
sorted_sales_df = sales_df.sort_values(by='Salary', ascending=False)
# Step 3: Find the third highest salaried employee
third_highest_salary = sorted_sales_df.iloc[2] # Index 2 corresponds to the third entry
# Output the result
print("Third highest salaried employee in Sales department:")
print(third_highest_salary)
Explanation:
- Filter: The DataFrame is filtered to include only rows where the Department is Sales.
- Sort: The filtered DataFrame is sorted by the Salary column in descending order, so the highest salaries appear first.
- Extract: The third entry in the sorted DataFrame is selected using iloc[2] (0-based indexing).
Sample Output:
Third highest salaried employee in Sales department:
Name Frank
Department Sales
Salary 55000
Name: 5, dtype: object
Alternate Approach (If There Are Duplicate Salaries):
If salaries are not unique, and you want to find the third distinct highest salary, you can do the following:
# Step 1: Filter employees in the Sales department
sales_df = df[df['Department'] == 'Sales']
# Step 2: Get unique salaries, sort them in descending order
unique_salaries = sales_df['Salary'].drop_duplicates().sort_values(ascending=False)
# Step 3: Get the third highest salary
third_highest_salary_value = unique_salaries.iloc[2] # Third highest salary value
# Step 4: Find the employee(s) with this salary
third_highest_employee = sales_df[sales_df['Salary'] == third_highest_salary_value]
# Output the result
print("Third highest salaried employee(s) in Sales department:")
print(third_highest_employee)
This approach ensures you get the third distinct highest salary, in case there are multiple employees with the same salary value.
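Another concise option is DataFrame.nlargest(); the sketch below reuses the sample data from above. Note that nlargest() counts duplicate salaries separately (keep='first' by default), so for the third distinct salary the drop_duplicates() approach remains the safer choice.
import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
    'Department': ['Sales', 'Sales', 'Marketing', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 60000, 70000, 80000, 90000, 55000]
})
# Top 3 Sales salaries; the last of them is the third highest
top3 = df[df['Department'] == 'Sales'].nlargest(3, 'Salary')
print(top3.iloc[-1])  # Frank, 55000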
Can we use something like limiting the DataFrame output?
Yes, in pandas, you can limit the number of rows returned from a DataFrame by using head(), tail(), or iloc[]. This is useful when you want to fetch only a limited number of results without needing to manually slice the DataFrame.
How to Limit DataFrame Output in pandas
- head(n): Returns the first n rows of the DataFrame.
- iloc[:n]: Selects the first n rows by integer-location based indexing.
- iloc[] with sort_values(): Limits rows after sorting.
Example: Limiting DataFrame Output After Sorting
Here’s how you can find the third highest salaried employee in the Sales department and limit the DataFrame output to only a few rows:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
'Department': ['Sales', 'Sales', 'Marketing', 'Sales', 'HR', 'Sales'],
'Salary': [50000, 60000, 70000, 80000, 90000, 55000]
}
df = pd.DataFrame(data)
# Step 1: Filter employees in the Sales department
sales_df = df[df['Department'] == 'Sales']
# Step 2: Sort employees by Salary in descending order
sorted_sales_df = sales_df.sort_values(by='Salary', ascending=False)
# Step 3: Limit the result to the first 3 rows using head()
limited_sales_df = sorted_sales_df.head(3)
# Output the result (third highest employee)
print("Top 3 salaried employees in Sales department:")
print(limited_sales_df)
Explanation:
- head(3): Limits the DataFrame to the first 3 rows after sorting by salary.
- You can then use iloc[2] to get the third-highest salary from the limited result.
Sample Output:
Top 3 salaried employees in Sales department:
Name Department Salary
3 David Sales 80000
1 Bob Sales 60000
5 Frank Sales 55000
Alternate Limiting Using iloc[] for Exact Indexing:
You can directly use iloc[] to limit and access the rows you want:
# Fetch only the third row (third highest salaried employee) after sorting
third_highest_salary = sorted_sales_df.iloc[2]
print("Third highest salaried employee in Sales department:")
print(third_highest_salary)
Sample Output:
Third highest salaried employee in Sales department:
Name Frank
Department Sales
Salary 55000
Name: 5, dtype: object
Use Case of Limiting the DataFrame:
Limiting is useful when:
- You want to display the top N results (e.g., top 5 highest salaries).
- You are dealing with large DataFrames and only need to inspect a small portion of the data.
- You want to optimize performance when working with large datasets by avoiding fetching more rows than needed.
Saving a pandas DataFrame to disk
To save a pandas DataFrame to disk, you can use several built-in methods to export data in various formats, such as CSV, Excel, JSON, and SQL. Here are the most common options:
1. Saving DataFrame as CSV
The to_csv() function exports the DataFrame to a CSV (Comma-Separated Values) file.
Syntax:
df.to_csv('file_name.csv', index=False)
- index=False: Excludes row indices in the saved file.
- sep=',': (Optional) Specify the delimiter (default is a comma).
Example:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Save DataFrame as CSV
df.to_csv('output.csv', index=False)
2. Saving DataFrame as Excel
The to_excel() function exports the DataFrame to an Excel (.xlsx) file.
Syntax:
df.to_excel('file_name.xlsx', sheet_name='Sheet1', index=False)
- sheet_name='Sheet1': (Optional) Specify the sheet name.
- index=False: Excludes row indices in the saved file.
Example:
# Save DataFrame as Excel file
df.to_excel('output.xlsx', sheet_name='Data', index=False)
Note: The to_excel() function requires openpyxl for .xlsx files, so make sure it's installed (pip install openpyxl).
3. Saving DataFrame as JSON
The to_json() function exports the DataFrame to a JSON (JavaScript Object Notation) file, useful for web applications.
Syntax:
df.to_json('file_name.json', orient='records')
- orient='records': Each row is exported as a separate JSON object.
Example:
# Save DataFrame as JSON file
df.to_json('output.json', orient='records')
4. Saving DataFrame to SQL Database
The to_sql() function exports the DataFrame to a SQL table in a relational database.
Syntax:
df.to_sql('table_name', con=connection, if_exists='replace', index=False)
- con: The SQLAlchemy engine or database connection object.
- if_exists='replace': Replaces the table if it already exists; use 'append' to add data to the existing table.
- index=False: Excludes row indices in the table.
Example:
from sqlalchemy import create_engine
# Create a database connection (SQLite in this example)
engine = create_engine('sqlite:///my_database.db')
# Save DataFrame to SQL table
df.to_sql('people', con=engine, if_exists='replace', index=False)
5. Saving DataFrame as Parquet
The to_parquet() function exports the DataFrame to a Parquet file, which is a popular columnar storage format, especially in big data environments.
Syntax:
df.to_parquet('file_name.parquet')
Example:
# Save DataFrame as Parquet file
df.to_parquet('output.parquet')
Note: Saving as Parquet requires the pyarrow or fastparquet library (pip install pyarrow or pip install fastparquet).
6. Saving DataFrame as HTML
The to_html() function saves the DataFrame as an HTML table, useful for displaying data on web pages.
Syntax:
df.to_html('file_name.html', index=False)
Example:
# Save DataFrame as HTML file
df.to_html('output.html', index=False)
Summary
Format | Method | Example Syntax |
---|---|---|
CSV | to_csv() | df.to_csv('file.csv', index=False) |
Excel | to_excel() | df.to_excel('file.xlsx', index=False) |
JSON | to_json() | df.to_json('file.json', orient='records') |
SQL | to_sql() | df.to_sql('table', con=engine, index=False) |
Parquet | to_parquet() | df.to_parquet('file.parquet') |
HTML | to_html() | df.to_html('file.html', index=False) |
Each method provides a straightforward way to export pandas DataFrames in various formats depending on your needs.
Pandas Cheat Sheet
Here's a pandas cheat sheet to quickly revise the most important concepts, commands, and functions. It's structured by category for easy navigation.
1. Basics and Setup
Operation | Command | Description |
---|---|---|
Import pandas | import pandas as pd | Import pandas library. |
Read CSV | pd.read_csv('file.csv') | Read CSV file into DataFrame. |
Read Excel | pd.read_excel('file.xlsx') | Read Excel file into DataFrame. |
Save to CSV | df.to_csv('file.csv') | Save DataFrame to CSV. |
Save to Excel | df.to_excel('file.xlsx') | Save DataFrame to Excel. |
2. Data Inspection
Operation | Command | Description |
---|---|---|
Head and Tail | df.head(n) , df.tail(n) | Display first/last n rows. |
Shape of DataFrame | df.shape | Get the dimensions (rows, columns) of DataFrame. |
Info Summary | df.info() | Overview of DataFrame (non-null count, dtypes). |
Column Names | df.columns | List all column names. |
Data Types | df.dtypes | Check data types of all columns. |
Describe Data | df.describe() | Summary statistics for numerical columns. |
Unique Values | df['col'].unique() | Get unique values in a column. |
Value Counts | df['col'].value_counts() | Count unique values in a column. |
3. Selecting and Indexing
Operation | Command | Description |
---|---|---|
Select Column(s) | df['col'] , df[['col1', 'col2']] | Select one or multiple columns. |
Select Row by Label | df.loc['label'] | Select row by index label. |
Select Row by Position | df.iloc[position] | Select row by integer position. |
Slice Rows | df[5:10] | Slice rows from index 5 to 10. |
Conditional Selection | df[df['col'] > value] | Filter rows based on condition. |
Set Index | df.set_index('col', inplace=True) | Set column as index. |
Reset Index | df.reset_index(inplace=True) | Reset index to default. |
Multi-Indexing | pd.MultiIndex.from_arrays([list1, list2]) | Create hierarchical indexing. |
4. Data Cleaning
Operation | Command | Description |
---|---|---|
Rename Columns | df.rename(columns={'old': 'new'}) | Rename columns in DataFrame. |
Drop Columns | df.drop(['col1', 'col2'], axis=1) | Drop specified columns. |
Drop Rows | df.drop([0, 1], axis=0) | Drop specified rows. |
Drop Missing Values | df.dropna() | Drop rows with NaN values. |
Fill Missing Values | df.fillna(value) | Fill NaN values with a specified value. |
Replace Values | df.replace({old: new}) | Replace values throughout DataFrame. |
Remove Duplicates | df.drop_duplicates() | Remove duplicate rows. |
Change Data Type | df['col'] = df['col'].astype(type) | Change data type of a column. |
Handle Outliers | df[df['col'] < threshold] | Filter outliers based on threshold. |
5. Data Transformation
Operation | Command | Description |
---|---|---|
Apply Function to Column | df['col'].apply(func) | Apply function to each element in column. |
Lambda Function | df['col'].apply(lambda x: x + 2) | Apply lambda function to a column. |
Map Values | df['col'].map({old: new}) | Map values in column to new values. |
Replace with Conditions | df['col'] = df['col'].where(cond, new_val) | Conditionally replace values. |
Binning Data | pd.cut(df['col'], bins) | Bin data into intervals. |
Standardize/Normalize | (df['col'] - df['col'].mean()) / df['col'].std() | Standardize column. |
6. Aggregation and Grouping
Operation | Command | Description |
---|---|---|
Sum/Mean/Max/Min | df['col'].sum() , df['col'].mean() | Basic aggregation functions on columns. |
GroupBy | df.groupby('col').sum() | Group by column and apply aggregation. |
Multiple Aggregations | df.groupby('col').agg({'col1': 'sum', 'col2': 'mean'}) | Multiple aggregations on grouped data. |
Pivot Table | df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean') | Create pivot table. |
Cumulative Sum | df['col'].cumsum() | Calculate cumulative sum. |
Rolling Calculations | df['col'].rolling(window=3).mean() | Apply rolling calculations (moving average, etc.). |
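Since window functions appear in this tutorial's topic list but only briefly here, a minimal rolling-window sketch:
import pandas as pd
s = pd.Series([10, 20, 30, 40, 50])
# 3-point moving average; the first two positions lack a full window, so they are NaN
print(s.rolling(window=3).mean())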
7. Merging, Joining, Concatenating
Operation | Command | Description |
---|---|---|
Concatenate DataFrames | pd.concat([df1, df2]) | Concatenate DataFrames along rows or columns. |
Merge DataFrames | pd.merge(df1, df2, on='key') | Merge DataFrames based on key. |
Left/Right/Inner/Outer Join | pd.merge(df1, df2, how='left', on='key') | Perform different types of joins. |
Join on Index | df1.join(df2, on='key') | Join using index as the key. |
Append DataFrame | pd.concat([df1, df2]) | Append rows of df2 to df1 (DataFrame.append() was removed in pandas 2.0). |
8. Reshaping Data
Operation | Command | Description |
---|---|---|
Pivot | df.pivot(index='col1', columns='col2', values='col3') | Reshape data (long to wide). |
Pivot Table | df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean') | Reshape with aggregation. |
Melt | df.melt(id_vars=['col1'], value_vars=['col2']) | Unpivot from wide to long format. |
Transpose | df.T | Transpose rows and columns. |
Stack | df.stack() | Stack columns to row MultiIndex. |
Unstack | df.unstack() | Unstack row index to columns. |
9. Working with Dates
Operation | Command | Description |
---|---|---|
Convert to Datetime | pd.to_datetime(df['col']) | Convert column to datetime format. |
Extract Year/Month/Day | df['col'].dt.year , df['col'].dt.month | Extract parts of date. |
Date Range | pd.date_range(start, end, freq='D') | Generate a range of dates. |
Time Difference | (df['end'] - df['start']).dt.days | Calculate time difference in days. |
Resample | df.resample('M').mean() | Resample time series to a specific frequency. |
10. String Operations
Operation | Command | Description |
---|---|---|
Convert to Lowercase | df['col'].str.lower() | Convert strings to lowercase. |
Contains Substring | df['col'].str.contains('substring') | Check if strings contain a substring. |
Replace Substring | df['col'].str.replace('old', 'new') | Replace substring in strings. |
Split Strings | df['col'].str.split(' ') | Split strings on a delimiter. |
Length of Strings | df['col'].str.len() | Get length of each string in a column. |
11. Useful Plotting
Operation | Command | Description |
---|---|---|
Line Plot | df.plot.line() | Plot data as a line chart. |
Bar Plot | df.plot.bar() | Plot data as a bar chart. |
Histogram | df['col'].plot.hist() | Plot data as a histogram. |
Box Plot | df.plot.box() | Plot data as a box plot. |
Scatter Plot | df.plot.scatter(x='col1', y='col2') | Scatter plot between two columns. |
12. Exporting Data
Operation | Command | Description |
---|---|---|
To CSV | df.to_csv('file.csv', index=False) | Export DataFrame to CSV file. |
To Excel | df.to_excel('file.xlsx', index=False) | Export DataFrame to Excel file. |
To JSON | df.to_json('file.json') | Export DataFrame to JSON file. |
To SQL | df.to_sql('table_name', con=connection) | Export DataFrame to SQL database. |
This cheat sheet covers the essentials of pandas for data manipulation, analysis, and exporting results. With this reference, you can quickly review and apply the most common pandas operations.