This tutorial covers a wide range of pandas operations and advanced concepts with examples that are practical and useful in real-world scenarios. The key topics include:
- Creating DataFrames, Series from various sources.
- Checking and changing data types.
- Looping through DataFrames efficiently.
- Handling missing data.
- Advanced data manipulation techniques like filtering, sorting, grouping, and merging.
- Window functions.
- Error handling and control mechanisms.
In pandas, there are several core data structures designed for handling different types of data, enabling efficient and flexible data manipulation. These data structures include:
- Series
- DataFrame
- Index
- MultiIndex
- Panel (Deprecated)
Each of these structures serves different purposes, from storing one-dimensional data to complex, multi-dimensional datasets.
1. Series
A Series is a one-dimensional array with labels, capable of holding any data type (integers, strings, floats, Python objects, etc.). It can be thought of as a column in a DataFrame or a labeled list. Each item in a Series has a unique index.
Characteristics of Series:
- One-dimensional.
- Typically stores homogeneous data (a single dtype), though an object-dtype Series can hold mixed types.
- Has an index for each element, which can be customized.
Creating a Series:
import pandas as pd
# Creating a Series from a list
data = [10, 20, 30, 40]
s = pd.Series(data, index=['a', 'b', 'c', 'd'])
print(s)
Output:
a 10
b 20
c 30
d 40
dtype: int64
2. DataFrame
A DataFrame is a two-dimensional, tabular data structure with labeled axes (rows and columns). It is the most commonly used pandas structure for handling datasets, where each column is a Series, and columns can have different data types.
Characteristics of DataFrame:
- Two-dimensional (rows and columns).
- Stores heterogeneous data types (each column can be of a different type).
- Has labeled axes (row index and column labels).
- Flexible and highly customizable.
Creating a DataFrame:
# Creating a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [24, 27, 22],
'Salary': [70000, 80000, 60000]
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Salary
0 Alice 24 70000
1 Bob 27 80000
2 Charlie 22 60000
3. Index
An Index is an immutable data structure that labels the rows and columns of a pandas Series or DataFrame. It is a key part of pandas’ design, allowing for efficient data alignment and retrieval.
Characteristics of Index:
- It is immutable (cannot be changed after creation).
- Allows for efficient label-based data retrieval.
- Supports custom and multi-level indexing.
Creating an Index:
# Creating a custom Index for a DataFrame
index = pd.Index(['a', 'b', 'c'])
data = {'Value': [10, 20, 30]}
df = pd.DataFrame(data, index=index)
print(df)
Output:
Value
a 10
b 20
c 30
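To illustrate the immutability noted above, here is a minimal sketch: assigning to a position of an Index raises a TypeError.
import pandas as pd
index = pd.Index(['a', 'b', 'c'])
try:
    index[0] = 'z'  # Index objects do not support item assignment
except TypeError as e:
    print(e)  # e.g. "Index does not support mutable operations"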
4. MultiIndex
A MultiIndex is an advanced index structure that allows for multiple levels or hierarchical indexing. This enables you to create a DataFrame with more complex indexing, which is useful when working with data that has multiple dimensions (e.g., time series data with both date and time).
Characteristics of MultiIndex:
- Supports hierarchical (multi-level) indexing.
- Facilitates working with higher-dimensional data in a 2D DataFrame.
- Provides powerful data slicing and subsetting capabilities.
Creating a MultiIndex DataFrame:
arrays = [
['A', 'A', 'B', 'B'],
['one', 'two', 'one', 'two']
]
index = pd.MultiIndex.from_arrays(arrays, names=('Upper', 'Lower'))
data = {'Value': [1, 2, 3, 4]}
df = pd.DataFrame(data, index=index)
print(df)
Output:
Value
Upper Lower
A one 1
two 2
B one 3
two 4
5. Panel (Deprecated)
Panel was a three-dimensional data structure in pandas that allowed for storing 3D data, making it possible to have multiple DataFrames within one structure. However, it was deprecated in pandas 0.25.0 and removed in pandas 1.0.0; the community now recommends using MultiIndex DataFrames or xarray for multidimensional data.
Alternative:
- For handling three or more dimensions, use a MultiIndex DataFrame or the xarray library, which is specifically designed for multidimensional arrays (a brief sketch follows below).
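As a brief sketch of the replacement pattern (the quarter/store data below is hypothetical), two DataFrames can be stacked into a single MultiIndex DataFrame with pd.concat():
import pandas as pd
# Two hypothetical "items" that would once have been Panel slices
q1 = pd.DataFrame({'Sales': [100, 200]}, index=['StoreA', 'StoreB'])
q2 = pd.DataFrame({'Sales': [110, 210]}, index=['StoreA', 'StoreB'])
# Stack them into one DataFrame with a two-level (hierarchical) row index
panel_like = pd.concat({'Q1': q1, 'Q2': q2}, names=['Quarter', 'Store'])
print(panel_like)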
Summary Table
Data Structure | Dimensions | Description | Usage |
---|---|---|---|
Series | 1D | One-dimensional array with labeled indices. | Storing single columns or lists with labels. |
DataFrame | 2D | Two-dimensional, table-like structure with labeled rows and columns. | Primary data structure for handling tabular data. |
Index | 1D | Immutable labels for rows and columns in Series and DataFrames. | Used for indexing and aligning data. |
MultiIndex | 2D (multi-level) | Hierarchical indexing allowing multiple levels of row/column labels. | Organizing complex data, such as time series with multiple dimensions. |
Panel | 3D (Deprecated) | Three-dimensional data structure (removed from pandas). | Use MultiIndex DataFrames or xarray instead for multidimensional data. |
Use Cases for Each Data Structure
- Series:
- Ideal for handling individual columns in isolation.
- Useful when working with single-column data or performing calculations on one dimension.
- DataFrame:
- Perfect for most data analysis tasks with two-dimensional data.
- Used for data cleaning, manipulation, transformation, and visualization.
- Index:
- Helps to uniquely identify each row or column in Series and DataFrames.
- Useful for efficient data selection and alignment.
- MultiIndex:
- Essential for handling multi-dimensional data within a 2D structure.
- Allows for advanced slicing and subsetting in complex datasets.
Creating DataFrames from various sources
Pandas provides various ways to create DataFrames from different data sources. In this section, we’ll go over the most common and practical methods for creating DataFrames using data from:
- Python data structures (lists, dictionaries, tuples, arrays)
- CSV files
- Excel files
- JSON files
- SQL databases
- Web data (HTML tables)
- Other sources (e.g., clipboard, API)
1. Creating DataFrames from Python Data Structures
1.1. From a Dictionary
A dictionary in Python is a very common way to create a DataFrame, where keys represent column names and values represent data.
import pandas as pd
# Example 1: Dictionary of lists
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
print(df)
fruits = pd.DataFrame({'Apple': [30], 'Banana': [21]})
Why doesn't fruits = pd.DataFrame({'Apple': 30, 'Banana': 21}) work?
The reason fruits = pd.DataFrame({'Apple': 30, 'Banana': 21}) doesn't work as expected is due to how pandas' DataFrame constructor handles dictionary inputs.
Short answer:
When passing a dictionary to pd.DataFrame(), pandas expects the dictionary values to be lists or arrays, not scalars.
Long answer:
When you pass a dictionary to pd.DataFrame(), pandas treats each dictionary key as a column name and each corresponding value as a column of data. Pandas expects the column data to be:
- Lists: [value1, value2, ..., valueN]
- Arrays: np.array([value1, value2, ..., valueN])
- Series: pd.Series([value1, value2, ..., valueN])
However, in your example, {'Apple': 30, 'Banana': 21}, the dictionary values are scalars (integers), not lists or arrays. Pandas doesn't know how to handle scalar values as column data.
What happens when you pass scalars:
When you pass scalars, pandas will attempt to create a DataFrame with a single row, but it will raise a ValueError because it expects an index (row label) for that single row.
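A quick demonstration of that error (the exact message may vary slightly between pandas versions):
import pandas as pd
try:
    fruits = pd.DataFrame({'Apple': 30, 'Banana': 21})
except ValueError as e:
    print(e)  # "If using all scalar values, you must pass an index"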
Why {'Apple': [30], 'Banana': [21]} works:
By wrapping the scalar values in lists ([30] and [21]), you're providing pandas with the expected input format. Pandas creates a DataFrame with a single row and two columns, where each column contains a single value.
Alternative solution:
If you prefer not to wrap scalars in lists, you can use the pd.DataFrame() constructor with the index parameter to specify the row label:
fruits = pd.DataFrame({'Apple': 30, 'Banana': 21}, index=[0])
This creates a DataFrame with a single row labeled 0.
Best practice:
For consistency and clarity, it’s recommended to always pass lists or arrays as dictionary values when creating DataFrames:
fruits = pd.DataFrame({'Apple': [30], 'Banana': [21]})
1.2. From a List of Lists
You can create a DataFrame by passing a list of lists and specifying column names.
# Example 2: List of lists
data = [
['Alice', 25, 50000],
['Bob', 30, 60000],
['Charlie', 35, 70000]
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)
1.3. From a List of Dictionaries
Each dictionary in the list represents a row, where keys are column names.
# Example 3: List of dictionaries
data = [
{'Name': 'Alice', 'Age': 25, 'Salary': 50000},
{'Name': 'Bob', 'Age': 30, 'Salary': 60000},
{'Name': 'Charlie', 'Age': 35, 'Salary': 70000}
]
df = pd.DataFrame(data)
print(df)
1.4. From Tuples
A DataFrame can also be created from a list of tuples.
# Example 4: List of tuples
data = [
('Alice', 25, 50000),
('Bob', 30, 60000),
('Charlie', 35, 70000)
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)
data = [('Alice', 25, 50000), ('Bob', 30, 60000), ('Charlie', 35, 70000)]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
So how is a tuple represented in pandas in the above example?
data = [
('Alice', 25, 50000),
('Bob', 30, 60000),
('Charlie', 35, 70000)
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)
Each tuple (('Alice', 25, 50000), etc.) represents a row in the DataFrame. The tuple elements are not treated as scalar column data in this context; instead, they are the values of one row.
Pandas treats each tuple as a single row, where:
- The first element of the tuple ('Alice') corresponds to the 'Name' column.
- The second element (25) corresponds to the 'Age' column.
- The third element (50000) corresponds to the 'Salary' column.
By passing a list of tuples, you’re providing Pandas with a clear indication of the row structure.
Under the hood, pandas iterates over the outer list and uses each tuple's elements, in positional order, as the values of one row, aligned with the supplied column names.
Equivalent representation:
You can achieve the same result using lists instead of tuples:
data = [
['Alice', 25, 50000],
['Bob', 30, 60000],
['Charlie', 35, 70000]
]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Salary'])
print(df)
Or, using dictionaries:
data = [
{'Name': 'Alice', 'Age': 25, 'Salary': 50000},
{'Name': 'Bob', 'Age': 30, 'Salary': 60000},
{'Name': 'Charlie', 'Age': 35, 'Salary': 70000}
]
df = pd.DataFrame(data)
print(df)
All three representations (tuples, lists, and dictionaries) convey the same information to Pandas: a collection of rows with corresponding column values.
1.5. From a NumPy Array
You can create a DataFrame using a NumPy array.
import numpy as np
# Example 5: NumPy array
data = np.array([[25, 50000], [30, 60000], [35, 70000]])
df = pd.DataFrame(data, columns=['Age', 'Salary'])
print(df)
2. Creating DataFrames from Files
2.1. From a CSV File
The most common way to create a DataFrame is by loading data from a CSV file.
# Example 6: CSV file
df = pd.read_csv('path/to/file.csv')
print(df.head())
- Additional options (combined in the sketch below):
  - Use sep=";" for custom delimiters.
  - Use index_col=0 to set a specific column as the index.
2.2. From an Excel File
You can read data from Excel files using pd.read_excel().
# Example 7: Excel file
df = pd.read_excel('path/to/file.xlsx', sheet_name='Sheet1')
print(df.head())
2.3. From a JSON File
Data can also be read from JSON files.
# Example 8: JSON file
df = pd.read_json('path/to/file.json')
print(df.head())
3. Creating DataFrames from SQL Databases
3.1. From SQL Databases
You can fetch data from a SQL database using pd.read_sql() or pd.read_sql_query().
import sqlite3
# Example 9: SQL database (SQLite)
conn = sqlite3.connect('my_database.db')
df = pd.read_sql('SELECT * FROM my_table', conn)
print(df.head())
- You can also connect to databases like MySQL, PostgreSQL, etc., using appropriate connectors like mysql-connector or psycopg2 (see the sketch below).
4. Creating DataFrames from Web Data
4.1. From HTML Tables
You can directly extract tables from a webpage using pd.read_html(). This returns a list of DataFrames, one for each table found.
# Example 10: Web data (HTML tables)
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)'
dfs = pd.read_html(url)
print(dfs[0].head()) # Display the first table
5. Creating DataFrames from the Clipboard
5.1. From Clipboard
You can copy data (e.g., from a CSV or Excel table) to your clipboard and read it into a DataFrame using pd.read_clipboard().
# Example 11: Clipboard
df = pd.read_clipboard()
print(df)
6. Creating DataFrames Programmatically
6.1. Empty DataFrame
An empty DataFrame can be created by simply calling the DataFrame constructor.
# Example 12: Empty DataFrame
df = pd.DataFrame()
print(df)
6.2. With Initial Columns
You can create an empty DataFrame with predefined columns.
# Example 13: DataFrame with predefined columns
df = pd.DataFrame(columns=['Name', 'Age', 'Salary'])
print(df)
6.3. Appending Rows to a DataFrame
You can programmatically add rows to a DataFrame. Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat() is the recommended approach.
# Example 14: Appending rows with pd.concat()
df = pd.DataFrame(columns=['Name', 'Age', 'Salary'])
new_rows = pd.DataFrame([
    {'Name': 'Alice', 'Age': 25, 'Salary': 50000},
    {'Name': 'Bob', 'Age': 30, 'Salary': 60000}
])
df = pd.concat([df, new_rows], ignore_index=True)
print(df)
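As an alternative, here is a minimal sketch for adding single rows in place via .loc, assuming a default RangeIndex starting at 0:
import pandas as pd
# Appending rows one at a time with .loc (assumes a default RangeIndex)
df = pd.DataFrame(columns=['Name', 'Age', 'Salary'])
df.loc[len(df)] = ['Alice', 25, 50000]
df.loc[len(df)] = ['Bob', 30, 60000]
print(df)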
7. Creating DataFrames from APIs
7.1. From an API Response (JSON)
You can create a DataFrame from JSON responses from APIs.
import requests
# Example 15: API Response (JSON)
url = 'https://api.example.com/data'
response = requests.get(url)
data = response.json()
# Convert JSON data to DataFrame
df = pd.DataFrame(data)
print(df.head())
8. Creating DataFrames with MultiIndex
8.1. MultiIndex DataFrame
You can create a DataFrame with hierarchical indexing (MultiIndex).
# Example 16: MultiIndex
arrays = [['North', 'North', 'South', 'South'], ['City1', 'City2', 'City3', 'City4']]
index = pd.MultiIndex.from_arrays(arrays, names=('Region', 'City'))
data = {'Population': [100000, 200000, 150000, 130000], 'GDP': [300, 400, 500, 600]}
df = pd.DataFrame(data, index=index)
print(df)
There are many ways to create pandas DataFrames from various sources and data structures, including:
- Python data structures: lists, dictionaries, tuples, arrays.
- Files: CSV, Excel, JSON.
- SQL databases: SQLite, MySQL, PostgreSQL.
- Web data: HTML tables.
- Clipboard: Copy-pasting directly into a DataFrame.
- APIs: Fetching JSON data from web APIs
Viewing, accessing and checking DataFrames in Pandas
In pandas, viewing, accessing, checking data types, and generating descriptive statistics are fundamental steps when working with DataFrames. These operations help in understanding the structure, content, and type of the data before performing any data analysis or manipulation.
Let’s explore these concepts with detailed explanations, functions, and examples.
1. Viewing Data in pandas DataFrames
Before performing any transformations or analysis, it’s important to view the data to understand its structure and content.
1.1. Viewing the First or Last Rows
- df.head(n): Returns the first n rows (default: 5 rows).
- df.tail(n): Returns the last n rows (default: 5 rows).
import pandas as pd
# Example DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
'Age': [25, 30, 35, 40, 28],
'Salary': [50000, 60000, 70000, 80000, 55000]}
df = pd.DataFrame(data)
# View the first 3 rows
print(df.head(3))
# View the last 2 rows
print(df.tail(2))
1.2. Viewing Specific Columns
You can select and view specific columns of the DataFrame.
- df['column_name']: Access a single column.
- df[['col1', 'col2']]: Access multiple columns.
# View the 'Name' column
print(df['Name'])
# View 'Name' and 'Salary' columns
print(df[['Name', 'Salary']])
1.3. Viewing DataFrame Shape
- df.shape: Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
# Get the shape of the DataFrame
print(df.shape) # Output: (5, 3) -> 5 rows, 3 columns
1.4. Viewing DataFrame Columns
- df.columns: Returns an Index object containing the column names.
# View column names
print(df.columns)
1.5. Viewing Data Types of Columns
- df.dtypes: Returns the data types of each column in the DataFrame.
# View data types of each column
print(df.dtypes)
2. Accessing Data in pandas DataFrames
There are various ways to access specific rows and columns in pandas.
2.1. Accessing Rows by Index
You can access rows using iloc[] (integer-location) or loc[] (label-location).
- df.iloc[]: Access rows by their integer position.
- df.loc[]: Access rows and columns by label.
# Access the first row (index 0)
print(df.iloc[0])
# Access row where index is 2
print(df.loc[2])
2.2. Accessing Rows and Columns
- df.loc[row, column]: Access a specific element using labels.
- df.iloc[row, column]: Access a specific element using integer positions.
# Access the 'Salary' of the second row
print(df.loc[1, 'Salary'])
# Access the 'Name' of the third row
print(df.iloc[2, 0])
2.3. Accessing Rows Based on Conditions
You can filter rows based on conditions (boolean indexing).
# Access rows where Age > 30
filtered_df = df[df['Age'] > 30]
print(filtered_df)
3. Checking and Changing Data Types of Columns
Understanding the data types of columns is crucial when performing data analysis, as different operations are valid for different data types.
3.1. Checking Data Types
- df.dtypes: Returns the data type of each column.
# Check the data types of all columns
print(df.dtypes)
3.2. Changing Data Types
You can change the data type of columns using astype().
# Convert 'Age' column to float
df['Age'] = df['Age'].astype(float)
print(df.dtypes)
# Convert 'Salary' column to string
df['Salary'] = df['Salary'].astype(str)
print(df.dtypes)
3.3. Converting to Datetime
You can convert columns to datetime using pd.to_datetime(). The example below assumes the DataFrame has a 'Hire_Date' column containing date strings.
# Convert the 'Hire_Date' column to datetime format
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])
print(df.dtypes)
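Since the running example DataFrame has no date column, here is a small self-contained sketch with hypothetical hire dates:
import pandas as pd
# Hypothetical hire dates as strings
df_dates = pd.DataFrame({'Hire_Date': ['2021-01-15', '2022-06-01', '2023-03-20']})
df_dates['Hire_Date'] = pd.to_datetime(df_dates['Hire_Date'])
print(df_dates.dtypes)  # Hire_Date becomes datetime64[ns]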
4. Descriptive Statistics in pandas
Descriptive statistics allow you to quickly summarize the central tendency, dispersion, and shape of a dataset’s distribution.
4.1. Generating Descriptive Statistics
- df.describe(): Provides summary statistics (count, mean, std, min, max, etc.) for numerical columns.
- df.describe(include='all'): Includes both numerical and categorical columns in the summary.
# Summary statistics for numerical columns
print(df.describe())
# Summary statistics including categorical columns
print(df.describe(include='all'))
4.2. Summary for Specific Columns
You can generate descriptive statistics for specific columns.
# Summary statistics for 'Salary' column
print(df['Salary'].describe())
4.3. Counting Unique Values
- df['column'].nunique(): Counts the number of unique values in a column.
- df['column'].unique(): Lists the unique values in a column.
# Count unique values in the 'Age' column
print(df['Age'].nunique())
# Get unique values in the 'Name' column
print(df['Name'].unique())
4.4. Counting Frequency of Values
- df['column'].value_counts(): Returns a Series containing counts of unique values.
# Get frequency counts of unique values in 'Age' column
print(df['Age'].value_counts())
4.5. Correlation Between Columns
- df.corr(): Computes pairwise correlation of numerical columns. In pandas 2.0+, pass numeric_only=True when the DataFrame also contains non-numeric columns.
# Compute correlation matrix for numeric columns
print(df.corr(numeric_only=True))
5. Functions for Displaying Data
In pandas, there are several functions for displaying data in a structured and readable format.
5.1. df.head() and df.tail()
These functions display the first or last n rows of the DataFrame.
# Display first 5 rows
print(df.head())
# Display last 3 rows
print(df.tail(3))
5.2. df.sample()
Randomly sample rows from the DataFrame.
# Randomly sample 3 rows
print(df.sample(3))
5.3. df.to_string()
Converts the DataFrame to a string, useful for displaying the entire DataFrame in environments that truncate the output.
# Display entire DataFrame as a string
print(df.to_string())
5.4. Viewing Large DataFrames
- pd.set_option(): Configure pandas options to display more rows or columns in large DataFrames.
# Set pandas option to display more rows
pd.set_option('display.max_rows', 100)
# Set pandas option to display more columns
pd.set_option('display.max_columns', 20)
# Display the DataFrame with updated options
print(df)
5.5. Transposing Data
- df.T: Transposes the DataFrame (swaps rows and columns).
# Transpose the DataFrame
print(df.T)
Summary
Here’s a quick summary of key operations for viewing, accessing, checking data types, and descriptive statistics in pandas DataFrames:
Viewing and Accessing Data:
- head(), tail(): Viewing first/last rows.
- shape: DataFrame dimensions.
- columns, dtypes: Viewing columns and data types.
- iloc[], loc[]: Accessing rows and columns.
- Boolean indexing: Filtering data based on conditions.
Checking and Changing Data Types:
- dtypes: Check column data types.
- astype(): Change column data types.
- to_datetime(): Convert to datetime format.
Descriptive Statistics:
- describe(): Summary statistics.
- nunique(), unique(), value_counts(): Working with unique values and frequency counts.
- corr(): Correlation between numerical columns.
These operations help in quickly exploring, inspecting, and understanding data before moving on to deeper analysis or transformations.
Access, Filter, and Retrieve specific rows and columns from a pandas DataFrame: Selection in pandas
Selection in pandas refers to the various ways in which you can access, filter, and retrieve specific rows and columns from a pandas DataFrame or Series. pandas offers a variety of methods for selecting data, which include basic indexing, label-based indexing, conditional selection, and more advanced techniques.
Let’s cover all the important selection techniques with explanations and examples.
1. Selecting Columns in pandas
1.1. Selecting a Single Column
You can select a single column from a DataFrame by using df['column_name'] or df.column_name.
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
# Selecting a single column (two methods)
print(df['Name']) # Method 1
print(df.Name) # Method 2
1.2. Selecting Multiple Columns
To select multiple columns, you pass a list of column names to df[].
# Select 'Name' and 'Salary' columns
print(df[['Name', 'Salary']])
2. Selecting Rows in pandas
2.1. Selecting Rows by Index Position with iloc[]
iloc[] is used to select rows and columns by their integer position.
- df.iloc[row_index, column_index]: Access a specific row and column by position.
# Select the first row (index 0)
print(df.iloc[0])
# Select the first two rows
print(df.iloc[:2])
# Select the value from the second row and 'Salary' column
print(df.iloc[1, 2])
2.2. Selecting Rows by Label with loc[]
loc[] is used to select rows and columns by labels (index labels or column names).
- df.loc[row_label, column_label]: Access a specific row and column by label.
# Select the row with index 1 (Bob)
print(df.loc[1])
# Select the 'Salary' of Bob (index 1)
print(df.loc[1, 'Salary'])
# Select the 'Name' and 'Salary' of the row where index is 2
print(df.loc[2, ['Name', 'Salary']])
3. Conditional Selection (Boolean Indexing)
You can use boolean conditions to filter rows based on column values.
3.1. Filtering Rows Based on Conditions
# Select rows where Age is greater than 30
print(df[df['Age'] > 30])
# Select rows where Salary is less than or equal to 60000
print(df[df['Salary'] <= 60000])
3.2. Combining Multiple Conditions
You can combine multiple conditions using the & (AND) or | (OR) operators, and you must enclose each condition in parentheses.
# Select rows where Age > 30 and Salary > 60000
print(df[(df['Age'] > 30) & (df['Salary'] > 60000)])
# Select rows where Age < 40 or Salary < 60000
print(df[(df['Age'] < 40) | (df['Salary'] < 60000)])
3.3. Filtering Rows Based on String Conditions
You can filter rows based on string values using str.contains(). Matching is case-sensitive by default; pass case=False for case-insensitive matching.
# Select rows where 'Name' contains the letter 'a'
print(df[df['Name'].str.contains('a')])
4. Selecting by Index
You can set an index and then select data based on the index.
4.1. Setting a New Index
# Set 'Name' as the new index
df.set_index('Name', inplace=True)
# Now select row by 'Name' (which is the new index)
print(df.loc['Alice'])
4.2. Resetting the Index
You can reset the index back to the default (integer-based) index using reset_index().
# Reset the index back to default
df.reset_index(inplace=True)
print(df)
5. Selecting Data with at[] and iat[]
5.1. Using at[] for Scalar Access by Label
at[] is used to access a single value by label (similar to loc[] but optimized for single-value access).
# Get the 'Salary' of the row with index 1 (Bob)
print(df.at[1, 'Salary'])
5.2. Using iat[] for Scalar Access by Position
iat[] is used to access a single value by position (similar to iloc[] but optimized for single-value access).
# Get the 'Salary' of the second row (index 1)
print(df.iat[1, 2])
6. Selecting Rows and Columns by Data Type
6.1. Selecting Columns by Data Type
You can select columns based on their data type using select_dtypes().
# Select only numeric columns
numeric_df = df.select_dtypes(include='number')
print(numeric_df)
# Select only object (string) columns
string_df = df.select_dtypes(include='object')
print(string_df)
7. Selecting Specific Data Based on a List of Labels
You can filter rows or columns based on a list of labels.
7.1. Selecting Rows Based on a List of Indexes
# Select rows with index 0 and 2
print(df.loc[[0, 2]])
7.2. Selecting Columns Based on a List of Column Names
# Select 'Name' and 'Age' columns
print(df[['Name', 'Age']])
8. Slicing Rows and Columns
You can slice rows and columns using iloc[] and loc[] to select a range of rows or columns.
8.1. Slicing Rows with iloc[]
# Select the first 3 rows
print(df.iloc[:3])
# Select rows from index 1 to 3 (exclusive)
print(df.iloc[1:3])
8.2. Slicing Columns with iloc[]
# Select all rows and the first two columns
print(df.iloc[:, :2])
# Select rows 0-2 and columns 1-2
print(df.iloc[0:3, 1:3])
8.3. Slicing Rows and Columns with loc[]
# Select rows 0 to 2 and columns 'Name' and 'Age'
print(df.loc[0:2, ['Name', 'Age']])
9. Selecting Data by Index Ranges
You can select data based on index ranges using loc[]. Unlike iloc[], loc[] slices include both endpoints.
# Select rows from index 1 to 3 (inclusive)
print(df.loc[1:3])
10. Advanced Selections with query()
The query() method allows you to select rows based on complex conditions using SQL-like syntax.
# Select rows where Age > 30 and Salary < 80000
filtered_df = df.query('Age > 30 and Salary < 80000')
print(filtered_df)
11. Masking Data
You can use mask() to replace data that meets a condition with a substitute value (NaN by default).
# Mask Salary values greater than 60000 with NaN
df_masked = df.mask(df['Salary'] > 60000)
print(df_masked)
12. Selecting Missing Data (NaN values)
You can select rows where specific columns have missing data (NaN values).
# Select rows where 'Salary' has NaN values
missing_salary = df[df['Salary'].isna()]
print(missing_salary)
In pandas, selection can be done in various ways depending on what you want to access:
- Columns: Select one or more columns using df['col'] or df[['col1', 'col2']].
- Rows: Select rows using iloc[] (by position) or loc[] (by label).
- Conditional selection: Filter rows based on conditions using boolean indexing (df[df['col'] > value]).
- Indexing: Use set_index(), loc[], and reset_index() for selection based on index labels.
- Slicing: Slice rows and columns with iloc[] or loc[].
- Advanced queries: Use query() for SQL-like row filtering.
These selection techniques are fundamental in data wrangling and allow you to manipulate and analyze your data efficiently.
Merging, joining, concatenating, comparing, Sorting and Advanced Filtering DataFrames
In pandas, merging, joining, concatenating, and comparing DataFrames are common tasks used to combine or compare datasets. Each operation is tailored for specific use cases, and understanding the difference between them is important for efficient data manipulation.
Let’s explore each operation, covering both basic and advanced use cases, along with detailed examples.
1. Merging DataFrames (pd.merge())
merge() is used for combining two DataFrames based on one or more common keys. It's similar to SQL joins (INNER, LEFT, RIGHT, FULL OUTER).
1.1. Inner Join (Default)
An inner join returns only the rows where the values in the joining columns match in both DataFrames.
import pandas as pd
# Sample DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Salary': [70000, 80000, 90000, 100000]
})
# Merge with inner join (default)
df_inner = pd.merge(df1, df2, on='ID')
print(df_inner)
Output:
 ID Name Salary
0 3 Charlie 70000
1 4 David 80000
1.2. Left Join
A left join returns all rows from the left DataFrame and only matching rows from the right DataFrame. Non-matching rows from the right DataFrame are filled with NaN.
# Left join
df_left = pd.merge(df1, df2, on='ID', how='left')
print(df_left)
Output:
 ID Name Salary
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 70000.0
3 4 David 80000.0
1.3. Right Join
A right join returns all rows from the right DataFrame and only matching rows from the left DataFrame.
# Right join
df_right = pd.merge(df1, df2, on='ID', how='right')
print(df_right)
Output:
 ID Name Salary
0 3 Charlie 70000.0
1 4 David 80000.0
2 5 NaN 90000.0
3 6 NaN 100000.0
1.4. Full Outer Join
A full outer join returns all rows when there is a match in either the left or right DataFrame. Missing values are filled with NaN
.
# Full outer join
df_outer = pd.merge(df1, df2, on='ID', how='outer')
print(df_outer)
Output:
 ID Name Salary
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 70000.0
3 4 David 80000.0
4 5 NaN 90000.0
5 6 NaN 100000.0
1.5. Merging on Multiple Columns
You can merge DataFrames on multiple columns by passing a list to the on parameter.
# Sample DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Dept': ['HR', 'Finance', 'IT', 'HR']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Dept': ['IT', 'HR', 'Finance', 'IT'],
'Salary': [70000, 80000, 90000, 100000]
})
# Merge on multiple columns
df_multi = pd.merge(df1, df2, on=['ID', 'Dept'], how='inner')
print(df_multi)
2. Joining DataFrames (df.join())
The join() method is similar to merge(), but it combines DataFrames based on their index rather than a column key, which is convenient when two DataFrames share meaningful index labels.
2.1. Left Join by Default
By default, join() performs a left join using the index of the DataFrame.
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Salary': [50000, 60000, 70000]
}, index=[1, 2, 3])
df2 = pd.DataFrame({
'Dept': ['HR', 'Finance', 'IT']
}, index=[1, 2, 4])
# Left join using join()
df_joined = df1.join(df2)
print(df_joined)
Output:
 Name Salary Dept
1 Alice 50000 HR
2 Bob 60000 Finance
3 Charlie 70000 NaN
2.2. Specifying Different Types of Joins in join()
You can specify the type of join using the how parameter (e.g., 'left', 'right', 'inner', 'outer').
# Outer join using join()
df_outer_join = df1.join(df2, how='outer')
print(df_outer_join)
3. Concatenating DataFrames (pd.concat())
The concat() function is used to concatenate DataFrames along a particular axis, either vertically (by rows) or horizontally (by columns).
3.1. Concatenating Vertically (Default)
By default, concat() stacks DataFrames vertically (row-wise).
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Salary': [70000, 80000]})
# Concatenate vertically (row-wise)
df_concat = pd.concat([df1, df2])
print(df_concat)
Output:
 Name Salary
0 Alice 50000
1 Bob 60000
0 Charlie 70000
1 David 80000
3.2. Concatenating Horizontally (Column-wise)
You can concatenate DataFrames horizontally (by columns) using axis=1.
# Concatenate horizontally (column-wise)
df_concat_cols = pd.concat([df1, df2], axis=1)
print(df_concat_cols)
Output:
 Name Salary Name Salary
0 Alice 50000 Charlie 70000
1 Bob 60000 David 80000
3.3. Resetting Index After Concatenation
The index is preserved when concatenating, but you can reset it using ignore_index=True.
# Concatenate vertically and reset index
df_concat_reset = pd.concat([df1, df2], ignore_index=True)
print(df_concat_reset)
Output:
 Name Salary
0 Alice 50000
1 Bob 60000
2 Charlie 70000
3 David 80000
4. Comparing DataFrames (pd.DataFrame.compare())
compare() is used to compare two DataFrames and highlight the differences between them.
4.1. Comparing Two DataFrames
# Sample DataFrames
df1 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Salary': [50000, 60000, 70000]
})
df2 = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Salary': [50000, 65000, 70000]
})
# Compare the two DataFrames
df_compare = df1.compare(df2)
print(df_compare)
Output:
 Salary
self other
1 60000 65000
4.2. Including All Columns in the Comparison
You can keep equal values visible in the output (instead of showing them as NaN) using the keep_equal=True parameter; add keep_shape=True to also retain rows and columns without differences.
# Compare and keep equal values visible
df_compare_all = df1.compare(df2, keep_equal=True)
print(df_compare_all)
5. Sorting in pandas
You can sort a DataFrame by one or more columns in ascending or descending order using sort_values().
5.1. Sorting by a Single Column
- sort_values(by='column_name'): Sort the DataFrame by a single column in ascending order (default).
# Sample DataFrame
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000]
})
# Sort by 'Age' in ascending order
df_sorted = df.sort_values(by='Age')
print(df_sorted)
Output:
 Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
3 David 40 80000
5.2. Sorting by Multiple Columns
You can sort by multiple columns, specifying whether each column should be sorted in ascending or descending order.
# Sort by 'Age' in ascending order and 'Salary' in descending order
df_sorted_multi = df.sort_values(by=['Age', 'Salary'], ascending=[True, False])
print(df_sorted_multi)
Output (identical to the single-column sort here, because all Age values are unique, so the Salary tie-breaker never applies):
Name Age Salary
0 Alice 25 50000
1 Bob 30 60000
2 Charlie 35 70000
3 David 40 80000
5.3. Sorting by Index
- sort_index(): Sort the DataFrame by its index.
# Sort by index
df_sorted_index = df.sort_index()
print(df_sorted_index)
6. Advanced Filtering in pandas
Advanced filtering allows you to retrieve specific subsets of data based on complex conditions, such as combining multiple conditions, filtering by partial string matches, and more.
6.1. Filtering Rows Based on Multiple Conditions
You can filter DataFrames based on multiple conditions by combining them with & (AND) or | (OR). Each condition must be enclosed in parentheses.
# Filter rows where Age > 30 and Salary > 60000
df_filtered = df[(df['Age'] > 30) & (df['Salary'] > 60000)]
print(df_filtered)
Output:
Name Age Salary
2 Charlie 35 70000
3 David 40 80000
6.2. Filtering Rows Based on String Conditions
You can filter rows based on string conditions, such as checking whether a column contains a certain substring using str.contains().
# Filter rows where 'Name' contains the letter 'a' (case-insensitive, so 'Alice' matches too)
df_filtered_str = df[df['Name'].str.contains('a', case=False)]
print(df_filtered_str)
Output:
Name Age Salary
0 Alice 25 50000
2 Charlie 35 70000
3 David 40 80000
6.3. Filtering Rows Based on Ranges
You can filter rows where column values fall within a certain range using between().
# Filter rows where 'Salary' is between 55000 and 75000
df_filtered_range = df[df['Salary'].between(55000, 75000)]
print(df_filtered_range)
Output:
 Name Age Salary
1 Bob 30 60000
2 Charlie 35 70000
6.4. Using query() for Advanced Filtering
query() allows you to use SQL-like queries to filter your DataFrame based on complex conditions.
# Use query to filter rows where Age > 30 and Salary < 80000
df_filtered_query = df.query('Age > 30 and Salary < 80000')
print(df_filtered_query)
Output:
 Name Age Salary
2 Charlie 35 70000
Combining Sorting, Filtering, and Merging
Now, let’s see how sorting and filtering can be combined with merge, join, concatenation, and comparison operations.
6.5. Filtering and Sorting with Merge
After merging two DataFrames, you can filter and sort the result.
# Sample DataFrames for merging
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Salary': [70000, 80000, 90000, 100000]
})
# Merge DataFrames (inner join)
df_merged = pd.merge(df1, df2, on='ID', how='inner')
# Filter rows where Salary > 75000 and sort by 'Salary'
df_merged_filtered_sorted = df_merged[df_merged['Salary'] > 75000].sort_values(by='Salary', ascending=False)
print(df_merged_filtered_sorted)
Output:
 ID Name Salary
1 4 David 80000
6.6. Filtering After Concatenation
You can filter rows after concatenating multiple DataFrames.
# Sample DataFrames for concatenation
df1 = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Salary': [50000, 60000]})
df2 = pd.DataFrame({'Name': ['Charlie', 'David'], 'Salary': [70000, 80000]})
# Concatenate DataFrames
df_concat = pd.concat([df1, df2])
# Filter rows where 'Salary' > 60000
df_concat_filtered = df_concat[df_concat['Salary'] > 60000]
print(df_concat_filtered)
Output:
 Name Salary
2 Charlie 70000
3 David 80000
Full Example: Sorting, Filtering, Merging, and Concatenation
Here’s a complete example that demonstrates how to use sorting, filtering, and merging in a real-world scenario:
# Sample DataFrames
df1 = pd.DataFrame({
'ID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
})
df2 = pd.DataFrame({
'ID': [3, 4, 5, 6],
'Dept': ['HR', 'IT', 'Finance', 'Marketing'],
'Salary': [70000, 80000, 90000, 100000]
})
# Merge the DataFrames
df_merged = pd.merge(df1, df2, on='ID', how='outer')
# Filter rows where Salary > 60000 and Dept is 'HR' or 'IT'
df_filtered = df_merged[(df_merged['Salary'] > 60000) & (df_merged['Dept'].isin(['HR', 'IT']))]
# Sort the result by 'Salary' in descending order
df_sorted = df_filtered.sort_values(by='Salary', ascending=False)
print(df_sorted)
Output:
 ID Name Dept Salary
3 4 David IT 80000
2 3 Charlie HR 70000
- Sorting (sort_values()): Sort by one or more columns, or by index using sort_index().
- Advanced Filtering: Filter rows based on complex conditions using boolean indexing, str.contains(), between(), and query().
. - Combining Sorting and Filtering with Merge/Join: You can merge/join DataFrames and then apply sorting and filtering to the combined dataset.
- Concatenation: After concatenation, apply filtering to refine the data.
These operations allow you to efficiently manipulate data using pandas, combining data from multiple sources and selecting, sorting, and filtering relevant information.
Pivoting, melting, and transposing operations in pandas
In this section, we will dive deeper into the pivoting, melting, and transposing operations in pandas. These are essential for reshaping data in DataFrames, enabling us to structure data for analysis, reporting, or visualization purposes.
1. Pivoting in pandas
Pivoting refers to reshaping data by turning unique values from one column into multiple columns (wide format). It allows transforming long-form data (where a single row represents a data point) into wide-form data (where multiple columns represent different aspects of the same data point).
1.1. pivot() Function
The pivot() function reshapes data from long to wide format by specifying an index, columns, and values.
Syntax:
df.pivot(index='row_label', columns='column_label', values='value_column')
- index: The column(s) to use as the new DataFrame's index.
- columns: The column(s) to spread across the new DataFrame as headers.
- values: The values to fill the DataFrame with, based on the pivot.
1.2. Example of Pivoting
import pandas as pd
# Sample data in long format
df = pd.DataFrame({
'Name': ['Alice', 'Alice', 'Bob', 'Bob'],
'Year': [2020, 2021, 2020, 2021],
'Sales': [1000, 1100, 2000, 2100]
})
# Pivot the DataFrame to show Sales for each Year as columns
pivot_df = df.pivot(index='Name', columns='Year', values='Sales')
print(pivot_df)
Output:
Year 2020 2021
Name
Alice 1000 1100
Bob 2000 2100
1.3. Use Cases for Pivoting
- Summarizing Data: Pivoting is useful when you want to summarize data across multiple categories and present the data in a more readable, wide format.
- Visualizing Data: It is often used before creating plots or charts, especially when visualizing time-series data for multiple entities.
- Data Aggregation: Pivoting is helpful when you want to compare multiple measures across different dimensions.
1.4. Handling Duplicates in Pivoting
If there are duplicate entries for the same index-column combination, pivot() will raise an error, as demonstrated in the sketch below. To aggregate duplicates instead, use pivot_table().
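A minimal sketch of the duplicate problem (hypothetical data): Alice has two 2020 entries, so pivot() cannot decide which value belongs in the (Alice, 2020) cell.
import pandas as pd
df_dup = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob'],
    'Year': [2020, 2020, 2021],
    'Sales': [1000, 1200, 2000]
})
try:
    df_dup.pivot(index='Name', columns='Year', values='Sales')
except ValueError as e:
    print(e)  # e.g. "Index contains duplicate entries, cannot reshape"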
2. Pivot Table in pandas
pivot_table() works similarly to pivot(), but it allows aggregation when there are duplicate values. It can be seen as a more flexible version of pivot().
2.1. Syntax for pivot_table()
df.pivot_table(index='row_label', columns='column_label', values='value_column', aggfunc='mean')
- aggfunc: The aggregation function to apply (e.g., mean, sum, count, max). Default is mean.
2.2. Example of pivot_table()
df = pd.DataFrame({
'Name': ['Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
'Year': [2020, 2021, 2020, 2021, 2020],
'Sales': [1000, 1100, 2000, 2100, 1800]
})
# Pivot table with sum aggregation for duplicate entries
pivot_table_df = df.pivot_table(index='Name', columns='Year', values='Sales', aggfunc='sum')
print(pivot_table_df)
Output:
Year 2020 2021
Name
Alice 1000 1100
Bob 3800 2100
2.3. Use Cases for pivot_table()
- Handling Duplicates: It is useful when you want to handle duplicate entries by applying an aggregation function (e.g., summing sales across multiple entries).
- Data Summarization: When performing group-by like operations and you want to summarize data into a matrix.
- Flexible Aggregation: You can specify various aggregation functions to summarize data (e.g., sum, average, count).
3. Melting in pandas
Melting is the opposite of pivoting. It reshapes wide-form data into long-form data by “unpivoting” columns back into rows. This is useful when you want to bring multiple columns into a single column, often for plotting or further analysis.
3.1. melt() Function
The melt() function reshapes a DataFrame by transforming columns into rows, returning long-form data.
Syntax:
df.melt(id_vars=['fixed_column'], value_vars=['columns_to_unpivot'], var_name='variable_name', value_name='value_name')
- id_vars: Columns that should remain fixed (not melted).
- value_vars: Columns to unpivot (convert from wide to long).
- var_name: The name of the new "variable" column.
- value_name: The name of the new "value" column.
3.2. Example of Melting
df = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'2020': [1000, 2000],
'2021': [1100, 2100]
})
# Melt the DataFrame from wide to long format
melted_df = df.melt(id_vars=['Name'], value_vars=['2020', '2021'], var_name='Year', value_name='Sales')
print(melted_df)
Output:
 Name Year Sales
0 Alice 2020 1000
1 Bob 2020 2000
2 Alice 2021 1100
3 Bob 2021 2100
3.3. Use Cases for Melting
- Preparing Data for Plotting: Many plotting libraries prefer data in long format, where each row is a single data point.
- Unpivoting Wide Data: When you have data spread across columns and you want to analyze or process it row-wise.
- Handling Multiple Time Periods: When dealing with time series data, you often have columns for each year or month. Melting can help consolidate these columns into a single “period” column.
4. Transposing in pandas
Transposing is the process of swapping rows and columns in a DataFrame, where the index becomes the column headers and vice versa.
4.1. transpose() or T
The transpose() function or the T attribute is used to interchange rows and columns.
Syntax:
df.transpose() # or simply df.T
4.2. Example of Transposing
df = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Age': [25, 30],
'Salary': [50000, 60000]
})
# Transpose the DataFrame
transposed_df = df.transpose()
print(transposed_df)
Output:
 0 1
Name Alice Bob
Age 25 30
Salary 50000 60000
4.3. Use Cases for Transposing
- Data Presentation: Sometimes transposing data can make it easier to present or visualize. For example, showing each record’s details in columns rather than rows.
- Analyzing Wide Data: If your data has more columns than rows, transposing can help make the dataset easier to inspect.
- Pivot-Like Operations: In scenarios where you want to transform row-wise data into column-wise data quickly.
Summary of Pivoting, Melting, and Transposing
Operation | Function | Description | Use Case |
---|---|---|---|
Pivoting | pivot() | Reshapes data from long to wide format by turning unique values from a column into new column headers. | Summarizing data, comparing across categories, preparing for visualization. |
Pivot Table | pivot_table() | Similar to pivot() , but allows aggregation for duplicate values and provides more flexibility. | Summarizing data, aggregating data with functions (sum, mean, etc.). |
Melting | melt() | Unpivots data from wide to long format, turning columns into rows. | Preparing data for plotting, handling time series, unpivoting wide-form data. |
Transposing | transpose() | Swaps rows and columns, turning the DataFrame’s index into columns and vice versa. | Analyzing data from different perspectives, data presentation, pivot-like operations. |
Conclusion
- Pivoting reshapes long data into wide data, which is useful for summarization and comparison across categories.
- Pivot tables extend pivoting by allowing aggregation when duplicates are present.
- Melting converts wide data into long data, preparing it for time-series analysis or plotting.
- Transposing swaps rows and columns to change the perspective of your data, helping with analysis or presentation.
These reshaping operations are fundamental to data wrangling and make pandas a versatile library for handling complex data structures.
Doubts / Questions / Use Cases
====================================================
====================================================
How do you find the third highest salaried employee in the Sales department in Python?
To find the third highest salaried employee in the Sales department in Python using pandas, you can follow these steps:
Steps:
- Filter the employees who belong to the Sales department.
- Sort the employees by their salary in descending order.
- Extract the third employee based on the salary ranking.
Example Code:
import pandas as pd
# Sample data for employees
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
'Department': ['Sales', 'Sales', 'Marketing', 'Sales', 'HR', 'Sales'],
'Salary': [50000, 60000, 70000, 80000, 90000, 55000]
}
# Create a DataFrame
df = pd.DataFrame(data)
# Step 1: Filter employees in the Sales department
sales_df = df[df['Department'] == 'Sales']
# Step 2: Sort employees by Salary in descending order
sorted_sales_df = sales_df.sort_values(by='Salary', ascending=False)
# Step 3: Find the third highest salaried employee
third_highest_salary = sorted_sales_df.iloc[2] # Index 2 corresponds to the third entry
# Output the result
print("Third highest salaried employee in Sales department:")
print(third_highest_salary)
Explanation:
- Filter: The DataFrame is filtered to include only rows where the Department is Sales.
- Sort: The filtered DataFrame is sorted by the Salary column in descending order, so the highest salaries appear first.
- Extract: The third entry in the sorted DataFrame is selected using iloc[2] (0-based indexing).
Sample Output:
Third highest salaried employee in Sales department:
Name Frank
Department Sales
Salary 55000
Name: 5, dtype: object
Alternate Approach (If There Are Duplicate Salaries):
If salaries are not unique, and you want to find the third distinct highest salary, you can do the following:
# Step 1: Filter employees in the Sales department
sales_df = df[df['Department'] == 'Sales']
# Step 2: Get unique salaries, sort them in descending order
unique_salaries = sales_df['Salary'].drop_duplicates().sort_values(ascending=False)
# Step 3: Get the third highest salary
third_highest_salary_value = unique_salaries.iloc[2] # Third highest salary value
# Step 4: Find the employee(s) with this salary
third_highest_employee = sales_df[sales_df['Salary'] == third_highest_salary_value]
# Output the result
print("Third highest salaried employee(s) in Sales department:")
print(third_highest_employee)
This approach ensures you get the third distinct highest salary, in case there are multiple employees with the same salary value.
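Another concise option is DataFrame.nlargest(); the sketch below reuses the sample data from above. Note that nlargest() counts duplicate salaries separately (keep='first' by default), so for the third distinct salary the drop_duplicates() approach remains the safer choice.
import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
    'Department': ['Sales', 'Sales', 'Marketing', 'Sales', 'HR', 'Sales'],
    'Salary': [50000, 60000, 70000, 80000, 90000, 55000]
})
# Top 3 Sales salaries; the last of them is the third highest
top3 = df[df['Department'] == 'Sales'].nlargest(3, 'Salary')
print(top3.iloc[-1])  # Frank, 55000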
Can we use something like limiting the DataFrame output?
Yes, in pandas, you can limit the number of rows returned from a DataFrame by using head(), tail(), or iloc[]. This is useful when you want to fetch only a limited number of results without needing to manually slice the DataFrame.
How to Limit DataFrame Output in pandas
- head(n): Returns the first n rows of the DataFrame.
- iloc[:n]: Selects the first n rows by integer-location based indexing.
- iloc[] with sort_values(): Limits rows after sorting.
Example: Limiting DataFrame Output After Sorting
Here’s how you can find the third highest salaried employee in the Sales department and limit the DataFrame output to only a few rows:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank'],
'Department': ['Sales', 'Sales', 'Marketing', 'Sales', 'HR', 'Sales'],
'Salary': [50000, 60000, 70000, 80000, 90000, 55000]
}
df = pd.DataFrame(data)
# Step 1: Filter employees in the Sales department
sales_df = df[df['Department'] == 'Sales']
# Step 2: Sort employees by Salary in descending order
sorted_sales_df = sales_df.sort_values(by='Salary', ascending=False)
# Step 3: Limit the result to the first 3 rows using head()
limited_sales_df = sorted_sales_df.head(3)
# Output the result (third highest employee)
print("Top 3 salaried employees in Sales department:")
print(limited_sales_df)
Explanation:
- head(3): Limits the DataFrame to the first 3 rows after sorting by salary.
- You can then use iloc[2] to get the third-highest salary from the limited result.
Sample Output:
Top 3 salaried employees in Sales department:
Name Department Salary
3 David Sales 80000
1 Bob Sales 60000
5 Frank Sales 55000
Alternate Limiting Using iloc[] for Exact Indexing:
You can directly use iloc[] to limit and access the rows you want:
# Fetch only the third row (third highest salaried employee) after sorting
third_highest_salary = sorted_sales_df.iloc[2]
print("Third highest salaried employee in Sales department:")
print(third_highest_salary)
Sample Output:
Third highest salaried employee in Sales department:
Name Frank
Department Sales
Salary 55000
Name: 5, dtype: object
Use Case of Limiting the DataFrame:
Limiting is useful when:
- You want to display the top N results (e.g., top 5 highest salaries).
- You are dealing with large DataFrames and only need to inspect a small portion of the data.
- You want to optimize performance when working with large datasets by avoiding fetching more rows than needed.
Saving a pandas DataFrame to disk
To save a pandas DataFrame to disk, you can use several built-in methods to export data in various formats, such as CSV, Excel, JSON, and SQL. Here are the most common options:
1. Saving DataFrame as CSV
The to_csv() function exports the DataFrame to a CSV (Comma-Separated Values) file.
Syntax:
df.to_csv('file_name.csv', index=False)
- index=False: Excludes row indices in the saved file.
- sep=',': (Optional) Specify the delimiter (default is a comma).
Example:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
# Save DataFrame as CSV
df.to_csv('output.csv', index=False)
2. Saving DataFrame as Excel
The to_excel() function exports the DataFrame to an Excel (.xlsx) file.
Syntax:
df.to_excel('file_name.xlsx', sheet_name='Sheet1', index=False)
- sheet_name='Sheet1': (Optional) Specify the sheet name.
- index=False: Excludes row indices in the saved file.
Example:
# Save DataFrame as Excel file
df.to_excel('output.xlsx', sheet_name='Data', index=False)
Note: The to_excel() function requires openpyxl for .xlsx files, so make sure it's installed (pip install openpyxl).
3. Saving DataFrame as JSON
The to_json() function exports the DataFrame to a JSON (JavaScript Object Notation) file, useful for web applications.
Syntax:
df.to_json('file_name.json', orient='records')
- orient='records': Each row is exported as a separate JSON object.
Example:
# Save DataFrame as JSON file
df.to_json('output.json', orient='records')
4. Saving DataFrame to SQL Database
The to_sql() function exports the DataFrame to a SQL table in a relational database.
Syntax:
df.to_sql('table_name', con=connection, if_exists='replace', index=False)
- con: The SQLAlchemy engine or database connection object.
- if_exists='replace': Replaces the table if it already exists; use 'append' to add data to the existing table.
- index=False: Excludes row indices in the table.
Example:
from sqlalchemy import create_engine
# Create a database connection (SQLite in this example)
engine = create_engine('sqlite:///my_database.db')
# Save DataFrame to SQL table
df.to_sql('people', con=engine, if_exists='replace', index=False)
5. Saving DataFrame as Parquet
The to_parquet() function exports the DataFrame to a Parquet file, which is a popular columnar storage format, especially in big data environments.
Syntax:
df.to_parquet('file_name.parquet')
Example:
# Save DataFrame as Parquet file
df.to_parquet('output.parquet')
Note: Saving as Parquet requires the pyarrow or fastparquet library (pip install pyarrow or pip install fastparquet).
6. Saving DataFrame as HTML
The to_html() function saves the DataFrame as an HTML table, useful for displaying data on web pages.
Syntax:
df.to_html('file_name.html', index=False)
Example:
# Save DataFrame as HTML file
df.to_html('output.html', index=False)
Summary
Format | Method | Example Syntax |
---|---|---|
CSV | to_csv() | df.to_csv('file.csv', index=False) |
Excel | to_excel() | df.to_excel('file.xlsx', index=False) |
JSON | to_json() | df.to_json('file.json', orient='records') |
SQL | to_sql() | df.to_sql('table', con=engine, index=False) |
Parquet | to_parquet() | df.to_parquet('file.parquet') |
HTML | to_html() | df.to_html('file.html', index=False) |
Each method provides a straightforward way to export pandas DataFrames in various formats depending on your needs.
Pandas Cheat Sheet
Here's a pandas cheat sheet to quickly revise the most important concepts, commands, and functions. It's structured by category for easy navigation.
1. Basics and Setup
Operation | Command | Description |
---|---|---|
Import pandas | import pandas as pd | Import pandas library. |
Read CSV | pd.read_csv('file.csv') | Read CSV file into DataFrame. |
Read Excel | pd.read_excel('file.xlsx') | Read Excel file into DataFrame. |
Save to CSV | df.to_csv('file.csv') | Save DataFrame to CSV. |
Save to Excel | df.to_excel('file.xlsx') | Save DataFrame to Excel. |
2. Data Inspection
Operation | Command | Description |
---|---|---|
Head and Tail | df.head(n) , df.tail(n) | Display first/last n rows. |
Shape of DataFrame | df.shape | Get the dimensions (rows, columns) of DataFrame. |
Info Summary | df.info() | Overview of DataFrame (non-null count, dtypes). |
Column Names | df.columns | List all column names. |
Data Types | df.dtypes | Check data types of all columns. |
Describe Data | df.describe() | Summary statistics for numerical columns. |
Unique Values | df['col'].unique() | Get unique values in a column. |
Value Counts | df['col'].value_counts() | Count unique values in a column. |
3. Selecting and Indexing
Operation | Command | Description |
---|---|---|
Select Column(s) | df['col'] , df[['col1', 'col2']] | Select one or multiple columns. |
Select Row by Label | df.loc['label'] | Select row by index label. |
Select Row by Position | df.iloc[position] | Select row by integer position. |
Slice Rows | df[5:10] | Slice rows from index 5 to 10. |
Conditional Selection | df[df['col'] > value] | Filter rows based on condition. |
Set Index | df.set_index('col', inplace=True) | Set column as index. |
Reset Index | df.reset_index(inplace=True) | Reset index to default. |
Multi-Indexing | pd.MultiIndex.from_arrays([list1, list2]) | Create hierarchical indexing. |
4. Data Cleaning
Operation | Command | Description |
---|---|---|
Rename Columns | df.rename(columns={'old': 'new'}) | Rename columns in DataFrame. |
Drop Columns | df.drop(['col1', 'col2'], axis=1) | Drop specified columns. |
Drop Rows | df.drop([0, 1], axis=0) | Drop specified rows. |
Drop Missing Values | df.dropna() | Drop rows with NaN values. |
Fill Missing Values | df.fillna(value) | Fill NaN values with a specified value. |
Replace Values | df.replace({old: new}) | Replace values throughout DataFrame. |
Remove Duplicates | df.drop_duplicates() | Remove duplicate rows. |
Change Data Type | df['col'] = df['col'].astype(type) | Change data type of a column. |
Handle Outliers | df[df['col'] < threshold] | Filter outliers based on threshold. |
5. Data Transformation
Operation | Command | Description |
---|---|---|
Apply Function to Column | df['col'].apply(func) | Apply function to each element in column. |
Lambda Function | df['col'].apply(lambda x: x + 2) | Apply lambda function to a column. |
Map Values | df['col'].map({old: new}) | Map values in column to new values. |
Replace with Conditions | df['col'] = df['col'].where(cond, new_val) | Conditionally replace values. |
Binning Data | pd.cut(df['col'], bins) | Bin data into intervals. |
Standardize/Normalize | (df['col'] - df['col'].mean()) / df['col'].std() | Standardize column. |
6. Aggregation and Grouping
Operation | Command | Description |
---|---|---|
Sum/Mean/Max/Min | df['col'].sum() , df['col'].mean() | Basic aggregation functions on columns. |
GroupBy | df.groupby('col').sum() | Group by column and apply aggregation. |
Multiple Aggregations | df.groupby('col').agg({'col1': 'sum', 'col2': 'mean'}) | Multiple aggregations on grouped data. |
Pivot Table | df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean') | Create pivot table. |
Cumulative Sum | df['col'].cumsum() | Calculate cumulative sum. |
Rolling Calculations | df['col'].rolling(window=3).mean() | Apply rolling calculations (moving average, etc.). |
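Since window functions appear in this tutorial's topic list but only briefly here, a minimal rolling-window sketch:
import pandas as pd
s = pd.Series([10, 20, 30, 40, 50])
# 3-point moving average; the first two positions lack a full window, so they are NaN
print(s.rolling(window=3).mean())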
7. Merging, Joining, Concatenating
Operation | Command | Description |
---|---|---|
Concatenate DataFrames | pd.concat([df1, df2]) | Concatenate DataFrames along rows or columns. |
Merge DataFrames | pd.merge(df1, df2, on='key') | Merge DataFrames based on key. |
Left/Right/Inner/Outer Join | pd.merge(df1, df2, how='left', on='key') | Perform different types of joins. |
Join on Index | df1.join(df2, on='key') | Join using index as the key. |
Append DataFrame | pd.concat([df1, df2]) | Append rows of df2 to df1 (DataFrame.append() was removed in pandas 2.0). |
8. Reshaping Data
Operation | Command | Description |
---|---|---|
Pivot | df.pivot(index='col1', columns='col2', values='col3') | Reshape data (long to wide). |
Pivot Table | df.pivot_table(index='col1', columns='col2', values='col3', aggfunc='mean') | Reshape with aggregation. |
Melt | df.melt(id_vars=['col1'], value_vars=['col2']) | Unpivot from wide to long format. |
Transpose | df.T | Transpose rows and columns. |
Stack | df.stack() | Stack columns to row MultiIndex. |
Unstack | df.unstack() | Unstack row index to columns. |
9. Working with Dates
Operation | Command | Description |
---|---|---|
Convert to Datetime | pd.to_datetime(df['col']) | Convert column to datetime format. |
Extract Year/Month/Day | df['col'].dt.year , df['col'].dt.month | Extract parts of date. |
Date Range | pd.date_range(start, end, freq='D') | Generate a range of dates. |
Time Difference | (df['end'] - df['start']).dt.days | Calculate time difference in days. |
Resample | df.resample('M').mean() | Resample time series to a specific frequency. |
10. String Operations
Operation | Command | Description |
---|---|---|
Convert to Lowercase | df['col'].str.lower() | Convert strings to lowercase. |
Contains Substring | df['col'].str.contains('substring') | Check if strings contain a substring. |
Replace Substring | df['col'].str.replace('old', 'new') | Replace substring in strings. |
Split Strings | df['col'].str.split(' ') | Split strings on a delimiter. |
Length of Strings | df['col'].str.len() | Get length of each string in a column. |
11. Useful Plotting
Operation | Command | Description |
---|---|---|
Line Plot | df.plot.line() | Plot data as a line chart. |
Bar Plot | df.plot.bar() | Plot data as a bar chart. |
Histogram | df['col'].plot.hist() | Plot data as a histogram. |
Box Plot | df.plot.box() | Plot data as a box plot. |
Scatter Plot | df.plot.scatter(x='col1', y='col2') | Scatter plot between two columns. |
12. Exporting Data
Operation | Command | Description |
---|---|---|
To CSV | df.to_csv('file.csv', index=False) | Export DataFrame to CSV file. |
To Excel | df.to_excel('file.xlsx', index=False) | Export DataFrame to Excel file. |
To JSON | df.to_json('file.json') | Export DataFrame to JSON file. |
To SQL | df.to_sql('table_name', con=connection) | Export DataFrame to SQL database. |
This cheat sheet covers the essentials of pandas for data manipulation, analysis, and exporting results. With this reference, you can quickly review and apply the most common pandas operations.