Comprehensive Guide to Python Pandas

Slide 1: Introduction to Python Pandas

Pandas is a data manipulation library for Python. Its two core data structures, the one-dimensional Series and the two-dimensional DataFrame, are designed for working with labeled, tabular data. Let's start by importing pandas and creating a simple DataFrame.

import pandas as pd

# Create a simple DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'London', 'Paris']}
df = pd.DataFrame(data)

print(df)

Output:

      Name  Age      City
0    Alice   25  New York
1      Bob   30    London
2  Charlie   35     Paris
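
Before going further, it often helps to check a DataFrame's shape and column types. The following is a minimal sketch that rebuilds the same three-row frame so it runs on its own:

```python
import pandas as pd

# Same sample data as above, rebuilt so this snippet is self-contained
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels in order
print(df.dtypes)         # one dtype per column
```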

Slide 2: Reading Data with Pandas

Pandas can read data from various file formats. Let's read a CSV file and display its contents.

# Read a CSV file
df = pd.read_csv('sample_data.csv')

# Display the first few rows
print(df.head())

# Display basic information about the DataFrame
# (info() prints its report itself and returns None, so no print() is needed)
df.info()

Output:

   ID     Name  Age      City
0   1    Alice   25  New York
1   2      Bob   30    London
2   3  Charlie   35     Paris

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   ID      3 non-null      int64 
 1   Name    3 non-null      object
 2   Age     3 non-null      int64 
 3   City    3 non-null      object
dtypes: int64(2), object(2)
memory usage: 224.0+ bytes
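
The example above needs a file named 'sample_data.csv' on disk. As a sketch that runs without one, the snippet below feeds `read_csv` an in-memory string; the CSV text is a hypothetical stand-in for that file, and `index_col` and `usecols` are optional parameters shown only for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'sample_data.csv'
csv_text = """ID,Name,Age,City
1,Alice,25,New York
2,Bob,30,London
3,Charlie,35,Paris
"""

# read_csv accepts any file-like object, not only a path.
# usecols limits the columns that are parsed; index_col turns 'ID' into the index.
df = pd.read_csv(io.StringIO(csv_text), index_col='ID', usecols=['ID', 'Name', 'Age'])

print(df)
```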

Slide 3: Data Selection and Indexing

Pandas offers various ways to select and index data. Let's explore some common methods.

# Select a single column
print(df['Name'])

# Select multiple columns
print(df[['Name', 'Age']])

# Select a row by its index label
print(df.loc[1])

# Select rows by condition
print(df[df['Age'] > 30])

# Select specific rows and columns
print(df.loc[df['Age'] > 30, ['Name', 'City']])

Output:

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Name       Bob
Age         30
City    London
Name: 1, dtype: object

      Name  Age   City
2  Charlie   35  Paris

      Name   City
2  Charlie  Paris
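
With the default integer index, label-based and position-based selection look identical, which can hide the difference between `.loc` and `.iloc`. A small sketch with string index labels (hypothetical, chosen for illustration) makes it visible:

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'London', 'Paris']},
    index=['alice', 'bob', 'charlie'],  # hypothetical string labels
)

by_label = df.loc['bob']             # row whose index label is 'bob'
by_position = df.iloc[1]             # second row, whatever its label is
label_slice = df.loc['alice':'bob']  # label slices include both endpoints
position_slice = df.iloc[0:1]        # position slices exclude the end position

print(label_slice)
print(position_slice)
```

Here `label_slice` has two rows while `position_slice` has one, because only label slices are inclusive of the end.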

Slide 4: Data Cleaning and Preprocessing

Data cleaning is crucial in data analysis. Let's explore some common data cleaning operations.

# Handle missing values
df['Salary'] = [50000, None, 75000]
print(df)

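# Fill missing values with the column mean
# (assigning the result back avoids chained in-place modification, which newer pandas versions warn about)
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())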
print(df)

# Remove duplicates
df = df.drop_duplicates()

# Rename columns
df = df.rename(columns={'Name': 'Full_Name'})
print(df.columns)

Output:

      Name  Age      City   Salary
0    Alice   25  New York  50000.0
1      Bob   30    London      NaN
2  Charlie   35     Paris  75000.0

      Name  Age      City   Salary
0    Alice   25  New York  50000.0
1      Bob   30    London  62500.0
2  Charlie   35     Paris  75000.0

Index(['Full_Name', 'Age', 'City', 'Salary'], dtype='object')
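
Before filling or dropping anything, it is usually worth counting how much data is missing. The sketch below, on a small self-contained frame, shows three common calls; the choice of the median as fill value is only an illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, None, 75000],
})

# Count of missing values in each column
missing_per_column = df.isna().sum()

# Drop only the rows that lack a salary
complete_rows = df.dropna(subset=['Salary'])

# Fill per column with a dict; this returns a new frame and leaves df unchanged
filled = df.fillna({'Salary': df['Salary'].median()})

print(missing_per_column)
print(filled)
```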

Slide 5: Data Transformation

Pandas provides powerful tools for data transformation. Let's explore some common operations.

# Apply a function to a column
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Adult')

# Add the 'Country' column that the next step relies on (sample values for illustration)
df['Country'] = ['USA', 'UK', 'France']

# Create a new column based on multiple existing columns
df['Location'] = df['City'] + ', ' + df['Country']

# Categorical encoding (codes follow the alphabetical order of the categories)
df['City_Code'] = pd.Categorical(df['City']).codes

print(df)

Output:

  Full_Name  Age      City   Salary Age_Group Country       Location  City_Code
0     Alice   25  New York  50000.0     Young     USA  New York, USA          1
1       Bob   30    London  62500.0     Adult      UK     London, UK          0
2   Charlie   35     Paris  75000.0     Adult  France  Paris, France          2
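
Row-wise `apply` with a lambda works, but binning numeric values is also available as a vectorized call. The sketch below uses `pd.cut` to reproduce the Young/Adult split; the upper bound of 120 is an arbitrary assumption for illustration.

```python
import pandas as pd

ages = pd.Series([25, 30, 35], name='Age')

# right=False makes each bin include its left edge: [0, 30) and [30, 120)
age_group = pd.cut(ages, bins=[0, 30, 120], labels=['Young', 'Adult'], right=False)

print(age_group)
```

The result is a categorical Series, so the set of possible labels is stored alongside the values.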

Slide 6: Grouping and Aggregation

Grouping and aggregation are essential for data analysis. Let's explore these operations.

# Group by a column and calculate statistics
grouped = df.groupby('Age_Group')
print(grouped['Salary'].mean())

# Multiple aggregations
agg_result = grouped.agg({
    'Salary': ['mean', 'max'],
    'Age': 'mean'
})
print(agg_result)

# Reset index after grouping
agg_result = agg_result.reset_index()
print(agg_result)

Output:

Age_Group
Adult    68750.0
Young    50000.0
Name: Salary, dtype: float64

           Salary            Age
             mean       max  mean
Age_Group                        
Adult     68750.0  75000.0  32.5
Young     50000.0  50000.0  25.0

  Age_Group   Salary            Age
                mean      max  mean
0     Adult  68750.0  75000.0  32.5
1     Young  50000.0  50000.0  25.0
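
Passing a dict of lists to `agg` produces two-level column labels, as the output above shows. When flat column names are preferable, named aggregation is one alternative. A minimal sketch on a self-contained copy of the data (the output column names are arbitrary choices):

```python
import pandas as pd

df = pd.DataFrame({
    'Age_Group': ['Young', 'Adult', 'Adult'],
    'Age': [25, 30, 35],
    'Salary': [50000.0, 62500.0, 75000.0],
})

# Each keyword is an output column name mapped to (input column, aggregation)
summary = df.groupby('Age_Group', as_index=False).agg(
    mean_salary=('Salary', 'mean'),
    max_salary=('Salary', 'max'),
    mean_age=('Age', 'mean'),
)

print(summary)
```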

Slide 7: Merging and Joining DataFrames

Combining data from multiple sources is a common task. Let's explore merging and joining operations.

# Create two DataFrames
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'City': ['London', 'Paris', 'Berlin']})

# Inner join
inner_join = pd.merge(df1, df2, on='ID', how='inner')
print("Inner Join:")
print(inner_join)

# Left join
left_join = pd.merge(df1, df2, on='ID', how='left')
print("\nLeft Join:")
print(left_join)

# Concatenate DataFrames side by side (axis=1 aligns rows by index position, not by ID)
df3 = pd.concat([df1, df2], axis=1)
print("\nConcatenated:")
print(df3)

Output:

Inner Join:
   ID     Name    City
0   2      Bob  London
1   3  Charlie   Paris

Left Join:
   ID     Name    City
0   1    Alice     NaN
1   2      Bob  London
2   3  Charlie   Paris

Concatenated:
   ID     Name  ID    City
0   1    Alice   2  London
1   2      Bob   3   Paris
2   3  Charlie   4  Berlin
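
An outer join keeps the unmatched rows from both sides, and the `indicator` parameter of `pd.merge` records where each row came from. The sketch below reuses the two frames from this slide; `validate='one_to_one'` is an optional check added here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [2, 3, 4], 'City': ['London', 'Paris', 'Berlin']})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'.
# validate raises an error if the key is duplicated on either side.
outer = pd.merge(df1, df2, on='ID', how='outer', indicator=True, validate='one_to_one')

print(outer)
```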

Slide 8: Time Series Data

Pandas excels at handling time series data. Let's explore some time series operations.

import pandas as pd

# Create a time series DataFrame
dates = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
ts_df = pd.DataFrame({'Date': dates, 'Value': range(len(dates))})
ts_df.set_index('Date', inplace=True)

print(ts_df)

# Resample to weekly frequency
weekly = ts_df.resample('W').mean()
print("\nWeekly average:")
print(weekly)

# Rolling window calculations
rolling_mean = ts_df.rolling(window=3).mean()
print("\nRolling mean (3-day window):")
print(rolling_mean)

Output:

            Value
Date             
2023-01-01      0
2023-01-02      1
2023-01-03      2
2023-01-04      3
2023-01-05      4
2023-01-06      5
2023-01-07      6
2023-01-08      7
2023-01-09      8
2023-01-10      9

Weekly average:
            Value
Date             
2023-01-01    0.0
2023-01-08    4.0
2023-01-15    8.5

Rolling mean (3-day window):
            Value
Date             
2023-01-01    NaN
2023-01-02    NaN
2023-01-03    1.0
2023-01-04    2.0
2023-01-05    3.0
2023-01-06    4.0
2023-01-07    5.0
2023-01-08    6.0
2023-01-09    7.0
2023-01-10    8.0
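
Weekly bins do not necessarily start on the first date in the data. With the 'W' frequency, each bin ends on a Sunday and is labelled by that Sunday, and 2023-01-01 is itself a Sunday. As a sketch, counting the rows per bin alongside the mean shows how the ten days are divided:

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
ts = pd.Series(range(len(dates)), index=dates, name='Value')

# 'W' is shorthand for 'W-SUN': bins end on, and are labelled by, a Sunday
weekly = ts.resample('W').agg(['count', 'mean'])

print(weekly)
```

The first bin holds only January 1, the second holds January 2 through 8, and the third holds the remaining two days.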

Slide 9: Data Visualization with Pandas

Pandas integrates well with matplotlib for quick data visualization. Let's create some basic plots.

import matplotlib.pyplot as plt

# Create a sample DataFrame
data = {'Year': [2018, 2019, 2020, 2021, 2022],
        'Sales': [100, 120, 90, 150, 180]}
df = pd.DataFrame(data)

# Line plot
df.plot(x='Year', y='Sales', kind='line')
plt.title('Sales Trend')
plt.show()

# Bar plot
df.plot(x='Year', y='Sales', kind='bar')
plt.title('Sales by Year')
plt.show()

# Histogram
df['Sales'].plot(kind='hist')
plt.title('Sales Distribution')
plt.show()

[Note: Running this code opens three figures in turn: a line chart of sales by year, a bar chart of sales by year, and a histogram of the sales values.]

Slide 10: Advanced Data Analysis: Pivot Tables

Pivot tables are powerful tools for data analysis. Let's create a pivot table to summarize our data.

# Create a sample DataFrame
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 180]
}
df = pd.DataFrame(data)
df['Date'] = pd.to_datetime(df['Date'])

# Create a pivot table
pivot = df.pivot_table(values='Sales', index='Date', columns='Product', aggfunc='sum')

print("Pivot Table:")
print(pivot)

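# Add row and column totals
# (the dates are converted to strings first so that the 'Total' label joins a uniform string index)
pivot.index = pivot.index.strftime('%Y-%m-%d')
pivot['Total'] = pivot.sum(axis=1)
pivot.loc['Total'] = pivot.sum()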

print("\nPivot Table with Totals:")
print(pivot)

Output:

Pivot Table:
Product       A    B
Date                
2023-01-01  100  150
2023-01-02  120  180

Pivot Table with Totals:
Product       A    B  Total
Date                       
2023-01-01  100  150    250
2023-01-02  120  180    300
Total       220  330    550
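
Totals can also be requested from `pivot_table` directly through its `margins` parameter. The sketch below keeps the dates as plain strings, an assumption made so that the 'Total' label sits in an index of the same type:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],  # kept as strings
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 120, 180],
})

# margins=True appends a totals row and column, named by margins_name
pivot = df.pivot_table(values='Sales', index='Date', columns='Product',
                       aggfunc='sum', margins=True, margins_name='Total')

print(pivot)
```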

Slide 11: Real-life Example: Weather Data Analysis

Let's analyze weather data using Pandas. We'll work with a dataset containing temperature readings from different cities.

import pandas as pd
import matplotlib.pyplot as plt

# Create a sample weather dataset
data = {
    'Date': pd.date_range(start='2023-01-01', end='2023-01-10'),
    'City': ['New York', 'London', 'Tokyo'] * 3 + ['New York'],
    'Temperature': [5, 2, 8, 6, 3, 9, 4, 1, 7, 5]
}
df = pd.DataFrame(data)

# Calculate average temperature by city
avg_temp = df.groupby('City')['Temperature'].mean()
print("Average Temperature by City:")
print(avg_temp)

# Plot temperature trends
plt.figure(figsize=(10, 6))
for city in df['City'].unique():
    city_data = df[df['City'] == city]
    plt.plot(city_data['Date'], city_data['Temperature'], label=city)

plt.title('Temperature Trends by City')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()

Output:

Average Temperature by City:
City
London      2.0
New York    5.0
Tokyo       8.0
Name: Temperature, dtype: float64

[Note: A line plot showing temperature trends for each city would be displayed here.]

Slide 12: Real-life Example: Text Analysis

Let's use Pandas for text analysis on a dataset of book titles.

import pandas as pd

# Create a sample dataset of book titles
data = {
    'Title': [
        'The Great Gatsby',
        'To Kill a Mockingbird',
        '1984',
        'Pride and Prejudice',
        'The Catcher in the Rye'
    ],
    'Author': [
        'F. Scott Fitzgerald',
        'Harper Lee',
        'George Orwell',
        'Jane Austen',
        'J.D. Salinger'
    ],
    'Year': [1925, 1960, 1949, 1813, 1951]
}
df = pd.DataFrame(data)

# Count words in titles
df['Word_Count'] = df['Title'].apply(lambda x: len(x.split()))

# Extract first word of each title
df['First_Word'] = df['Title'].apply(lambda x: x.split()[0])

# Find titles containing 'the' (case-insensitive)
df['Contains_The'] = df['Title'].str.contains('the', case=False)

print(df)

# Analyze word counts
print("\nWord Count Statistics:")
print(df['Word_Count'].describe())

# Most common first words
print("\nMost Common First Words:")
print(df['First_Word'].value_counts())

Output:

                    Title               Author  Year  Word_Count First_Word  Contains_The
0        The Great Gatsby  F. Scott Fitzgerald  1925           3        The          True
1   To Kill a Mockingbird           Harper Lee  1960           4         To         False
2                    1984        George Orwell  1949           1       1984         False
3     Pride and Prejudice          Jane Austen  1813           3      Pride         False
4  The Catcher in the Rye        J.D. Salinger  1951           5        The          True

Word Count Statistics:
count    5.000000
mean     3.200000
std      1.483240
min      1.000000
25%      3.000000
50%      3.000000
75%      4.000000
max      5.000000
Name: Word_Count, dtype: float64

Most Common First Words:
The      2
To       1
1984     1
Pride    1
Name: First_Word, dtype: int64
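
The same three columns can be computed with the vectorized `.str` accessor instead of `apply`. The sketch below also uses a word-boundary pattern, so that a title such as 'Other Voices' (a hypothetical example) would not count as containing 'the':

```python
import pandas as pd

titles = pd.Series([
    'The Great Gatsby',
    'To Kill a Mockingbird',
    '1984',
    'Pride and Prejudice',
    'The Catcher in the Rye',
])

word_count = titles.str.split().str.len()  # number of whitespace-separated words
first_word = titles.str.split().str[0]     # first element of each split list

# \b anchors the match to word boundaries, so only the whole word 'the' matches
has_the_word = titles.str.contains(r'\bthe\b', case=False, regex=True)

print(pd.DataFrame({'Title': titles, 'Word_Count': word_count,
                    'First_Word': first_word, 'Has_The': has_the_word}))
```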

Slide 13: Advanced Pandas: Custom Aggregation Functions

Pandas allows you to define custom aggregation functions for complex data analysis. Let's explore this feature.

import pandas as pd

# Create a sample DataFrame
data = {
    'Category': ['A', 'A', 'B', 'B', 'C'],
    'Value1': [10, 20, 30, 40, 50],
    'Value2': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)

# Define custom aggregation functions
def range_func(x):
    return x.max() - x.min()

def percent_change(x):
    return (x.iloc[-1] - x.iloc[0]) / x.iloc[0] * 100

# Apply custom aggregations
custom_agg = df.groupby('Category').agg({
    'Value1': ['mean', range_func],
    'Value2': ['median', percent_change]
})

print(custom_agg)

Output:

         Value1            Value2               
           mean range_func median percent_change
Category                                        
A          15.0         10    1.5     100.000000
B          35.0         10    3.5      33.333333
C          50.0          0    5.0       0.000000
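
Custom functions can also be combined with named aggregation, which gives each result an explicit column name instead of the function's name. A minimal sketch on the same data; the output names and the helper `value_range` are illustrative choices, not part of the example above:

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B', 'C'],
    'Value1': [10, 20, 30, 40, 50],
    'Value2': [1, 2, 3, 4, 5],
})

def value_range(x):
    """Difference between the largest and smallest value in a group."""
    return x.max() - x.min()

# Each keyword names an output column: (input column, function or lambda)
result = df.groupby('Category').agg(
    value1_range=('Value1', value_range),
    value2_first_to_last=('Value2', lambda s: s.iloc[-1] - s.iloc[0]),
)

print(result)
```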

Slide 14: Working with Large Datasets: Chunking

When a dataset is too large to load at once, processing it in chunks keeps memory use bounded, because only one chunk is held in memory at a time. Let's explore how to use chunking in Pandas.

import pandas as pd

# Read a large CSV file in chunks of 1,000 rows
chunk_size = 1000
chunks = []

for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    # Keep the rows above this chunk's own mean (a per-chunk threshold, not a global one)
    processed_chunk = chunk[chunk['value'] > chunk['value'].mean()]
    chunks.append(processed_chunk)

# Combine all processed chunks
result = pd.concat(chunks, ignore_index=True)

print(f"Total rows after processing: {len(result)}")
print(result.head())

Note: This code assumes the existence of a large CSV file. In practice, you would replace 'large_file.csv' with your actual file path.
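
The loop above compares each row with the mean of its own chunk. If the threshold should instead be the mean of the whole file, one option is to read the file twice: once to accumulate a running sum and count, and once to filter. The sketch below uses a small in-memory CSV as a hypothetical stand-in for 'large_file.csv':

```python
import io
import pandas as pd

# Hypothetical stand-in for 'large_file.csv': one 'value' column holding 1..10
csv_text = 'value\n' + '\n'.join(str(i) for i in range(1, 11)) + '\n'

# Pass 1: accumulate the sum and the row count chunk by chunk
total, count = 0.0, 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk['value'].sum()
    count += len(chunk)
global_mean = total / count

# Pass 2: filter every chunk against the global mean
kept = [chunk[chunk['value'] > global_mean]
        for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4)]
result = pd.concat(kept, ignore_index=True)

print(f"Global mean: {global_mean}")
print(result)
```

Only the running totals and the filtered rows stay in memory, at the cost of reading the file a second time.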

Slide 15: Additional Resources

For further exploration of Pandas, consider these resources:

Pandas Official Documentation: https://pandas.pydata.org/docs/
"Python for Data Analysis" by Wes McKinney (creator of Pandas)
DataCamp's Pandas Tutorials: https://www.datacamp.com/courses/pandas-foundations
Real Python's Pandas Tutorials: https://realpython.com/learning-paths/pandas-data-science/

These resources offer in-depth explanations, practical examples, and advanced techniques to enhance your Pandas skills.
