Introduction to Pandas

These notes are based on the video ML Zoomcamp 1.9 - Introduction to Pandas

Pandas is a Python library for manipulating tabular data. It provides powerful data structures and operations for working with structured data efficiently.

Importing Pandas

import pandas as pd  # Standard convention is to use 'pd' as the alias

DataFrames

The main data structure in pandas is the DataFrame, which represents tabular data (similar to a spreadsheet or SQL table).

Creating DataFrames

There are multiple ways to create a DataFrame:

1. **From a list of lists (rows)**:
```python
data = [
    ['Toyota', 'Corolla', 2015, 130, 4, 'manual', 'sedan', 18000],
    ['Ford', 'Focus', 2018, 140, 4, 'automatic', 'sedan', 23000],
    ['Nissan', 'Sentra', 2020, 149, 4, 'manual', 'sedan', 22000],
    ['Nissan', 'Altima', 2014, None, 6, 'automatic', 'coupe', 16000],
    ['Toyota', 'Camry', 2019, 160, 4, 'automatic', 'sedan', 25000]
]
columns = [
    'make', 'model', 'year', 'engine_hp', 'engine_cylinders',
    'transmission_type', 'vehicle_style', 'msrp'
]

df = pd.DataFrame(data, columns=columns)
```

2. **From a list of dictionaries**:
```python
data_dict = [
    {'make': 'Toyota', 'model': 'Corolla', 'year': 2015, 'msrp': 18000},
    {'make': 'Ford', 'model': 'Focus', 'year': 2018, 'msrp': 23000},
    # ... more dictionaries for each row
]

df = pd.DataFrame(data_dict)
```

Inspecting DataFrames

After creating or loading a DataFrame, it’s common to inspect the first few rows:

df.head()       # Returns first 5 rows by default
df.head(2)      # Returns first 2 rows

Series

Each column in a DataFrame is a Series object. A Series is a one-dimensional labeled array.

Accessing Series (Columns)

There are two ways to access columns:

Dot notation (only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method):
```
df.make          # Returns the 'make' column as a Series
```

Bracket notation (works for all column names):

df['make']       # Returns the 'make' column as a Series
df['vehicle_style']  # Same syntax for any column; brackets are required when a name contains spaces

Accessing Multiple Columns

To get a subset of columns:

df[['make', 'model', 'msrp']]  # Returns a DataFrame with only these columns
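
The difference between single and double brackets is easy to miss, so here is a minimal sketch on a small three-row frame (the data is a cut-down version of the example above). Single brackets give a Series; a list of names gives a DataFrame, even when the list has one entry.

```python
import pandas as pd

df = pd.DataFrame({
    'make': ['Toyota', 'Ford', 'Nissan'],
    'model': ['Corolla', 'Focus', 'Sentra'],
    'msrp': [18000, 23000, 22000],
})

single = df['make']             # single brackets: one column as a Series
subset = df[['make', 'msrp']]   # list of names: a DataFrame with those columns
one_col_frame = df[['make']]    # a one-element list still gives a DataFrame

print(type(single).__name__)         # Series
print(type(subset).__name__)         # DataFrame
print(type(one_col_frame).__name__)  # DataFrame
```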

Adding and Modifying Columns

# Add a new column
df['id'] = [1, 2, 3, 4, 5]

# Modify an existing column
df['id'] = [10, 20, 30, 40, 50]

# Delete a column
del df['id']
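
A short sketch of what happens around column assignment, assuming a three-row frame. The `msrp_thousands` column is a hypothetical name used only for illustration; it shows that a new column can also be computed from an existing one. Assigning a list whose length does not match the number of rows raises a `ValueError`.

```python
import pandas as pd

df = pd.DataFrame({
    'make': ['Toyota', 'Ford', 'Nissan'],
    'msrp': [18000, 23000, 22000],
})

df['id'] = [1, 2, 3]                       # list length matches the number of rows
df['msrp_thousands'] = df['msrp'] / 1000   # hypothetical column derived from 'msrp'

try:
    df['bad'] = [1, 2]                     # too short for a three-row frame
except ValueError as err:
    print('ValueError:', err)

del df['id']                               # remove the column again
print(list(df.columns))
```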

Indexing

DataFrames have an index that labels each row. By default, it’s a sequential range starting from 0.

df.index  # View the current index

Accessing Rows by Index

Using loc - label-based indexing:

df.loc[1]        # Returns row with index 1
df.loc[[1, 2]]   # Returns rows with indices 1 and 2

Using iloc - position-based indexing:

df.iloc[1]       # Returns row at position 1 (second row)
df.iloc[[1, 2]]  # Returns rows at positions 1 and 2
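
With the default index, labels and positions coincide, which hides the difference between `loc` and `iloc`. The sketch below uses a hypothetical index of 10, 20, 30 to make the distinction visible.

```python
import pandas as pd

df = pd.DataFrame(
    {'make': ['Toyota', 'Ford', 'Nissan'], 'year': [2015, 2018, 2020]},
    index=[10, 20, 30],   # hypothetical non-default labels
)

by_label = df.loc[20]      # row whose label is 20
by_position = df.iloc[0]   # first row, whatever its label is

print(by_label['make'])     # Ford
print(by_position['make'])  # Toyota

try:
    df.loc[0]              # no row is labelled 0, so this fails
except KeyError:
    print('label 0 does not exist')
```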

Changing the Index

# Set a custom index
df.index = ['a', 'b', 'c', 'd', 'e']

# Reset the index to default sequential numbers (use one of the two forms)
df = df.reset_index()           # Keeps the old index as a new column named 'index'
df = df.reset_index(drop=True)  # Discards the old index
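
The two forms of `reset_index` are alternatives rather than consecutive steps. A minimal sketch on a three-row frame shows what each one returns; the variable names `kept` and `dropped` are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'make': ['Toyota', 'Ford', 'Nissan']})
df.index = ['a', 'b', 'c']

kept = df.reset_index()              # old labels move into a column named 'index'
dropped = df.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))     # ['index', 'make']
print(list(dropped.columns))  # ['make']
print(list(dropped.index))    # [0, 1, 2]
print(list(df.index))         # ['a', 'b', 'c'] (df itself is unchanged)
```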

Element-wise Operations

Like NumPy arrays, pandas Series support element-wise operations:

# Divide all values in a column by 100
df['engine_hp'] / 100

# Multiply by 2
df['engine_hp'] * 2
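
As a small illustration, assuming a standalone Series with one missing value: arithmetic is applied to every element and missing values stay missing. Comparisons work the same way and return a boolean Series, which is what filtering builds on.

```python
import pandas as pd

# None is stored as NaN, so the Series has a float dtype
hp = pd.Series([130, 140, None, 160], name='engine_hp')

scaled = hp / 100       # each value divided by 100; NaN stays NaN
is_strong = hp >= 140   # element-wise comparison; NaN compares as False

print(scaled.tolist())
print(is_strong.tolist())  # [False, True, False, True]
```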

Filtering

You can filter rows based on conditions:

# Filter cars made after 2015
df[df['year'] > 2015]

# Filter Nissan cars
df[df['make'] == 'Nissan']

# Combine conditions (AND)
df[(df['make'] == 'Nissan') & (df['year'] > 2015)]
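
The condition inside the brackets is itself a boolean Series, one value per row. The sketch below makes that mask explicit and, going slightly beyond the notes, also shows `|` (OR) and `~` (NOT), which follow the same pattern as `&`. Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame({
    'make': ['Toyota', 'Ford', 'Nissan', 'Nissan', 'Toyota'],
    'year': [2015, 2018, 2020, 2014, 2019],
})

mask = df['year'] > 2015   # boolean Series: True where the condition holds
recent = df[mask]          # keeps only the rows where mask is True

nissan_or_recent = df[(df['make'] == 'Nissan') | (df['year'] > 2015)]  # OR
not_nissan = df[~(df['make'] == 'Nissan')]                             # NOT

print(mask.tolist())       # [False, True, True, False, True]
print(len(recent), len(nissan_or_recent), len(not_nissan))  # 3 4 3
```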

String Operations

Pandas provides string methods through the .str accessor:

# Convert to lowercase
df['vehicle_style'].str.lower()

# Replace spaces with underscores
df['vehicle_style'].str.replace(' ', '_')

# Chain string operations
df['vehicle_style'] = df['vehicle_style'].str.replace(' ', '_').str.lower()
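
A sketch of the same chain on a standalone Series. The values are hypothetical (the sample data above has no spaces in `vehicle_style`), and `regex=False` is passed explicitly so the space is treated as a literal character regardless of the pandas version. Missing entries pass through the `.str` methods and remain missing.

```python
import pandas as pd

# Hypothetical raw values, including one missing entry
styles = pd.Series(['Sedan', '4dr SUV', 'Convertible SUV', None])

cleaned = styles.str.lower().str.replace(' ', '_', regex=False)

print(cleaned.iloc[:3].tolist())  # ['sedan', '4dr_suv', 'convertible_suv']
print(cleaned.isnull().tolist())  # [False, False, False, True]
```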

Summarizing Operations

Pandas offers various methods to summarize data:

# Basic statistics for a column
df['msrp'].mean()
df['msrp'].max()
df['msrp'].min()

# Comprehensive summary statistics
df['msrp'].describe()

# Summary statistics for all numeric columns
df.describe()
df.describe().round(2)  # Round to 2 decimal places for readability

For categorical (string) columns:

# Count unique values
df['make'].nunique()  # Returns number of unique makes

# Count unique values for all columns
df.nunique()
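
To make the summaries concrete, here is a minimal sketch using the five prices and makes from the example data as standalone Series. `describe()` returns a Series indexed by statistic name, so individual values can be read with brackets.

```python
import pandas as pd

msrp = pd.Series([18000, 23000, 22000, 16000, 25000], name='msrp')
makes = pd.Series(['Toyota', 'Ford', 'Nissan', 'Nissan', 'Toyota'], name='make')

print(msrp.mean())             # 20800.0
print(msrp.min(), msrp.max())  # 16000 25000

summary = msrp.describe()      # count, mean, std, min, 25%, 50%, 75%, max
print(summary['count'])        # 5.0
print(summary['50%'])          # 22000.0 (the median)

print(makes.nunique())         # 3
```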

Handling Missing Values

Pandas marks missing values as NaN (Not a Number); a None in a numeric column, such as the engine_hp of the Nissan Altima above, is stored as NaN:

# Identify missing values (returns boolean DataFrame)
df.isnull()

# Count missing values per column
df.isnull().sum()
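
A minimal sketch with the `engine_hp` values from the example data, where one entry is `None`. It counts the missing values per column and also shows that summary methods such as `mean()` skip NaN by default.

```python
import pandas as pd

df = pd.DataFrame({
    'make': ['Toyota', 'Ford', 'Nissan', 'Nissan', 'Toyota'],
    'engine_hp': [130, 140, 149, None, 160],
})

missing_per_column = df.isnull().sum()   # True counts as 1, so the sum is a count

print(missing_per_column['make'])        # 0
print(missing_per_column['engine_hp'])   # 1
print(df['engine_hp'].mean())            # 144.75 (the NaN is ignored)
```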

Grouping

Similar to SQL’s GROUP BY, pandas allows grouping and aggregation:

# Group by transmission type and calculate mean price
df.groupby('transmission_type')['msrp'].mean()

# Calculate min and max prices by transmission type
df.groupby('transmission_type')['msrp'].min()
df.groupby('transmission_type')['msrp'].max()
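
A sketch of the grouping on the five example rows. The `agg` call is not in the notes above; it is shown here as one way to compute several statistics in a single pass instead of repeating the `groupby`.

```python
import pandas as pd

df = pd.DataFrame({
    'transmission_type': ['manual', 'automatic', 'manual', 'automatic', 'automatic'],
    'msrp': [18000, 23000, 22000, 16000, 25000],
})

# One value per group, indexed by transmission type
mean_price = df.groupby('transmission_type')['msrp'].mean()

# Several aggregations at once: a DataFrame with 'min' and 'max' columns
price_range = df.groupby('transmission_type')['msrp'].agg(['min', 'max'])

print(mean_price['manual'])                 # 20000.0
print(price_range.loc['automatic', 'min'])  # 16000
print(price_range.loc['automatic', 'max'])  # 25000
```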

Converting Between Pandas and NumPy/Python

# Get underlying NumPy array from a Series
# (df['msrp'].to_numpy() is the equivalent recommended in the current pandas docs)
numpy_array = df['msrp'].values

# Convert DataFrame to list of dictionaries
records = df.to_dict(orient='records')

This format is useful for saving data or passing it to other systems that expect Python dictionaries.
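
As an illustration on a two-row frame, each record is a plain dictionary keyed by column name, and the list can be passed back to `pd.DataFrame` to rebuild the table. The name `restored` is used only for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'make': ['Toyota', 'Ford'], 'msrp': [18000, 23000]})

records = df.to_dict(orient='records')   # one dictionary per row
print(records[0]['make'], records[0]['msrp'])

restored = pd.DataFrame(records)         # list of dictionaries back to a DataFrame
print(restored.equals(df))               # True
```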