Preparing the Dataset for the Car Price Prediction Project
In this lesson, we get practical: we’ll locate the dataset, download it, load it into Pandas, and normalize names and text values so everything is easy to work with in the next steps of the project.
1) Where the data lives and how to download it
We’ll use the copy of the dataset stored in the course GitHub repository (the folder for the “car price” chapter). Inside you’ll find two things we care about:
- A Jupyter notebook with the lesson’s code
- A CSV file with the data
Download the CSV via either:
- Your browser (“Save as…” from the raw file view), or
- The terminal with
wget <direct-file-url>
If you already have the file locally, you’ll see the usual OS behavior (e.g., appending (1) to the filename to avoid overwriting). That’s fine.
2) Load the dataset into Pandas
We’ll load the CSV and immediately peek at the first rows to sanity-check what we got:
import pandas as pd
df = pd.read_csv("path/to/your.csv") # use the actual filename from the repo
df.head()
You should see columns like make, model, year, engine and fuel characteristics, transmission information, etc. The target we want to predict is MSRP, the Manufacturer’s Suggested Retail Price.
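Before going further, two cheap sanity checks confirm the download wasn’t empty or truncated (MSRP is the raw, not-yet-normalized column name):
df.shape               # (rows, columns)
'MSRP' in df.columns   # should be True at this point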
3) Why we normalize names (and what “Index” has to do with it)
You’ll notice some inconsistencies in column names:
- A mix of upper and lower case
- Spaces in some names (e.g., "Transmission Type")
- Occasional underscores elsewhere
This matters because:
- Dot notation (df.Transmission Type) doesn’t work with spaces.
- Inconsistent casing and separators make code brittle and harder to search/grep.
In Pandas, df.columns is an Index (a label container similar to a Series) that supports vectorized string ops through the .str accessor. We’ll use that to apply the same cleanup everywhere at once.
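To see this in action, inspect the Index directly; string methods on it return a new Index rather than modifying the DataFrame (the labels in the comments are illustrative):
df.columns              # e.g., Index(['Make', 'Model', ..., 'Transmission Type', ...])
df.columns.str.lower()  # a new Index with lowercased labels; df is unchanged until we assign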
Normalize column names:
df.columns = (
    df.columns
    .str.lower()
    .str.replace(' ', '_', regex=False)
)
Now all column names are lowercase_with_underscores, which is predictable and safe to use in code.
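A one-line check confirms the rename took effect (the exact listing depends on the CSV):
df.columns  # e.g., Index(['make', 'model', 'year', ..., 'msrp'], dtype='object')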
4) Normalize text values (categorical columns)
The same inconsistency often appears in values inside string/categorical columns:
- Some makes may be "BMW" while others are "Bmw" or "bmw".
- Spaces vs underscores, etc.
First, identify which columns are text. When CSVs are read, text columns usually appear with dtype object:
df.dtypes # quick scan of types
Extract just the string columns. Two equivalent approaches:
A. With select_dtypes (cleanest):
string_cols = df.select_dtypes(include=['object']).columns.tolist()
B. With a dtype comparison (the approach used in the original lesson):
string_cols = df.dtypes[df.dtypes == 'object'].index.tolist()
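Both produce the same list of labels; if in doubt, a quick assertion makes the equivalence explicit:
# Both selection methods should agree on which columns hold strings
assert string_cols == df.select_dtypes(include=['object']).columns.tolist()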
Now apply a consistent transformation to every string column:
for col in string_cols:
    df[col] = (
        df[col]
        .str.lower()
        .str.replace(' ', '_', regex=False)
    )
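If you prefer a single statement over an explicit loop, here is an equivalent variant (a sketch with the same behavior as the loop above):
# Normalize every string column at once; apply() feeds each column in as a Series
df[string_cols] = df[string_cols].apply(
    lambda s: s.str.lower().str.replace(' ', '_', regex=False)
)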
At this point:
- All column names are lowercase_with_underscores.
- All string values are normalized the same way.
This makes downstream encoding, grouping, and analysis much less error-prone.
5) Quick verification
It’s worth re-checking the top rows and a couple of columns after cleaning:
df.head()
df[string_cols].nunique().sort_values(ascending=False).head(10)
head() confirms the text normalization looks right. nunique() helps spot suspicious duplicates that casing/spacing caused before cleaning (e.g., "Front Wheel Drive" and "front_wheel_drive" are now unified).
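Spot-checking one column is also cheap; transmission_type is one of the columns we just normalized:
# Frequency table of a normalized column: every label should now be lowercase_with_underscores
df['transmission_type'].value_counts()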
6) Notes and common pitfalls
- Dot vs bracket access: after normalization you can use df.transmission_type, but prefer df['transmission_type'] in production notebooks to avoid clashes with DataFrame attributes.
- Non-string objects: object columns can sometimes hold non-string values (e.g., mixed types or NaN). If you hit errors like “Can only use .str accessor with string values”, ensure you have strings first:
df[col] = df[col].astype('string').str.lower().str.replace(' ', '_', regex=False)
- Underscore collisions: replacing spaces with underscores can produce identical labels if the raw data had near-duplicates. Use nunique() and value_counts() to audit high-cardinality columns.
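A minimal audit sketch along those lines, using make (one of the columns mentioned earlier) as the illustrative example:
# Eyeball the most frequent labels for near-duplicates that may have collided
df['make'].value_counts().head(10)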
7) What’s next
With a clean, consistent DataFrame, we’re set up for the next steps:
Exploratory Data Analysis (EDA)
- Inspect distributions (e.g., MSRP, horsepower, engine size)
- Check missing values and outliers
- Understand relationships between features and price
Train a baseline model (Linear Regression)
- Split into train/validation sets
- Fit the model and compute RMSE
- Establish a benchmark for later improvements
Feature engineering
- Derive useful features (e.g., car age from year)
- Consolidate rare categories
We’ll start with EDA and a careful train/validation split in the next lesson.