Car Price Prediction Project – Introduction and Plan
In this session, we’ll begin a hands-on project where we will implement a car price prediction model from start to finish.
In the project scenario, we imagine an online classified ads website where users can sell their cars. Sellers need to enter a price for their car listing, but often they don’t know the optimal value to choose. Our goal is to build a model that can assist them by predicting a fair and competitive price based on the characteristics of the car they are selling.
The idea is straightforward: a user enters details about the car — such as the make, model, year, engine type, fuel type, and other features — and our model suggests a price. By the end of this project, we will have a working predictive system that could be integrated into such a website.
The Dataset
To train our model, we’ll use a dataset from Kaggle that contains real car listings along with their prices. Each record in the dataset represents a single car and includes multiple features:
- Make (manufacturer) and Model
- Year of manufacture
- Engine details (size, type, number of cylinders)
- Fuel type
- Transmission type
- And most importantly, MSRP — the Manufacturer’s Suggested Retail Price, which will be our target variable.
MSRP is essentially the baseline price recommended by the manufacturer for a new car. In our dataset, it may reflect new or used prices depending on the listing. This is the value we aim to predict using the other features as input.
Even at a first glance, the dataset contains mixed data types:
- Numeric features (e.g., engine size, horsepower)
- Categorical features (e.g., make, fuel type)
- And the price, which is numeric and continuous.
Project Plan
We will structure the project in several clear steps:
Step 1 – Data Collection and Initial Exploration
We’ll begin by loading the dataset and examining it to understand what’s inside. This is called Exploratory Data Analysis (EDA). We will:
- View the first few rows to get an overview of the features.
- Check the data types of each column.
- Identify missing values or inconsistencies.
- Note any unusual formatting (for example, uppercase vs lowercase in category names).
Step 2 – Data Preparation
Before training any models, we need to make sure the data is clean and consistent. This may include:
- Standardizing column names (e.g., making them lowercase and replacing spaces with underscores).
- Cleaning categorical values so they follow a consistent format.
- Converting columns to the right data types.
- Handling missing values appropriately.
Step 3 – Baseline Model with Linear Regression
Our first predictive model will be linear regression. We will train it using the cleaned dataset to predict car prices. This will give us a baseline for model performance.
Step 4 – Understanding Linear Regression Internals
We won’t just use a library function — we will also look under the hood and implement the key steps of linear regression manually. This will help solidify our understanding of how the model works mathematically.
Step 5 – Model Evaluation
To measure the accuracy of our predictions, we will use RMSE (Root Mean Squared Error). RMSE is a common metric for regression tasks that penalizes large errors more heavily. We’ll learn how to compute it and interpret the results.
Step 6 – Feature Engineering
We’ll explore ways to improve our model by creating new features from existing ones — for example, converting a date into an “age of the car” feature or grouping rare categories together. This step is crucial for boosting predictive performance.
Step 7 – Addressing Numerical Stability Issues
As we experiment, we might encounter numerical stability problems in our model calculations. We’ll discuss regularization, a technique that helps control model complexity and improve generalization.
Step 8 – Final Model and Wrap-Up
We’ll integrate all improvements into a final version of our price prediction model, ready for potential deployment.
Code and Resources
All the code we use in this session is available in the GitHub repository for the book accompanying this course:
- Repository:
mlbookcamp-code
- Location:
chapter-02/car-price
Inside this folder, you’ll find:
- A Jupyter notebook with all the code we will write during the session.
- The CSV file containing the car price dataset we’ll use for training.
You can follow along live, or review the notebook later to replicate the steps on your own.
What’s Next
In the next lesson, we’ll start working directly with the dataset. Our first task will be data preparation:
- Loading the CSV file into Pandas.
- Cleaning and normalizing column names and categorical values.
- Ensuring the dataset is ready for analysis and modeling.
Once the data is in good shape, we’ll move on to EDA and our first linear regression model.