Building a Baseline Car Price Model
In this lesson we turn the math and plumbing from the previous sessions into a baseline model for predicting car prices. The goal is not “best possible performance” yet—it’s a clean, minimal pipeline that runs end-to-end, so we can (a) verify our data path and (b) establish a benchmark to improve later.
We’ll:
- select a small set of numeric features,
- handle missing values pragmatically,
- train linear regression (on the log-transformed target), and
- make a quick in-sample sanity check of predictions—then tee up proper evaluation with RMSE next.
1) Pick a minimal feature set
For a baseline, we’ll restrict ourselves to straightforward numeric columns. From the dataframe, we’ll use:
- engine_hp (engine horsepower)
- engine_cylinders
- highway_mpg
- city_mpg
- popularity
These are easy to feed to a linear model and already numeric after our earlier cleaning.
numeric_cols = [
    "engine_hp",
    "engine_cylinders",
    "highway_mpg",
    "city_mpg",
    "popularity",
]

df_train_base = df_train[numeric_cols].copy()
df_val_base = df_val[numeric_cols].copy()
df_test_base = df_test[numeric_cols].copy()
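A quick way to confirm these really are numeric (assuming the df_train/df_val/df_test splits from the earlier lessons):

# Every column should report a numeric dtype (int64/float64); an "object"
# dtype would mean leftover strings to clean before modeling.
df_train_base.dtypes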
2) Handle missing values (simple, explicit)
You will likely see NaN in engine_hp and engine_cylinders. A baseline approach is to fill missing values with 0 and move on. It’s not semantically perfect (no car has “0 cylinders”), but it’s fast and often adequate for a first pass.
Why 0 can be “OK enough” here: with linear regression, a 0 in a feature effectively removes its contribution for that row. You’re telling the model “ignore this feature for this sample.” We’ll revisit better imputation later (median/mean, model-based, or adding a missingness indicator).
for part in (df_train_base, df_val_base, df_test_base):
    part.fillna(0, inplace=True)
Double-check:
df_train_base.isna().sum()
Everything should be zero.
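As a preview of the better imputation mentioned above, a minimal sketch of median filling plus a missingness indicator. Nothing here is used for the baseline; the statistics come from the training split only (to avoid leakage), and the hp_missing column name is just illustrative.

# Alternative imputation (sketch, not used for the baseline):
train_medians = df_train[numeric_cols].median()  # fit on train only

df_train_med = df_train[numeric_cols].fillna(train_medians)
df_val_med = df_val[numeric_cols].fillna(train_medians)

# Optional: let the model see *that* a value was missing
df_train_med["hp_missing"] = df_train["engine_hp"].isna().astype(int)
df_val_med["hp_missing"] = df_val["engine_hp"].isna().astype(int)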
3) Build X and y (remember: y is log1p(msrp))
We prepared our targets earlier as y_* = log1p(msrp). Keep using those for stability with skewed price distributions.
X_train = df_train_base.values # (m_train, 5)
X_val = df_val_base.values # (m_val, 5)
X_test = df_test_base.values # (m_test, 5)
# Already computed earlier when we split and transformed:
# y_train = np.log1p(original_train_msrp)
# y_val = np.log1p(original_val_msrp)
# y_test = np.log1p(original_test_msrp)
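Before training, a cheap guard against mismatched splits (assuming y_train/y_val/y_test are the log-targets prepared earlier):

# Features and targets must agree row-for-row
assert X_train.shape[0] == y_train.shape[0]
assert X_val.shape[0] == y_val.shape[0]
assert X_test.shape[0] == y_test.shape[0]
assert X_train.shape[1] == len(numeric_cols)  # 5 features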
4) Train linear regression (normal equation)
We’ll reuse a minimal trainer from the previous lesson that adds the bias column and solves the normal equation.
import numpy as np
def add_bias_column(X):
    m = X.shape[0]
    return np.hstack([np.ones((m, 1)), X])

def train_linear_regression_normal(X, y):
    X_aug = add_bias_column(X)
    XtX = X_aug.T @ X_aug
    Xty = X_aug.T @ y
    w_aug = np.linalg.solve(XtX, Xty)  # robust enough for a baseline; see note below
    w0, w = w_aug[0], w_aug[1:]
    return w0, w
w0, w = train_linear_regression_normal(X_train, y_train)
If np.linalg.solve complains about singularity (high collinearity), switch to a more robust least-squares solver for the baseline:

w_aug = np.linalg.lstsq(add_bias_column(X_train), y_train, rcond=None)[0]
w0, w = w_aug[0], w_aug[1:]
5) In-sample predictions (sanity check, not evaluation)
Make predictions on the training data to check for obvious pathologies (e.g., completely flat predictions, dtype mistakes, etc.).
def predict_linear(X, w0, w):
    return w0 + X @ w  # still in log space
y_train_pred_log = predict_linear(X_train, w0, w)
A quick overlay of the training target vs training predictions (both in log space) can reveal gross mismatches:
import matplotlib.pyplot as plt
plt.hist(y_train_pred_log, bins=50, alpha=0.6, label="pred (log)")
plt.hist(y_train, bins=50, alpha=0.6, label="true (log)")
plt.legend()
plt.title("In-sample check (log space)")
plt.show()
What you may observe for this simple baseline:
- The peak positions of predicted vs true distributions don’t align perfectly.
- Predictions can be shifted lower (systematic underestimation), reflecting that the feature set is small and we haven’t used categorical signals (e.g., make, model) or better imputations yet.
This is fine—remember: it’s a baseline.
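If you want numbers rather than eyeballing, a couple of cheap checks in log space (a sketch; exact values depend on your split):

# With an intercept, in-sample residuals average out to roughly zero by
# construction, so compare spreads: predictions typically vary less than truth.
print("pred mean/std:", y_train_pred_log.mean(), y_train_pred_log.std())
print("true mean/std:", y_train.mean(), y_train.std())
print("max |residual|:", np.max(np.abs(y_train - y_train_pred_log)))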
6) Why we don’t trust this chart for performance
An in-sample histogram is only a sanity check. It says nothing about generalization. For real assessment we’ll compute RMSE on the validation split, not the training set, and we’ll be explicit about the space:
- Log-space RMSE: compare y_val to y_val_pred_log (the model’s outputs).
- Dollar RMSE: compare expm1(y_val) to expm1(y_val_pred_log) (both sketched below).
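A minimal sketch of both comparisons, ahead of the formal treatment (y_val_pred_log comes from the same predict_linear, just applied to the validation features):

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_val_pred_log = predict_linear(X_val, w0, w)
rmse_log = rmse(y_val, y_val_pred_log)                      # log-space error
rmse_usd = rmse(np.expm1(y_val), np.expm1(y_val_pred_log))  # dollar error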
We’ll formalize this in the next lesson.
7) Baseline takeaway and next steps
What we now have:
- A minimal numeric-only feature set
- A simple missingness strategy (zeros)
- A trained linear model with intercept
- An in-sample sanity check confirming the pipeline behaves as expected
What’s next:
- Introduce RMSE and evaluate on the validation set.
- Iterate: try median imputation, add categorical encodings (one-hot for make, model, transmission_type, etc.), and compare RMSE to this baseline.
- Add regularization (Ridge) to stabilize weights when we expand features (a minimal sketch below).
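For that last item, a minimal sketch of ridge via the normal equation (the penalty r would be tuned on the validation set; note this simple version also penalizes the intercept):

def train_linear_regression_ridge(X, y, r=0.01):
    X_aug = add_bias_column(X)
    XtX = X_aug.T @ X_aug
    XtX = XtX + r * np.eye(XtX.shape[0])  # shrink weights, tame near-singular XtX
    w_aug = np.linalg.solve(XtX, X_aug.T @ y)
    return w_aug[0], w_aug[1:]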
This is the benchmark we’ll try to beat.