Adding Car Age to Improve the Baseline
Goal: enrich our baseline numeric-only model by adding a simple but high-signal feature: car age. We’ll (1) motivate age vs. year, (2) compute age safely, (3) integrate it into a reusable prepare_X
function without mutating inputs, and (4) retrain/evaluate on the validation split. You’ll see that a single, well-chosen transformation can materially reduce error.
Why use age instead of year?
Modeling with year
encodes a direction (“larger number = newer”), but linear models respond more naturally to a quantity that grows with depreciation: age = current_year − year. Age tends to correlate linearly with log-price for much of the range (with exceptions for classics), which makes it a strong baseline signal.
- Newer car → smaller age → typically higher price.
- Older car → larger age → typically lower price.
In our dataset, assume data were collected in 2017, so we’ll use reference_year = 2017
.
Implementation details (and pitfalls to avoid)
- Compute
age = reference_year - year
. - Clamp negative ages (future model years) to 0.
- Don’t mutate the caller’s dataframe inside
prepare_X
; make a copy. - Keep the exact same transformation for train/val/test to avoid leakage or drift.
Updated feature prep: add age
We’ll extend our baseline numeric features:
NUMERIC_COLS = [
"engine_hp",
"engine_cylinders",
"highway_mpg",
"city_mpg",
"popularity",
]
…and add age
:
import numpy as np
REFERENCE_YEAR = 2017 # from dataset context; parameterize if needed
def prepare_X(df, reference_year=REFERENCE_YEAR):
"""
Build the design matrix X with baseline numeric features + computed 'age'.
Does not mutate the input dataframe.
"""
d = df.copy() # avoid side effects
# Compute age and clamp to [0, ∞)
d["age"] = np.maximum(0, reference_year - d["year"])
# Select features (baseline + age)
features = NUMERIC_COLS + ["age"]
d = d[features].copy()
# Baseline imputation: fill numerics with 0
# (Simple but consistent across splits; we can improve later.)
d.fillna(0, inplace=True)
return d.values # (m, n_features)
Sanity check: With
age
included, your feature count should increase from 5 to 6.
Train on train, evaluate on validation
(Using the linear-regression code from earlier lessons.)
def add_bias_column(X):
m = X.shape[0]
return np.hstack([np.ones((m, 1)), X])
def train_linear_regression_normal(X, y):
X_aug = add_bias_column(X)
XtX = X_aug.T @ X_aug
Xty = X_aug.T @ y
# If singular, switch to np.linalg.lstsq or add ridge.
w_aug = np.linalg.solve(XtX, Xty)
w0, w = w_aug[0], w_aug[1:]
return w0, w
def predict_linear(X, w0, w):
return w0 + X @ w
def rmse(y_true, y_pred):
return np.sqrt(np.mean((y_pred - y_true) ** 2))
# Build matrices
X_train = prepare_X(df_train)
X_val = prepare_X(df_val)
# y_* are in log space: y = log1p(msrp)
w0, w = train_linear_regression_normal(X_train, y_train)
# Validate
y_val_pred_log = predict_linear(X_val, w0, w)
val_rmse_log = rmse(y_val, y_val_pred_log)
# (Optional) report in dollars
y_val_pred = np.expm1(y_val_pred_log)
y_val_true = np.expm1(y_val)
val_rmse_$ = rmse(y_val_true, y_val_pred)
print("Validation RMSE (log):", val_rmse_log)
print("Validation RMSE ($): ", val_rmse_$)
What to expect: In practice, adding age
often yields a clear improvement. In the walkthrough, validation RMSE (log) dropped from roughly 0.76 → 0.51—a large gain for a single engineered feature. Your exact numbers will vary by split and cleaning, but directionally this is common.
Quick visual: predictions vs. truth (validation)
A simple overlay helps confirm we’re moving in the right direction:
import matplotlib.pyplot as plt
plt.hist(y_val_pred_log, bins=50, alpha=0.6, label="pred (log)")
plt.hist(y_val, bins=50, alpha=0.6, label="true (log)")
plt.legend(); plt.title("Validation — log targets vs. predictions");
plt.show()
You should see the peaks align better versus the numeric-only baseline. There will still be regions where the model underfits (e.g., extreme tails). That’s expected at this stage.
Notes & extensions
- Parameterize
reference_year
: if you’re unsure, use the dataset’s collection year ordf["year"].max()
(but fix it based on train to avoid peeking). - Nonlinearity: price-age relationships aren’t perfectly linear. Later, try piecewise or polynomial terms (e.g.,
age
,age^2
) and regularize. - Imputation: replace the 0-fill with train medians and add missingness indicators (e.g.,
is_engine_hp_missing
). - Regularization: when you expand features, switch to Ridge to stabilize weights.
Where we are, what’s next
We upgraded our baseline with a single, information-dense feature while keeping the pipeline clean and reproducible:
prepare_X
does not mutate inputs,- consistent transforms across train/val/test,
- measurable improvement in validation RMSE.
Next up: incorporate categorical variables (e.g., make
, model
, transmission_type
) via one-hot encoding and validate the gains.