Train the final model and use it to price a single car
Alright, we picked our best regularization strength (λ) on the validation set. Now we’ll (1) retrain once on all the data we’re allowed to learn from (train + validation), (2) evaluate on the held-out test set, and (3) show how to price an individual car from a plain Python dict.
1) Why retrain on train + validation?
During tuning we only fit on the train split to keep validation “clean” for selection. Once λ is chosen, we can safely fold validation back in to squeeze a bit more signal out of the data. The test set stays untouched until the very end for a single, honest report.
2) Build the full training set
Concatenate train and validation dataframes:
df_full_train = pd.concat([df_train, df_val], axis=0).reset_index(drop=True)
Targets: If you already have
y_train
andy_val
asnp.log1p(msrp)
, just concatenate:y_full_train = np.concatenate([y_train, y_val])
Tip: keep the same feature prep pipeline you used before. Don’t silently change anything between splits.
3) Prepare features with the same vocabulary
Your prepare_x(df)
should do exactly what it did during tuning:
- numeric features (+ the car age feature),
- chosen categorical one-hots (e.g., top-K per category),
- fill NA (you used zeros earlier),
- keep column order consistent.
If your prepare_x
currently discovers top-K categories from whatever df you pass, freeze that list from training data and reuse it for validation/test:
# One-time, from df_full_train
categories = build_category_vocab(df_full_train, k=5) # dict: feature -> list of top values
X_full = prepare_x(df_full_train, categories) # uses the frozen vocab
Prepare the test matrix with the same vocab:
X_test = prepare_x(df_test, categories)
If a test value isn’t in the top-K, all its one-hots will be 0 for that feature group (or map to an explicit “other” flag if you added one).
4) Train Ridge (regularized linear regression) with the chosen λ
We’ll use the normal equation with diagonal shrinkage (Ridge) and no penalty on the intercept (i.e., the first diagonal element is 0, others are 1):
w0, w = train_linear_regression_reg(X_full, y_full_train, r=best_lambda)
Where train_linear_regression_reg
implements:
with $D=\text{diag}(0,1,1,\dots,1)$.
5) Final evaluation on the test set
Predict on X_test, compute RMSE in log space (same as you tuned):
y_pred_val = X_test.dot(w) + w0
rmse_log = rmse(y_test, y_pred_val)
Optionally also report RMSE in dollars (invert the log1p):
rmse_dollars = np.sqrt(np.mean((np.expm1(y_test) - np.expm1(y_pred_val))**2))
You should see test RMSE close to your validation RMSE (small drift is normal). If it’s much worse, suspect data leakage or a mismatch in feature prep.
6) Use the model to price a single car
Imagine your app posts a JSON payload; you’ll get a plain dict like:
car = {
"make": "toyota",
"model": "sienna",
"year": 2016,
"engine_hp": 266,
"engine_cylinders": 6,
"transmission_type": "automatic",
"driven_wheels": "front_wheel_drive",
"number_of_doors": 4,
"market_category": "minivan,performance",
"vehicle_size": "midsize",
"vehicle_style": "van",
"highway_mpg": 25,
"city_mpg": 18,
"popularity": 2031
}
Turn it into a 1-row dataframe, prepare features, predict log-price, and convert back:
df_one = pd.DataFrame([car])
X_one = prepare_x(df_one, categories) # same vocab!
y_log = X_one.dot(w) + w0
price = float(np.expm1(y_log)) # dollars
# Optional: nice formatting
price_rounded = int(np.round(price, -2)) # e.g., nearest $100
To sanity-check, compare against the true MSRP for that row in your test set; being off by a few thousand dollars is typical for a simple linear baseline.
7) Save everything you need to serve the model
When you deploy, you must recreate the exact feature vector:
w0
(intercept) andw
(weights)- the feature order your model expects
- the category vocabulary (
categories
dict) - any constants used in feature engineering (e.g., the reference year
2017
forage
) - NA fill rules (you used zeros)
A simple way:
np.savez("car_price_model.npz", w0=w0, w=w)
with open("car_price_features.json", "w") as f:
json.dump({"categories": categories,
"numeric_features": base_numeric_feature_list,
"reference_year": 2017}, f)
At inference time, load artifacts, run prepare_x
with the frozen vocab, then price = expm1(X.dot(w)+w0)
.
8) Quick diagnostics (nice to do once)
- Histogram of
y_pred
vsy_test
(in log space) — shapes should be similar. - Residuals vs. prediction — look for patterns (curvature hints at missing nonlinearity).
- Segmented error — e.g., RMSE by vehicle size or age decile to spot pockets of weakness.
What “good” looks like here
- Test RMSE close to validation RMSE → generalization looks healthy.
- Individual prediction within a few thousand dollars → solid for a linear baseline.
- Clean, reproducible pipeline → easy to iterate (add features, try different K for top categories, etc.).
That’s it: you’ve trained the final Ridge model on full train data, verified it on test, and wired it to predict a single car’s price. Next natural steps would be feature upgrades (e.g., richer categorical coverage, interaction terms) or trying tree-based models to capture nonlinearity.