Validating a Machine Learning Model with a Proper Validation Set (Lesson 10)
In this lesson we stop “peeking” at training performance and evaluate the model the right way—on a separate validation split. We’ll factor our feature prep into a reusable function, retrain on the training split, and report RMSE on the validation split. From here on, this will be the pattern for every improvement.
Why validation (not training) error?
Training error almost always looks optimistic because the model has already “seen” those rows. To estimate generalization, we:
- Train on df_train.
- Validate on df_val (never used during training).
- Keep df_test untouched until the very end.
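For reference, here is a minimal sketch of how such a three-way split could have been made in an earlier lesson. The 60/20/20 fractions, the seed, and the name df for the full dataset are illustrative assumptions, not requirements.
import numpy as np
# Assumed: df is the full dataset, split 60/20/20 into train/val/test.
n = len(df)
n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - n_val - n_test
idx = np.arange(n)
np.random.seed(2)  # fix the shuffle so the split is reproducible
np.random.shuffle(idx)
df_shuffled = df.iloc[idx]
df_train = df_shuffled.iloc[:n_train].reset_index(drop=True)
df_val = df_shuffled.iloc[n_train:n_train + n_val].reset_index(drop=True)
df_test = df_shuffled.iloc[n_train + n_val:].reset_index(drop=True)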
Step 1 — A reusable feature-prep function
Two lessons ago we hardcoded our feature prep inline. Let’s encapsulate it so we can apply the same transformation to train/val/test without drift.
import numpy as np
NUMERIC_COLS = [
    "engine_hp",
    "engine_cylinders",
    "highway_mpg",
    "city_mpg",
    "popularity",
]

def prepare_X(df):
    """
    Select baseline numeric features and make a clean design matrix X.
    Applies the *same* steps for train/val/test to avoid leakage or drift.
    """
    df_num = df[NUMERIC_COLS].copy()
    df_num = df_num.fillna(0)  # baseline imputation with a fixed value
    return df_num.values       # shape (m, n)
Key idea: one function, one definition of “X.” No silent differences between splits.
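A cheap guard worth placing right after the definition (a suggestion of ours, not part of the lesson): assert that every split yields the same number of feature columns.
# Every split must produce an identical set of feature columns.
assert prepare_X(df_train).shape[1] == prepare_X(df_val).shape[1]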
Step 2 — Training: normal equation (as before)
We’ll reuse the minimal trainer that adds a bias column and solves the normal equations. (Any equivalent solver, such as np.linalg.lstsq or np.linalg.pinv, is fine.)
def add_bias_column(X):
    m = X.shape[0]
    return np.hstack([np.ones((m, 1)), X])

def train_linear_regression_normal(X, y):
    X_aug = add_bias_column(X)
    XtX = X_aug.T @ X_aug
    Xty = X_aug.T @ y
    w_aug = np.linalg.solve(XtX, Xty)  # switch to lstsq/pinv if XtX is singular
    w0, w = w_aug[0], w_aug[1:]
    return w0, w

def predict_linear(X, w0, w):
    return w0 + X @ w  # predictions in *log* space
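If XtX turns out to be singular or badly conditioned, np.linalg.solve raises an error. A drop-in least-squares variant (a sketch of the alternative mentioned above, not the lesson's code) sidesteps that:
def train_linear_regression_lstsq(X, y):
    # Least-squares solve of the augmented system; tolerates a singular XtX.
    X_aug = add_bias_column(X)
    w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return w_aug[0], w_aug[1:]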
Step 3 — Metric: RMSE (unchanged)
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
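A one-line sanity check on hand-computed values never hurts:
# errors of 3 and 4 -> mean squared error 12.5 -> RMSE = sqrt(12.5) ≈ 3.54
assert np.isclose(rmse(np.array([0.0, 0.0]), np.array([3.0, 4.0])), np.sqrt(12.5))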
Remember: our targets are log-transformed (y = log1p(msrp)), so predictions are in log space too. You may compute RMSE in log space or invert both with expm1 and compute a dollar RMSE; just be consistent.
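One way to read a log-space RMSE (an aside of ours, not from the lesson): since log1p(msrp) ≈ log(msrp) for realistic car prices, an RMSE of r in log space means a typical prediction is off by a multiplicative factor of roughly exp(r).
r = 0.75          # an illustrative log-space RMSE, not a result
print(np.exp(r))  # ≈ 2.12: a typical prediction misses by about 2x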
Step 4 — Train on train, evaluate on validation
# Build design matrices
X_train = prepare_X(df_train)
X_val = prepare_X(df_val)
# y_* were created earlier as log1p(msrp)
w0, w = train_linear_regression_normal(X_train, y_train)
# Predict on *validation*
y_val_pred_log = predict_linear(X_val, w0, w)
# RMSE in log space
val_rmse_log = rmse(y_val, y_val_pred_log)
# (Optional) RMSE in dollars
y_val_pred = np.expm1(y_val_pred_log)
y_val_true = np.expm1(y_val)
val_rmse_dollars = rmse(y_val_true, y_val_pred)
print("Validation RMSE (log):", val_rmse_log)
print("Validation RMSE ($): ", val_rmse_dollars)
Now you have a single number that reflects generalization quality for this baseline.
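To put that single number in context, compare it against a naive baseline that predicts the training-set mean target for every validation row (a sanity check of ours, not part of the lesson); the trained model should clearly beat it.
# Naive baseline: constant prediction equal to the training mean.
baseline_pred = np.full_like(y_val, y_train.mean())
print("Mean-baseline RMSE (log):", rmse(y_val, baseline_pred))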
Common pitfalls (avoid these)
- Different prep between splits. Always call the same prepare_X for train/val/test.
- Using training rows for RMSE. Only compute the reported metric on validation (test only once at the very end).
- Mixing spaces. Don’t compare log predictions to dollar targets (or vice versa). Either both log or both dollars.
- Data leakage. Never compute imputations/statistics on the full dataset. (For this baseline we used a fixed value—0—so no fit-time statistics were leaked. When you switch to mean/median, compute them on train and apply to val/test; see the sketch after this list.)
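Here is that last point as a minimal sketch (illustrative only; prepare_X_median is a hypothetical name, and the lesson's baseline keeps the fixed 0):
# Fit imputation statistics on the training split only...
train_medians = df_train[NUMERIC_COLS].median()

def prepare_X_median(df):
    df_num = df[NUMERIC_COLS].copy()
    df_num = df_num.fillna(train_medians)  # ...then apply them to every split
    return df_num.values
Applied identically to df_train, df_val, and df_test, this keeps val/test rows out of the fitted statistics.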
Quick checklist
- Split: train / val / (test later)
- Single source of truth for feature prep: prepare_X
- Train on train only
- Predict on val only
- Report RMSE (log and/or $), clearly labeled
What’s next
With a trustworthy validation RMSE, we can begin improving the model and keeping only changes that lower that number:
- Better imputation (e.g., train median + missing-indicator features).
- Add categorical features (one-hot for make, model, transmission_type, etc.).
- Regularization (Ridge) to stabilize weights as the feature space grows (a minimal sketch follows this list).
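As a preview, the regularized trainer is a one-line change to the normal equations (a sketch under the usual ridge formulation; note it also penalizes the bias term, which a more careful version would exempt):
def train_linear_regression_reg(X, y, r=0.01):
    X_aug = add_bias_column(X)
    XtX = X_aug.T @ X_aug
    XtX = XtX + r * np.eye(XtX.shape[0])  # ridge: add r to the diagonal
    w_aug = np.linalg.solve(XtX, X_aug.T @ y)
    return w_aug[0], w_aug[1:]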
Each change → retrain on train → recompute validation RMSE with the same prepare_X logic. Keep the change only if the metric moves in the right direction (down).