Lesson 14 — Finding the best regularization strength for Ridge regression
Why we’re doing this
Last time we added regularization (Ridge) to tame exploding weights from one-hot features. Ridge introduces a single knob, the regularization strength λ (your r
). Turn it too low and the model stays wobbly; too high and it becomes a flat, underfitting pancake. Today we’ll pick λ systematically using the validation set.
What “best” means here
We’ll define “best” as the value of λ that minimizes validation RMSE (the same RMSE we’ve been using on the log-price target). That gives us a number we can defend: the model that generalizes best to unseen-but-similar data.
Tiny reminder: our target is
log1p(price)
. RMSE is therefore measured in log-dollars. That’s fine for model selection. If you want intuition in dollars later, you can convert predictions back withexpm1
and compute RMSE there as well (that’s essentially RMSLE vs. RMSE on original scale trade-offs).
The search plan (simple, solid, and fast)
Fix the data pipeline.
- Use the same
prepare_x(df)
for both train and validation. - Build category lists (top-K values per categorical feature) from train only, reuse on validation.
- Keep the same column order for X across splits.
- Use the same
Choose a grid of λ values on a logarithmic scale. For example:
\[\lambda \in \{0, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}, 1, 10\}.\]0
gives you the non-regularized baseline.- The rest span several orders of magnitude so we don’t miss the sweet spot.
For each λ in the grid:
- Train: $(X_{\text{train}}^\top X_{\text{train}} + \lambda D)^{-1} X_{\text{train}}^\top y_{\text{train}}$ (with no penalty on the intercept; i.e., the first diagonal entry in $D$ is 0, the rest are 1s).
- Predict on validation: $\hat{y}{\text{val}} = X{\text{val}} w$.
- Compute RMSE on validation.
Pick λ with the lowest validation RMSE.
- If multiple λ’s tie within noise, prefer the larger one (slightly more shrinkage = safer).
Retrain one last time with the chosen λ on the train set (we’ll fold val in later when we finalize).
Reading the output you’ll print
You’ll likely print a row per λ, something like:
lambda=0 bias=HUGE rmse=HIGH
lambda=1e-6 bias=reasonable rmse=↓
lambda=1e-4 bias=stable rmse=↓ a bit more
lambda=1e-2 bias=stable rmse=best (or near-best)
lambda=1 bias=shrunk rmse=↑ (underfitting)
lambda=10 bias=tiny rmse=↑↑
The typical pattern: dramatic improvement from 0 → small λ, a broad flat minimum somewhere around $10^{-3}$–$10^{-1}$, then degradation as λ gets large.
If your printout suggests λ≈0.01 performs best (as you saw), that’s a perfectly reasonable choice.
Guardrails and gotchas
- Dummy variable trap: keep an intercept or all dummies minus one per category, not both. (Ridge helps numerically but can’t fix a logical redundancy.)
- Feature consistency:
prepare_x
must build the same columns for train and validation. Lock in the category list from train; unseen categories in validation should simply produce all-zeros across the corresponding one-hots (or map to an “other” flag if you added one). - Randomness: If shuffling affects your split, set a random seed so the search is reproducible while you iterate.
- Metric scale: Remember your RMSE is on the log target; that’s intentional. Don’t mix scales mid-experiment.
Optional polish (nice to have)
- Plot RMSE vs. log10(λ). It’s easier to see the U-shape and pick the stable zone.
- K-fold cross-validation on train instead of a single validation split if your dataset is small or you want a more robust pick.
- Nested CV if you’re publishing results or need strict separation between model selection and performance estimation. For a practical project, the train/val/test scheme is usually sufficient.
What happens next
You now have the λ that behaves best on validation. In the next lesson, we’ll do the final sanity check:
- Retrain with that λ (often on train + validation to use all the signal).
- Evaluate once on the held-out test set to report the final RMSE.
- Convert a few predictions back to dollars to sanity-check they make sense.
That’s the full loop: regularize → tune λ → lock it in → test.