Module 2 Recap — Regression Project

Goal

Predict car MSRP from tabular features (make, model, year, specs, etc.).

Data prep & EDA

  • Cleaning: lowercased column names, replaced spaces with underscores; normalized string values.
  • EDA: MSRP had a long tail → applied log1p(msrp) to stabilize variance and help models.
  • Missing values: identified with isna().sum(); for baseline, filled numeric NAs with 0 (simple, not always ideal).
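
A minimal sketch of the cleaning and target transform above, assuming the course's car-price dataset; the file path and exact column names are assumptions:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')  # path is an assumption

# Lowercase column names and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Normalize string values the same way
string_cols = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_cols:
    df[col] = df[col].str.lower().str.replace(' ', '_')

# Long-tailed target -> log1p stabilizes variance
y = np.log1p(df.msrp.values)

# Count missing values per column
print(df.isna().sum())
```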

Validation framework

  • Split into train / val / test (60/20/20).
  • Wrote a reusable prepare_x(df) (sketched after this list) to:

    • Build numeric features (+ age = 2017 - year).
    • Add chosen categorical one-hots (e.g., top-K per category).
    • Fill NAs consistently.
    • Return NumPy arrays with fixed column order.
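
A sketch of the split plus a skeletal prepare_x, assuming the cleaned df above; the seed value, the base column list, and the 2017 reference year follow the course setup but are otherwise assumptions:

```python
import numpy as np

# Shuffle once with a fixed seed, then slice 60/20/20
n = len(df)
n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(2)  # seed value is an assumption
np.random.shuffle(idx)
df_shuffled = df.iloc[idx].reset_index(drop=True)

df_train = df_shuffled.iloc[:n_train]
df_val = df_shuffled.iloc[n_train:n_train + n_val]
df_test = df_shuffled.iloc[n_train + n_val:]

y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)

base = ['engine_hp', 'engine_cylinders', 'highway_mpg',
        'city_mpg', 'popularity']  # numeric columns (assumed names)

def prepare_x(df):
    df = df.copy()
    df['age'] = 2017 - df.year  # dataset reference year
    features = base + ['age']
    # ...one-hot columns are appended here in a fixed order (see below)
    return df[features].fillna(0).values

X_train = prepare_x(df_train)
X_val = prepare_x(df_val)
```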

Linear regression, from scratch

  • Started with scalar form → vector dot product → full matrix × vector form.
  • Trained via the normal equation: $w=(X^\top X)^{-1}X^\top y$ (implementation sketched below).
  • Baseline used only numeric features → OK but underfit.
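
A from-scratch trainer via the normal equation, assuming the arrays from the sketch above; prepending a bias column of ones makes w0 part of the same solve:

```python
import numpy as np

def train_linear_regression(X, y):
    # Prepend a column of ones so the intercept is learned jointly
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # Normal equation: w = (X^T X)^{-1} X^T y
    XTX = X.T.dot(X)
    w_full = np.linalg.inv(XTX).dot(X.T).dot(y)
    return w_full[0], w_full[1:]  # w0, feature weights

w0, w = train_linear_regression(X_train, y_train)
y_pred = w0 + X_val.dot(w)
```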

Metric

  • RMSE on the validation set (computed in log space to match the target transform).
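
The metric in code; both arguments are already in log space, so no back-transform is needed before comparing models:

```python
import numpy as np

def rmse(y, y_pred):
    # Root mean squared error on log1p(msrp) values
    se = (y - y_pred) ** 2
    return np.sqrt(se.mean())

print(rmse(y_val, y_pred))
```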

Feature engineering

  • Age feature gave a big improvement.
  • Added categoricals with one-hot encoding (e.g., number_of_doors, make, fuel, transmission, driven_wheels, size, style, market_category); see the sketch after this list.
  • First attempt blew up (huge weights and RMSE): redundant, highly correlated one-hot columns made $X^\top X$ singular or near-singular, so inverting it was numerically unstable.
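
A sketch of the manual top-K one-hot step, assuming K=5 and the (abbreviated) column names in the bullet above; the vocab is built from the training split only and frozen:

```python
categorical = ['make', 'fuel', 'transmission', 'driven_wheels',
               'size', 'style', 'market_category']  # abbreviated names,
                                                    # as in the bullet above

# Top-K most frequent values per category, from the training split only
top_k = {col: list(df_train[col].value_counts().head(5).index)
         for col in categorical}

def add_one_hot(df, features):
    # Called from prepare_x: append binary columns in a fixed order
    for col, values in top_k.items():
        for v in values:
            name = f'{col}_{v}'
            df[name] = (df[col] == v).astype(int)
            features.append(name)
```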

Regularization (Ridge)

  • Fixed the instability by adding λ to the diagonal of $X^\top X$: $w=(X^\top X+\lambda D)^{-1}X^\top y$, where $D$ is the identity matrix with a 0 in the intercept position (no penalty on the intercept).
  • Tuned λ on the validation set over a grid of values; several were close, so picked a small λ that stabilized training and slightly improved RMSE (tuning loop sketched below).
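
A sketch of the ridge variant and tuning loop, reusing rmse and the arrays from above; $D$ is built as the identity with its intercept entry zeroed, and the λ grid is illustrative:

```python
import numpy as np

def train_linear_regression_reg(X, y, lam):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    D = np.eye(XTX.shape[0])
    D[0, 0] = 0.0  # no penalty on the intercept
    w_full = np.linalg.inv(XTX + lam * D).dot(X.T).dot(y)
    return w_full[0], w_full[1:]

# Pick lambda on the validation set (grid is illustrative)
for lam in [1e-6, 1e-4, 1e-3, 1e-2, 0.1, 1, 10]:
    w0, w = train_linear_regression_reg(X_train, y_train, lam)
    print(lam, rmse(y_val, w0 + X_val.dot(w)))
```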

Final model & test

  • Retrained on train+val with best λ; evaluated once on test → similar RMSE to val (good generalization).
  • Single-car inference: build a one-row DataFrame from a dict → prepare_x → predict log price → expm1 to get dollars.
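
Single-car inference end to end, assuming prepare_x and the trained w0, w from the sketches above; the example dict is illustrative and must carry every feature prepare_x expects:

```python
import numpy as np
import pandas as pd

car = {
    'make': 'toyota',
    'year': 2015,
    'engine_hp': 268.0,
    # ...remaining features (illustrative keys; must match training columns)
}

df_car = pd.DataFrame([car])      # one-row DataFrame from a dict
X_car = prepare_x(df_car)         # identical pipeline to training
log_price = w0 + X_car.dot(w)[0]  # prediction in log space
price = np.expm1(log_price)       # invert log1p to get dollars
print(f'predicted MSRP: ${price:,.2f}')
```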

What to save for serving

  • w0, w, feature order, category vocab (top-K lists), NA-fill rules, and constants (e.g., reference year for age).
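
One way to bundle these artifacts, assuming the objects from the sketches above (pickle and the dict keys are assumptions; any serialization works):

```python
import pickle

artifacts = {
    'w0': w0,
    'w': w,
    'features': base + ['age'],  # fixed column order (plus one-hot names)
    'top_k': top_k,              # category vocab, frozen from training
    'na_fill': 0,                # NA-fill rule used in prepare_x
    'reference_year': 2017,      # constant for the age feature
}

with open('model.bin', 'wb') as f:
    pickle.dump(artifacts, f)
```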

Pitfalls you handled

  • Target long tail → log transform.
  • Data leakage → kept test unseen until the end.
  • Feature drift → froze vocab/order in prepare_x.
  • Multicollinearity → ridge regularization.
  • Objective eval → RMSE on validation, then confirm on test.

Next steps

  • Try better missing-value strategies (median/mean per group).
  • Add interactions/nonlinearities or switch to tree-based models (Random Forest, GBMs).
  • Migrate to scikit-learn pipelines for cleaner training/serving.