Module 2 Recap — Regression Project

Goal

Predict car MSRP from tabular features (make, model, year, specs, etc.).

Data prep & EDA

  • Cleaning: lowercased column names, replaced spaces with underscores; normalized string values.
  • EDA: MSRP had a long tail → applied log1p(msrp) to stabilize variance and help models.
  • Missing values: identified with isna().sum(); for baseline, filled numeric NAs with 0 (simple, not always ideal).
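
A minimal sketch of the cleaning and target transform above, assuming the course's car-price dataset; the file path and exact column names are assumptions:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')  # path is an assumption

# Lowercase column names and replace spaces with underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Normalize string values the same way
string_cols = list(df.dtypes[df.dtypes == 'object'].index)
for col in string_cols:
    df[col] = df[col].str.lower().str.replace(' ', '_')

# Long-tailed target -> log1p stabilizes variance
y = np.log1p(df.msrp.values)

# Count missing values per column
print(df.isna().sum())
```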

Validation framework

  • Split into train / val / test (60/20/20).
  • Wrote a reusable prepare_x(df) (sketched after this list) to:

    • Build numeric features (+ age = 2017 - year).
    • Add chosen categorical one-hots (e.g., top-K per category).
    • Fill NAs consistently.
    • Return NumPy arrays with fixed column order.
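
A sketch of the split plus a skeletal prepare_x, assuming the cleaned df above; the seed value, the base column list, and the 2017 reference year follow the course setup but are otherwise assumptions:

```python
import numpy as np

# Shuffle once with a fixed seed, then slice 60/20/20
n = len(df)
n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - n_val - n_test

idx = np.arange(n)
np.random.seed(2)  # seed value is an assumption
np.random.shuffle(idx)
df_shuffled = df.iloc[idx].reset_index(drop=True)

df_train = df_shuffled.iloc[:n_train]
df_val = df_shuffled.iloc[n_train:n_train + n_val]
df_test = df_shuffled.iloc[n_train + n_val:]

y_train = np.log1p(df_train.msrp.values)
y_val = np.log1p(df_val.msrp.values)

base = ['engine_hp', 'engine_cylinders', 'highway_mpg',
        'city_mpg', 'popularity']  # numeric columns (assumed names)

def prepare_x(df):
    df = df.copy()
    df['age'] = 2017 - df.year  # dataset reference year
    features = base + ['age']
    # ...one-hot columns are appended here in a fixed order (see below)
    return df[features].fillna(0).values

X_train = prepare_x(df_train)
X_val = prepare_x(df_val)
```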

Linear regression, from scratch

  • Started with scalar form → vector dot product → full matrix × vector form.
  • Trained via the normal equation: $w=(X^\top X)^{-1}X^\top y$ (implementation sketched below).
  • Baseline used only numeric features → OK but underfit.
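
A from-scratch trainer via the normal equation, assuming the arrays from the sketch above; prepending a bias column of ones makes w0 part of the same solve:

```python
import numpy as np

def train_linear_regression(X, y):
    # Prepend a column of ones so the intercept is learned jointly
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # Normal equation: w = (X^T X)^{-1} X^T y
    XTX = X.T.dot(X)
    w_full = np.linalg.inv(XTX).dot(X.T).dot(y)
    return w_full[0], w_full[1:]  # w0, feature weights

w0, w = train_linear_regression(X_train, y_train)
y_pred = w0 + X_val.dot(w)
```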

Metric

  • RMSE on the validation set (computed in log space to match the target transform).
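
The metric in code; both arguments are already in log space, so no back-transform is needed before comparing models:

```python
import numpy as np

def rmse(y, y_pred):
    # Root mean squared error on log1p(msrp) values
    se = (y - y_pred) ** 2
    return np.sqrt(se.mean())

print(rmse(y_val, y_pred))
```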

Feature engineering

  • Age feature gave a big improvement.
  • Added categoricals with one-hot encoding (e.g., number_of_doors, make, fuel, transmission, driven_wheels, size, style, market_category); see the sketch after this list.
  • First attempt blew up (huge weights and RMSE): redundant, highly correlated one-hot columns made $X^\top X$ singular or near-singular, so inverting it was numerically unstable.
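
A sketch of the manual top-K one-hot step, assuming K=5 and the (abbreviated) column names in the bullet above; the vocab is built from the training split only and frozen:

```python
categorical = ['make', 'fuel', 'transmission', 'driven_wheels',
               'size', 'style', 'market_category']  # abbreviated names,
                                                    # as in the bullet above

# Top-K most frequent values per category, from the training split only
top_k = {col: list(df_train[col].value_counts().head(5).index)
         for col in categorical}

def add_one_hot(df, features):
    # Called from prepare_x: append binary columns in a fixed order
    for col, values in top_k.items():
        for v in values:
            name = f'{col}_{v}'
            df[name] = (df[col] == v).astype(int)
            features.append(name)
```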

Regularization (Ridge)

  • Fixed the instability by adding λ to the diagonal of $X^\top X$: $w=(X^\top X+\lambda D)^{-1}X^\top y$, where $D$ is the identity matrix with a 0 in the intercept position (no penalty on the intercept).
  • Tuned λ on the validation set over a grid of values; several were close, so picked a small λ that stabilized training and slightly improved RMSE (tuning loop sketched below).
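
A sketch of the ridge variant and tuning loop, reusing rmse and the arrays from above; $D$ is built as the identity with its intercept entry zeroed, and the λ grid is illustrative:

```python
import numpy as np

def train_linear_regression_reg(X, y, lam):
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    XTX = X.T.dot(X)
    D = np.eye(XTX.shape[0])
    D[0, 0] = 0.0  # no penalty on the intercept
    w_full = np.linalg.inv(XTX + lam * D).dot(X.T).dot(y)
    return w_full[0], w_full[1:]

# Pick lambda on the validation set (grid is illustrative)
for lam in [1e-6, 1e-4, 1e-3, 1e-2, 0.1, 1, 10]:
    w0, w = train_linear_regression_reg(X_train, y_train, lam)
    print(lam, rmse(y_val, w0 + X_val.dot(w)))
```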

Final model & test

  • Retrained on train+val with best λ; evaluated once on test → similar RMSE to val (good generalization).
  • Single-car inference: build a one-row DataFrame from a dict → prepare_x → predict log price → expm1 to get dollars.
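
Single-car inference end to end, assuming prepare_x and the trained w0, w from the sketches above; the example dict is illustrative and must carry every feature prepare_x expects:

```python
import numpy as np
import pandas as pd

car = {
    'make': 'toyota',
    'year': 2015,
    'engine_hp': 268.0,
    # ...remaining features (illustrative keys; must match training columns)
}

df_car = pd.DataFrame([car])      # one-row DataFrame from a dict
X_car = prepare_x(df_car)         # identical pipeline to training
log_price = w0 + X_car.dot(w)[0]  # prediction in log space
price = np.expm1(log_price)       # invert log1p to get dollars
print(f'predicted MSRP: ${price:,.2f}')
```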

What to save for serving

  • w0, w, feature order, category vocab (top-K lists), NA-fill rules, and constants (e.g., reference year for age).
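
One way to bundle these artifacts, assuming the objects from the sketches above (pickle and the dict keys are assumptions; any serialization works):

```python
import pickle

artifacts = {
    'w0': w0,
    'w': w,
    'features': base + ['age'],  # fixed column order (plus one-hot names)
    'top_k': top_k,              # category vocab, frozen from training
    'na_fill': 0,                # NA-fill rule used in prepare_x
    'reference_year': 2017,      # constant for the age feature
}

with open('model.bin', 'wb') as f:
    pickle.dump(artifacts, f)
```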

Pitfalls you handled

  • Target long tail → log transform.
  • Data leakage → kept test unseen until the end.
  • Feature drift → froze vocab/order in prepare_x.
  • Multicollinearity → ridge regularization.
  • Objective eval → RMSE on validation, then confirm on test.

Next steps

  • Try better missing-value strategies (median/mean per group).
  • Add interactions/nonlinearities or switch to tree-based models (Random Forest, GBMs).
  • Migrate to scikit-learn pipelines for cleaner training/serving.