Training the Model (Normal Equation)
In the last lesson we learned how to apply a linear model once the weights are known. Today we answer the question: how do we find those weights from data?
We’ll start from the model in vector/matrix form, explain why the “naïve inverse” doesn’t work, derive the normal equation, handle the bias (intercept) correctly, and implement a clean NumPy training routine. We’ll finish with practical notes on stability and what to do when things go wrong.
1) Recap and objective
Our model predicts the target (here, `log1p(msrp)`) with a linear function of the features:
\[\hat{\mathbf{y}} \;=\; \mathbf{X}\,\tilde{\mathbf{w}},\]
where:
- $\mathbf{X} \in \mathbb{R}^{m \times (n+1)}$ is the design matrix (features for $m$ examples). The first column is all ones for the intercept, followed by $n$ feature columns.
- $\tilde{\mathbf{w}} \in \mathbb{R}^{(n+1)}$ stacks the bias $w_0$ and the $n$ weights $w_1,\dots,w_n$.
- $\hat{\mathbf{y}}$ are predictions in log space.
Goal: choose $\tilde{\mathbf{w}}$ so that predictions are as close as possible to the true targets $\mathbf{y}$ on the training set.
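Written out for a single example $i$, the same model reads
\[\hat{y}_i \;=\; w_0 + w_1 x_{i1} + \cdots + w_n x_{in} \;=\; \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}_i,\]
where $\tilde{\mathbf{x}}_i = (1, x_{i1}, \dots, x_{in})^\top$ is row $i$ of $\mathbf{X}$, with the leading 1 multiplying the intercept.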
2) Why we don’t “just invert X”
If we (wrongly) treated $\mathbf{X}\,\tilde{\mathbf{w}}=\mathbf{y}$ as a square linear system, we might try $\tilde{\mathbf{w}}=\mathbf{X}^{-1}\mathbf{y}$. But $\mathbf{X}$ is almost always rectangular (many rows, fewer columns), so $\mathbf{X}^{-1}$ doesn’t exist.
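For intuition, here is a tiny overdetermined system (the numbers are arbitrary, purely for illustration): three data points, one feature plus an intercept, so three equations in two unknowns:
\[
\begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{pmatrix}
\begin{pmatrix} w_0 \\ w_1 \end{pmatrix}
=
\begin{pmatrix} 1 \\ 2 \\ 2 \end{pmatrix}.
\]
No choice of $(w_0, w_1)$ satisfies all three equations exactly (the three points do not lie on one line), so an exact solution does not exist and we have to settle for the best approximate one.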
What we actually want is the closest solution in the least-squares sense:
\[\min_{\tilde{\mathbf{w}}} \ \|\mathbf{X}\,\tilde{\mathbf{w}} - \mathbf{y}\|_2^2\]
This gives the classic normal equation.
3) Normal equation (least squares solution)
Multiply both sides of $\mathbf{X}\,\tilde{\mathbf{w}} \approx \mathbf{y}$ by $\mathbf{X}^\top$:
\[\mathbf{X}^\top \mathbf{X}\,\tilde{\mathbf{w}} \;=\; \mathbf{X}^\top \mathbf{y}\]
If $\mathbf{X}^\top \mathbf{X}$ is invertible (it is square, $(n{+}1)\times(n{+}1)$), the minimizer is
\[\boxed{\;\tilde{\mathbf{w}} \;=\; (\mathbf{X}^\top \mathbf{X})^{-1}\,\mathbf{X}^\top \mathbf{y}\;}\]
This $\tilde{\mathbf{w}}$ is the least-squares solution (proofs in standard texts such as The Elements of Statistical Learning).
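The "multiply by $\mathbf{X}^\top$" step is really the first-order optimality condition of the least-squares objective: setting its gradient with respect to $\tilde{\mathbf{w}}$ to zero yields exactly the normal equation,
\[
\nabla_{\tilde{\mathbf{w}}}\,\|\mathbf{X}\,\tilde{\mathbf{w}} - \mathbf{y}\|_2^2
\;=\; 2\,\mathbf{X}^\top\!\left(\mathbf{X}\,\tilde{\mathbf{w}} - \mathbf{y}\right)
\;=\; \mathbf{0}
\quad\Longleftrightarrow\quad
\mathbf{X}^\top \mathbf{X}\,\tilde{\mathbf{w}} \;=\; \mathbf{X}^\top \mathbf{y}.
\]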
Practical note: even when $\mathbf{X}^\top\mathbf{X}$ is invertible, don't literally invert matrices in code. Solve the linear system instead; it is numerically more stable.
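As a quick, self-contained illustration of that note (the random matrices are stand-ins, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(100, 5))        # stand-in for an augmented design matrix
b = rng.normal(size=100)             # stand-in for the targets

G = A.T @ A                          # Gram matrix, (5, 5)
g = A.T @ b                          # right-hand side, (5,)

w_via_inv = np.linalg.inv(G) @ g     # works, but forms an explicit inverse
w_via_solve = np.linalg.solve(G, g)  # solves the system directly; preferred

print(np.allclose(w_via_inv, w_via_solve))  # True here, but solve() is the safer habit
```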
4) Handling the intercept (bias) cleanly
You have two equivalent options:
- Augment $\mathbf{X}$ with a leading column of ones and learn $w_0$ as part of $\tilde{\mathbf{w}}$ (what we’ll do here), or
- Let a library handle the intercept (e.g., scikit-learn's `fit_intercept=True`).
Never do both.
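If you want to convince yourself that the two options agree, here is a minimal sketch (it assumes scikit-learn is installed; the synthetic data is only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Option 1: add the ones column ourselves, tell the library NOT to add another
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
lr1 = LinearRegression(fit_intercept=False).fit(X_aug, y)
w0_1, w_1 = lr1.coef_[0], lr1.coef_[1:]

# Option 2: let the library handle the intercept
lr2 = LinearRegression(fit_intercept=True).fit(X, y)
w0_2, w_2 = lr2.intercept_, lr2.coef_

print(np.isclose(w0_1, w0_2), np.allclose(w_1, w_2))  # both True
```

With the plain normal-equation approach, doing both would put two identical ones columns into the design matrix, which makes $\mathbf{X}^\top\mathbf{X}$ singular.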
5) Minimal NumPy implementation (normal equation)
We’ll write a small trainer that:
- adds the bias column,
- builds the Gram matrix $G=\mathbf{X}^\top\mathbf{X}$,
- solves $G\,\tilde{\mathbf{w}}=\mathbf{X}^\top\mathbf{y}$.
```python
import numpy as np


def add_bias_column(X):
    """Add an intercept column of ones to feature matrix X: (m, n) -> (m, n+1)."""
    m = X.shape[0]
    return np.hstack([np.ones((m, 1)), X])


def train_linear_regression_normal(X, y):
    """
    Train linear regression via the normal equation.

    X : (m, n) feature matrix (no bias column)
    y : (m,) target vector (e.g., log1p(msrp))

    Returns:
        w0 : float bias (intercept)
        w  : (n,) np.ndarray weights for the n features
    """
    X_aug = add_bias_column(X)          # (m, n+1)
    XtX = X_aug.T @ X_aug               # (n+1, n+1)
    Xty = X_aug.T @ y                   # (n+1,)
    # Prefer solve() to inv() for numerical stability
    w_aug = np.linalg.solve(XtX, Xty)   # (n+1,)
    w0, w = w_aug[0], w_aug[1:]
    return w0, w


def predict_linear(X, w0, w):
    """Predict in the model's target space (here: log1p)."""
    return w0 + X @ w


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```
Usage with our pipeline (log target):
```python
# X_train, y_train are your training features/targets (y is log1p(msrp))
w0, w = train_linear_regression_normal(X_train, y_train)

# Validation in log space
y_val_pred_log = predict_linear(X_val, w0, w)
val_rmse_log = rmse(y_val, y_val_pred_log)

# Or convert to dollars for human-readable error
y_val_pred = np.expm1(y_val_pred_log)
y_val_true = np.expm1(y_val)  # invert the validation targets too
val_rmse_dollars = rmse(y_val_true, y_val_pred)
```
Be explicit about which space your metric lives in (log vs dollars) and stick to it consistently.
6) Shapes and sanity checks
- $\mathbf{X}$: $(m, n)$ (no bias column yet)
- $\mathbf{X}_{\text{aug}}$: $(m, n{+}1)$ after adding ones
- $\tilde{\mathbf{w}}$: $(n{+}1,)$
- Predictions: $(m,)$
Common pitfalls:
- Mismatched shapes due to missing bias column.
- Object dtypes sneaking into `X` (ensure floats).
- Forgetting to invert predictions with `expm1` when reporting in dollars.
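A few cheap assertions catch most of these before training. Here is a minimal sketch (the helper name `sanity_check` is just for illustration):

```python
import numpy as np

def sanity_check(X, y):
    """Cheap checks for the pitfalls above: shapes, dtypes, finite values."""
    X = np.asarray(X)
    y = np.asarray(y)
    assert X.ndim == 2, f"X should be 2-D (m, n), got shape {X.shape}"
    assert y.ndim == 1, f"y should be 1-D (m,), got shape {y.shape}"
    assert X.shape[0] == y.shape[0], "X and y must have the same number of rows"
    assert X.dtype != object, "X has object dtype; convert features to floats"
    assert np.isfinite(X).all() and np.isfinite(y).all(), "found NaN/inf values"
    return X.astype(float), y.astype(float)

# X_train, y_train = sanity_check(X_train, y_train)
```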
7) Numerical stability & when the inverse “doesn’t exist”
`np.linalg.solve(XtX, Xty)` needs $\mathbf{X}^\top\mathbf{X}$ to be non-singular (full rank). Problems arise when features are:
- Collinear (one is a linear combination of others),
- Duplicated, or
- On wildly different scales (ill-conditioning).
Fixes:
- Use the pseudo-inverse: `w_aug = np.linalg.pinv(X_aug) @ y` (robust to a singular `XtX`).
- Use `np.linalg.lstsq`: `w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)` (preferred in many cases).
- Add regularization (Ridge): solve $(\mathbf{X}^\top \mathbf{X} + \lambda I)\,\tilde{\mathbf{w}} = \mathbf{X}^\top \mathbf{y}$. We'll cover this soon (a sketch follows below).
Scaling features (e.g., standardization) also improves conditioning and numerical behavior.
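As a sketch of the regularized variant of our trainer (the function name and the default `lam` are placeholders; we'll treat $\lambda$ properly when we cover Ridge):

```python
import numpy as np

def train_linear_regression_reg(X, y, lam=0.0):
    """Normal equation with optional L2 (ridge) regularization: (XtX + lam*I) w = Xty."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # add bias column
    XtX = X_aug.T @ X_aug
    XtX = XtX + lam * np.eye(XtX.shape[0])             # lam=0.0 recovers plain least squares
    w_aug = np.linalg.solve(XtX, X_aug.T @ y)
    return w_aug[0], w_aug[1:]

# Alternatives when XtX is singular or badly conditioned:
# w_aug = np.linalg.pinv(X_aug) @ y
# w_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
```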
8) End-to-end training checklist
- Use only the training split to fit $w_0, w$.
- Predict on validation; compute RMSE (either in log space or dollars).
- Keep the test split untouched until the very end.
- If you trained on `log1p(msrp)`, remember to use `np.expm1` before reporting dollar errors.
9) What’s next
You now have a working training routine for linear regression via the normal equation. Next, we’ll:
- compare with iterative solvers (gradient descent),
- add regularization to handle collinearity and stabilize solutions,
- and examine how these choices affect validation RMSE.