Linear Regression in Vector & Matrix Form
In the last lesson we wrote linear regression for a single example as a sum of feature contributions plus a bias. Now we’ll generalize that idea into a compact vector form (dot product) and then to the matrix–vector form that works on the entire dataset at once. This lets us compute predictions efficiently and reason about dimensions clearly.
1) Quick recap: single example, scalar form
For one car (one row of features) with $n$ features, we wrote the prediction as
\[\hat{y}_i \;=\; w_0 \;+\; \sum_{j=1}^{n} w_j\,x_{ij}\]
- $x_{ij}$: the $j$-th feature of the $i$-th example
- $w_0$: bias (intercept)
- $w_j$: weight for feature $j$
This is the scalar version: one number plus a sum over feature–weight products.
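As a concrete sketch of this scalar form (all numbers below are made up for illustration), the prediction is just the bias plus a loop over feature–weight products:

```python
import numpy as np

# One example with n = 3 features; the values and weights are made up.
x_i = np.array([85.0, 2.5, 160.0])   # features of example i
w = np.array([0.01, 0.3, 0.002])     # one weight per feature
w0 = 7.1                             # bias (intercept)

# Scalar form: w0 + sum_j w_j * x_ij
pred_i = w0
for j in range(len(x_i)):
    pred_i += w[j] * x_i[j]

print(pred_i)   # a single number: the prediction for example i
```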
2) Dot product view (vector form)
Group the features of example $i$ into a feature vector $\mathbf{x}_i \in \mathbb{R}^n$ and the weights into $\mathbf{w} \in \mathbb{R}^n$. Then the sum becomes a dot product:
\[\hat{y}_i \;=\; w_0 \;+\; \mathbf{w}^\top \mathbf{x}_i\]
Same computation, just cleaner notation. In code, if x_i and w are 1-D arrays of length n, then:
pred_i = w0 + np.dot(w, x_i) # or: w0 + x_i @ w
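A quick self-contained sanity check (same made-up numbers as the sketch above) that the explicit loop and the dot product agree:

```python
import numpy as np

x_i = np.array([85.0, 2.5, 160.0])   # made-up features
w = np.array([0.01, 0.3, 0.002])     # made-up weights
w0 = 7.1                             # made-up bias

loop_pred = w0 + sum(w[j] * x_i[j] for j in range(len(x_i)))
dot_pred = w0 + np.dot(w, x_i)       # equivalently: w0 + x_i @ w
assert np.isclose(loop_pred, dot_pred)
```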
3) Fold the bias into the dot product
The bias “hangs outside” the dot product. We can incorporate it by augmenting the feature vector with a leading 1 and the weight vector with the bias:
\[\tilde{\mathbf{x}}_i \;=\; \begin{bmatrix} 1 \\ \mathbf{x}_i \end{bmatrix} \in \mathbb{R}^{n+1}, \quad \tilde{\mathbf{w}} \;=\; \begin{bmatrix} w_0 \\ \mathbf{w} \end{bmatrix} \in \mathbb{R}^{n+1}\]
Now:
\[\hat{y}_i \;=\; \tilde{\mathbf{w}}^\top \tilde{\mathbf{x}}_i\]
This trick simplifies code because all predictions become pure dot products.
Implementation pattern:
# x_i: shape (n,)
x_i_aug = np.concatenate(([1.0], x_i)) # shape (n+1,)
pred_i = x_i_aug @ w_aug # w_aug includes bias
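To convince yourself the augmentation trick changes nothing, here is a small check with made-up numbers:

```python
import numpy as np

x_i = np.array([85.0, 2.5, 160.0])   # made-up features
w = np.array([0.01, 0.3, 0.002])     # made-up weights
w0 = 7.1                             # made-up bias

pred_separate = w0 + x_i @ w             # bias outside the dot product

x_i_aug = np.concatenate(([1.0], x_i))   # (n+1,): leading 1
w_aug = np.concatenate(([w0], w))        # (n+1,): bias first
pred_augmented = x_i_aug @ w_aug         # pure dot product

assert np.isclose(pred_separate, pred_augmented)
```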
4) From one row to all rows: matrix–vector form
Stack all $m$ augmented feature vectors $\tilde{\mathbf{x}}_i^\top$ as rows of a matrix $\mathbf{X} \in \mathbb{R}^{m \times (n+1)}$. Then predictions for all examples are a single matrix–vector product:
\[\hat{\mathbf{y}} \;=\; \mathbf{X}\,\tilde{\mathbf{w}} \quad\text{with}\quad \mathbf{X}= \begin{bmatrix} 1 & x_{11} & \dots & x_{1n} \\ 1 & x_{21} & \dots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{m1} & \dots & x_{mn} \end{bmatrix}\]
- Shapes: $\mathbf{X}$ is $(m \times (n{+}1))$, $\tilde{\mathbf{w}}$ is $((n{+}1) \times 1)$, so $\hat{\mathbf{y}}$ is $(m \times 1)$.
NumPy pattern:
# X: (m, n) raw features
m = X.shape[0]
X_aug = np.hstack([np.ones((m, 1)), X])   # (m, n+1)
y_hat = X_aug @ w_aug                     # (m,)
That’s the whole model application in one line.
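As a self-contained check with random placeholder data (shapes chosen arbitrarily), the single matrix–vector product matches a row-by-row loop:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 3                              # arbitrary small shapes
X = rng.normal(size=(m, n))              # raw features, no bias column
w_aug = rng.normal(size=n + 1)           # bias + n weights (placeholders)

X_aug = np.hstack([np.ones((m, 1)), X])  # (m, n+1)
y_hat = X_aug @ w_aug                    # all predictions at once: (m,)

# Row-by-row equivalent for comparison.
y_loop = np.array([w_aug[0] + X[i] @ w_aug[1:] for i in range(m)])
assert np.allclose(y_hat, y_loop)
```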
5) Minimal reference implementation
Below is a tiny, self-contained “predict” path that mirrors the math above. (Training will set w_aug; here we only show forward prediction.)
import numpy as np

def add_bias_column(X):
    """Add a leading column of ones to X (m, n) -> (m, n+1)."""
    m = X.shape[0]
    return np.hstack([np.ones((m, 1)), X])

def predict_matrix(X, w_aug):
    """
    X: (m, n) raw features (without bias column)
    w_aug: (n+1,) weights including bias
    returns: (m,) predictions in the model's target space
    """
    X_aug = add_bias_column(X)   # (m, n+1)
    return X_aug @ w_aug         # (m,)
For a single row:
def predict_row(x_i, w_aug):
    """
    x_i: (n,) one example
    w_aug: (n+1,)
    """
    x_i_aug = np.concatenate(([1.0], x_i))   # (n+1,)
    return x_i_aug @ w_aug
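A short usage sketch of the two functions above. The weights here are random placeholders (training hasn’t happened yet); it reuses numpy as np and the definitions from the block above, and simply confirms that the batch path and the per-row path agree:

```python
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(4, 3))   # placeholder feature matrix (m=4, n=3)
w_aug_demo = rng.normal(size=4)    # placeholder bias + 3 weights

batch_preds = predict_matrix(X_demo, w_aug_demo)                    # (4,)
row_preds = np.array([predict_row(x, w_aug_demo) for x in X_demo])  # (4,)
assert np.allclose(batch_preds, row_preds)
```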
6) Working with log-transformed targets
If your target is $y = \log(1+\text{msrp})$ (as in previous lessons), the model’s output is in log space. Convert back to dollars with:
y_pred_log = predict_matrix(X_val, w_aug)
y_pred = np.expm1(y_pred_log) # inverse of log1p
Be consistent: decide whether you’ll compute metrics in log space (e.g., RMSE of log targets) or in dollars (invert first, then compute RMSE). Both are legitimate; just be explicit.
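One way to make that choice explicit in code, as a sketch: it assumes X_val, y_val_log (log-transformed validation targets), and w_aug already exist from earlier steps, reuses predict_matrix from section 5, and uses an ad-hoc RMSE helper.

```python
def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# X_val, y_val_log, w_aug are assumed to exist from earlier steps.
y_pred_log = predict_matrix(X_val, w_aug)   # predictions in log space

# Option A: metric in log space.
rmse_log = rmse(y_val_log, y_pred_log)

# Option B: metric in dollars (invert log1p on both sides first).
rmse_dollars = rmse(np.expm1(y_val_log), np.expm1(y_pred_log))
```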
7) Shapes, pitfalls, and good habits
- Bias handling: Add the column of ones yourself and set fit_intercept=False in libraries, or let the library handle the intercept. Don’t do both.
- Shape mismatches: A common error is mixing row vectors and column vectors. Prefer 2-D feature matrices (m, n) and 1-D weight arrays (n,) (or (n+1,) for augmented). Use the @ operator; avoid ambiguous broadcasting.
- Copy vs view: When you create X_aug, use concatenation/stacking (as above) to avoid mutating the original X.
- Float dtypes: Ensure numeric columns are float before matrix multiplies; implicit object dtypes will slow or break operations. (A defensive conversion and shape checks are sketched below.)
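For the dtype and shape points above, a few defensive lines before the multiply are cheap. This is a sketch with made-up values; X_raw and w_aug are hypothetical placeholders for your raw features and weights:

```python
import numpy as np

# Hypothetical raw input: array-like values that may not be float yet.
X_raw = [[85, 2.5, 160], [120, 3.0, 190]]
w_aug = np.array([7.1, 0.01, 0.3, 0.002])   # bias + one weight per feature

X = np.asarray(X_raw, dtype=np.float64)     # force a numeric float dtype

# Fail early on shape problems instead of relying on broadcasting.
assert X.ndim == 2, "expected a 2-D feature matrix of shape (m, n)"
assert w_aug.shape == (X.shape[1] + 1,), "expected 1-D weights: bias + n"

y_hat = np.hstack([np.ones((X.shape[0], 1)), X]) @ w_aug   # (m,)
```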
8) Why this matters
- The dot-product and matrix–vector view is not just notation—it unlocks vectorized computation, which is orders of magnitude faster than Python loops.
- Once in matrix form, training (solving for $\tilde{\mathbf{w}}$) becomes a linear-algebra problem, which we’ll tackle next (normal equations, gradient methods, and the role of regularization).
9) Where the weights come from (teaser)
We haven’t set the values of $\tilde{\mathbf{w}}$ yet. That’s the learning step: choose $\tilde{\mathbf{w}}$ to minimize a loss (typically mean squared error in log space) on the training split. In the next lesson we’ll derive and implement that, then evaluate on validation.