Linear Regression
In this lesson we introduce linear regression and build up from an intuitive, single-example view to the compact mathematical form you’ll use in code. We’ll also connect predictions in log space back to dollars using expm1, consistent with the log1p(msrp) target we prepared earlier.
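For reference, a minimal sketch of that target preparation (the frame name df_train and the example prices are assumptions, stand-ins for the training split prepared earlier):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training split prepared earlier.
df_train = pd.DataFrame({"msrp": [2000, 27150, 54990]})

y_train = np.log1p(df_train.msrp.values)   # target in log space: log(1 + msrp)
# At prediction time, np.expm1 inverts this and recovers dollars.
```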
What linear regression is (and when to use it)
- Problem type: regression — the output is a number (here: car price).
- Idea: predict a target $y$ from input features $x$ using a linear function of those features.
At a high level:
\[\hat{y} = g(x) \quad\text{with}\quad g \text{ chosen to be linear}\]
Because we train on $\log(1 + \text{msrp})$, our model will output values in log space. We’ll invert with $\operatorname{expm1}$ at the end to report prices in dollars.
Start with one car (single-example view)
Take a single row from the training data (we only use the training split to fit and illustrate the model):
- Engine horsepower: $x_1 = 453$
- City MPG: $x_2 = 11$
- Popularity: $x_3 = 86$
Collect these into a feature vector $\mathbf{x}_i = (x_{i1}, x_{i2}, x_{i3})$.
We want a function $g$ that maps this $\mathbf{x}_i$ to a predicted log-price that’s close to the true (log) price.
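In code this row is just a small array; a sketch with the values above:

```python
import numpy as np

# One training car: [engine_hp, city_mpg, popularity]
xi = np.array([453, 11, 86])
```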
The linear model, first as a sum
Linear regression assumes:
\[\hat{y}_i \;=\; w_0 \;+\; w_1 x_{i1} \;+\; w_2 x_{i2} \;+\; w_3 x_{i3}\]
- $w_0$ is the bias (intercept): the baseline prediction if we “knew nothing” about the car.
- $w_j$ is the weight for feature $x_{ij}$: how much the prediction changes when that feature increases by 1 unit, holding others fixed.
Compactly, for $n$ features:
\[\hat{y}_i \;=\; w_0 \;+\; \sum_{j=1}^{n} w_j x_{ij}\]
Vector form (you’ll implement this in code shortly):
\[\hat{y}_i \;=\; w_0 \;+\; \mathbf{w}^\top \mathbf{x}_i\]
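A minimal sketch of this form, assuming xi is a numpy array like the one above and w holds the per-feature weights:

```python
import numpy as np

def g(xi, w0, w):
    # w0 + w1*x1 + ... + wn*xn, written as a dot product
    return w0 + np.dot(w, xi)
```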
A concrete plug-in example
Suppose (just for illustration) we pick:
- $w_0 = 7.17$
- $w_1 = 0.01$ (horsepower)
- $w_2 = 0.04$ (city MPG)
- $w_3 = 0.002$ (popularity)
Then for our car:
\[\hat{y}_i = 7.17 + 0.01\cdot 453 + 0.04\cdot 11 + 0.002\cdot 86 \approx 12.31\]
This $\hat{y}_i$ is in log space because our target is $\log(1+\text{msrp})$. Convert back to dollars with:
\[\widehat{\text{msrp}} \;=\; \operatorname{expm1}(\hat{y}_i)\]
Numerically, that gives roughly $222,000 for this car, which matches the intuition that high horsepower and low city MPG are associated with costlier cars.
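As a quick check, the same plug-in in code (using the illustrative weights above):

```python
import numpy as np

xi = np.array([453, 11, 86])           # horsepower, city MPG, popularity
w0 = 7.17
w = np.array([0.01, 0.04, 0.002])

y_log = w0 + np.dot(w, xi)             # ~12.31, in log(1 + msrp) space
price = np.expm1(y_log)                # back to dollars: roughly 222,000
```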
Notes on interpretation
- Bias $w_0$: the baseline (log) price with all features at zero (not literally meaningful if zero is outside the data range, but still useful mathematically).
- Weights: positive $w_j$ means “more of $x_j$ → higher predicted price (in log space)”; negative means the opposite. Magnitudes are on the log scale; a unit change in $x_j$ adds $w_j$ to $\log(1+\text{price})$, i.e., a multiplicative change in price after exponentiation.
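For example, with $w_1 = 0.01$ for horsepower, one extra horsepower adds $0.01$ to $\log(1+\text{price})$, which multiplies $1+\text{price}$ by $e^{0.01} \approx 1.01$, i.e., roughly a 1% higher predicted price.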
From one row to the whole training matrix
Let $\mathbf{X}$ be the feature matrix (rows = cars, columns = features) and $\mathbf{y}$ be the vector of targets (here, $y_i = \log(1+\text{msrp}_i)$).
Across all training examples:
\[\hat{\mathbf{y}} \;=\; w_0 \cdot \mathbf{1} \;+\; \mathbf{X}\mathbf{w}\]
Training linear regression means “find $w_0, \mathbf{w}$ that minimize error” between $\hat{\mathbf{y}}$ and $\mathbf{y}$ on the training set, typically by minimizing mean squared error in log space. In code you’ll either:
- implement the math directly (normal equations / gradient methods; see the sketch after this list), or
- use a library routine and focus on data prep + evaluation.
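For the first option, here is a minimal normal-equations sketch, assuming X is a numpy feature matrix for the training rows and y is the vector of log prices (a sketch, not necessarily the exact routine you’ll use later):

```python
import numpy as np

def train_linear_regression(X, y):
    # Prepend a column of ones so the bias w0 is learned with the other weights.
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # Normal equations: w = (X^T X)^{-1} X^T y
    XTX = X.T.dot(X)
    w_full = np.linalg.inv(XTX).dot(X.T).dot(y)

    return w_full[0], w_full[1:]        # bias w0, weight vector w

def predict(X, w0, w):
    # Vectorized form: y_hat = w0 * 1 + X w
    return w0 + X.dot(w)
```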
Why we stayed in log space (and how to invert)
Because prices have a long right tail, modeling $\log(1+\text{msrp})$ yields:
- a more symmetric, bell-shaped target,
- more stable optimization,
- errors that behave closer to relative errors in the original units.
At prediction time:
- Model outputs $\hat{y} = \log(1+\widehat{\text{msrp}})$.
- Convert to dollars with np.expm1: $\widehat{\text{msrp}} = \operatorname{expm1}(\hat{y})$.
When you evaluate in dollars (e.g., RMSE in $), remember to invert predictions first.
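A sketch of that evaluation, assuming y_val and y_pred are the validation targets and predictions, both in log space:

```python
import numpy as np

def rmse_dollars(y_val, y_pred):
    # Undo the log1p transform first, then compute RMSE in dollars.
    error = np.expm1(y_pred) - np.expm1(y_val)
    return np.sqrt((error ** 2).mean())
```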
What comes next
- Generalize from one example to vectorized code for many rows.
- Fit $w_0, \mathbf{w}$ on the training split, measure RMSE on validation.
- Keep the test split untouched until the end for an unbiased final check.
That’s the full path from the intuitive sum of feature-contributions to a working linear model you can train and evaluate.