Setting Up the Validation Framework
In this lesson, we’ll build a reproducible validation framework for our car-price model. Concretely, we will:
- Split the dataset into train, validation, and test subsets (60% / 20% / 20%).
- Shuffle the rows to break any accidental order in the data.
- Extract targets (`y`) with a log1p transform for stability.
- Remove the target from the feature tables to prevent leakage.
- Reset indices and verify sizes so the pipeline is tidy and reproducible.
This gives us a clean, dependable base to evaluate models before we start tuning or engineering features.
1) Why we split into train/validation/test
- Train: used to fit parameters.
- Validation: used repeatedly to choose models/hyperparameters.
- Test: touched once, at the very end, as an unbiased estimate of performance.
Keeping the test set isolated ensures we don’t “tune to the test” by accident.
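To make this discipline concrete, here is a minimal sketch of the selection loop, using the splits we are about to build and hypothetical `train`/`evaluate` helpers (we'll implement the real ones in the next lesson):

```python
# Hypothetical helpers: train() fits a model, evaluate() returns RMSE.
results = []
for alpha in [0.0, 0.01, 0.1, 1.0]:                       # candidate hyperparameters
    model = train(df_train, y_train, alpha)                # fit on train only
    results.append((evaluate(model, df_val, y_val), alpha))  # score on validation
best_rmse, best_alpha = min(results)                       # choose on validation
final_model = train(df_train, y_train, best_alpha)
test_rmse = evaluate(final_model, df_test, y_test)         # touch test exactly once
```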
2) Decide the split sizes (60/20/20) and compute counts
We’ll compute integer sizes from the total number of rows `n`. Because rounding can cause the sums to drift, compute the train count as the remainder after picking val and test.
```python
import numpy as np
import pandas as pd

# df is your cleaned DataFrame
n = len(df)
n_val = int(round(0.20 * n))
n_test = int(round(0.20 * n))
n_train = n - n_val - n_test  # remainder, so the counts always sum to n

n, n_train, n_val, n_test
```
This guarantees `n_train + n_val + n_test == n`.
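For intuition, a quick worked example with a hypothetical row count (your `n` will differ):

```python
# Worked example with a hypothetical row count:
n = 11914
n_val = int(round(0.20 * n))    # 2383
n_test = int(round(0.20 * n))   # 2383
n_train = n - n_val - n_test    # 7148; 2383 + 2383 + 7148 == 11914
```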
3) Shuffle indices for a fair split
Real datasets often come with structure (e.g., grouped by manufacturer or time). Splitting sequentially can push an entire group into one split (bad). We’ll shuffle row indices and seed the RNG for reproducibility.
```python
np.random.seed(2)       # reproducible splits
idx = np.arange(n)      # 0..n-1
np.random.shuffle(idx)  # in-place shuffling
```
Now take contiguous chunks from the shuffled index for each split:
```python
idx_train = idx[:n_train]
idx_val = idx[n_train:n_train + n_val]
idx_test = idx[n_train + n_val:]
```
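As a defensive habit (not strictly required), you can confirm the three chunks are disjoint and jointly cover every row:

```python
# The three chunks should partition 0..n-1: disjoint and jointly exhaustive
assert set(idx_train).isdisjoint(idx_val)
assert set(idx_train).isdisjoint(idx_test)
assert set(idx_val).isdisjoint(idx_test)
assert len(idx_train) + len(idx_val) + len(idx_test) == n
```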
4) Slice the DataFrame with `iloc` and the shuffled indices
Use `iloc` with the shuffled index arrays to create the three DataFrames:
```python
df_train = df.iloc[idx_train].copy()
df_val = df.iloc[idx_val].copy()
df_test = df.iloc[idx_test].copy()
```
We call `.copy()` so each split owns its data, avoiding chained-assignment pitfalls when we modify the splits later.
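To see why, here is a hedged illustration of the pitfall (the `log_msrp` column is hypothetical, and exact behavior depends on your pandas version and copy-on-write settings):

```python
# Without .copy(), pandas may not know whether the slice shares data with df,
# and a later assignment can raise SettingWithCopyWarning:
df_train_slice = df.iloc[idx_train]                            # no .copy()
df_train_slice['log_msrp'] = np.log1p(df_train_slice['msrp'])  # may warn

# With .copy(), ownership is unambiguous and the assignment is safe:
df_train_safe = df.iloc[idx_train].copy()
df_train_safe['log_msrp'] = np.log1p(df_train_safe['msrp'])
```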
5) Reset the row indices (optional but tidy)
The split DataFrames will carry the original row indices. Resetting makes them clean and easier to inspect:
```python
for part in (df_train, df_val, df_test):
    part.reset_index(drop=True, inplace=True)
```
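After the reset, each split carries a clean `RangeIndex`; `drop=True` discards the old shuffled labels instead of keeping them as a new column:

```python
# drop=True throws away the old shuffled labels; without it, pandas would
# keep them as an extra 'index' column in the frame.
print(df_train.index[:5])   # RangeIndex(start=0, stop=5, step=1)
```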
6) Build the target vectors (`y`) with log1p
We will predict `msrp`. As discussed in the previous lesson, `msrp` has a long right tail; applying a log transform (`np.log1p`) compresses extremes and often stabilizes regression.
```python
y_train = np.log1p(df_train['msrp']).values
y_val = np.log1p(df_val['msrp']).values
y_test = np.log1p(df_test['msrp']).values
```
Notes:
- We store them as NumPy arrays (models expect arrays).
- Later, when we predict, we’ll invert with `np.expm1`.
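A quick round trip with made-up prices shows the transform is invertible (the values here are purely illustrative):

```python
prices = np.array([2000.0, 27150.0, 500000.0])  # hypothetical raw MSRPs
logged = np.log1p(prices)                        # what the model sees
restored = np.expm1(logged)                      # back to dollars
np.allclose(prices, restored)                    # True, up to float noise
```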
7) Remove the target from the feature tables to avoid leakage
Leaving `msrp` in the feature table is a classic leakage bug (the model “cheats”). Delete it after extracting `y`.
```python
for part in (df_train, df_val, df_test):
    del part['msrp']
```
From here on, the `df_*` frames contain only features, while the `y_*` arrays contain the log-transformed target.
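A one-line defensive check (ours, not required) confirms the target really is gone:

```python
for part in (df_train, df_val, df_test):
    assert 'msrp' not in part.columns  # feature tables must not see the target
```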
8) Sanity checks
Always verify sizes and alignment:
```python
assert len(df_train) == len(y_train) == n_train
assert len(df_val) == len(y_val) == n_val
assert len(df_test) == len(y_test) == n_test
```
If you see mismatches, re-check rounding and slicing boundaries.
9) (Optional) Notes on alternatives and robustness
- Single-call split utilities: Libraries (e.g., scikit-learn’s `train_test_split`) are convenient, but here we implemented the split manually to emphasize the mechanics and reproducibility (see the sketch after this list).
- Stratification in regression: There’s no native “stratify” for continuous targets, but you can bin the target (e.g., quantiles of `msrp`) and stratify on those bins to balance ranges across splits. This is optional and situation-dependent.
- Temporal data: If the data has time order and you plan to forecast, use time-based splits rather than random shuffles. For our static car listings, random splits are fine.
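For reference, a minimal sketch of the library route, assuming scikit-learn is installed. `train_test_split` has no three-way mode, so we chain two calls (0.25 of the remaining 80% is 20% overall):

```python
from sklearn.model_selection import train_test_split

# First carve off the test set, then split the remainder into train/val.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=2)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=2)

# Optional stratification on binned msrp to balance price ranges across splits:
bins = pd.qcut(df['msrp'], q=10, labels=False)
df_full_train, df_test = train_test_split(
    df, test_size=0.2, random_state=2, stratify=bins
)
```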
10) What we have now—and what’s next
We now have:
- `df_train`, `df_val`, `df_test`: feature tables
- `y_train`, `y_val`, `y_test`: log-transformed targets
- A reproducible 60/20/20 split with a fixed seed
- No target leakage in features
Next lesson: we’ll implement linear regression on `df_train` → `y_train`, evaluate with RMSE on `df_val`, and establish a clear baseline before adding feature engineering or regularization.