Model Selection Process
These notes are based on the video ML Zoomcamp 1.5 - Model Selection Process.
Model selection is a crucial step in the machine learning workflow where we evaluate different models to determine which one performs best for our specific problem. This process follows the data preparation step, where we’ve already extracted features from our raw data.
During the model selection phase, we:
- Try different model types (logistic regression, decision trees, neural networks, etc.)
- Evaluate their performance systematically
- Select the best performing model for deployment
Simulating Real-World Performance
The Challenge of Evaluation
When deploying a model in production, it will encounter new data it hasn’t seen during training. For example:
- If we train a model in July using historical data
- And deploy it in August to classify new emails as spam or not spam
- We need to know how well it will perform on this unseen August data
The Holdout Method
To simulate this real-world scenario, we use the holdout method:
- Take our complete dataset
- Set aside a portion (e.g., 20%) and “hide” it
- Train our model only on the remaining data (e.g., 80%)
- Evaluate the model on the hidden portion
This hidden portion is called the validation dataset, and it helps us estimate how well our model will perform on new, unseen data.
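A minimal sketch of the holdout split in NumPy (the function name and the 20% default are illustrative; in practice, scikit-learn's train_test_split does the same job):

```python
import numpy as np

def holdout_split(X, y, val_fraction=0.2, seed=42):
    """Hold out a random fraction of the data for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffle before splitting
    n_val = int(len(X) * val_fraction)       # e.g., 20% for validation
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]
```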
Making and Evaluating Predictions
The Prediction Process
- From the training data, we extract:
  - Feature matrix X
  - Target variable y
- Train our model g using X and y
- From the validation data, we extract:
  - Feature matrix X_validation
  - Target variable y_validation (ground truth)
- Apply our trained model g to X_validation to get predictions (y_hat)
- Compare predictions (y_hat) with actual values (y_validation)
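In code, these steps might look like the following sketch (LogisticRegression stands in for the model g; X_train, y_train, X_validation, and y_validation are assumed to come from the holdout split above):

```python
from sklearn.linear_model import LogisticRegression

g = LogisticRegression(max_iter=1000)
g.fit(X_train, y_train)                        # train g on X and y

y_hat = g.predict_proba(X_validation)[:, 1]    # predicted probability of spam
# y_hat can now be compared against y_validation (the ground truth)
```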
Example: Spam Detection Evaluation
For a spam detection model that outputs probabilities:
- Predictions might be: 0.8, 0.7, 0.6, 0.1, 0.9, 0.6
- Actual values might be: 1, 0, 1, 0, 1, 0 (where 1=spam, 0=not spam)
- Using a threshold of 0.5, our predictions become: 1, 1, 1, 0, 1, 1
- Comparing with actual values: correct, incorrect, correct, correct, correct, incorrect
- Accuracy: 4 out of 6 ≈ 66.7%
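This calculation is easy to reproduce (a small sketch using the numbers above):

```python
import numpy as np

y_hat = np.array([0.8, 0.7, 0.6, 0.1, 0.9, 0.6])  # predicted probabilities
y_validation = np.array([1, 0, 1, 0, 1, 0])       # ground truth: 1=spam, 0=not spam

decisions = (y_hat >= 0.5)                        # apply the 0.5 threshold
accuracy = (decisions == y_validation).mean()     # share of correct predictions
print(f'{accuracy:.1%}')                          # 66.7%
```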
Comparing Multiple Models
We can repeat this process for different model types:
- Logistic Regression: 66% accuracy
- Decision Tree: 60% accuracy
- Random Forest: 67% accuracy
- Neural Network: 80% accuracy
Based on these results, we would select the Neural Network as our best model.
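A sketch of how such a comparison could be run with scikit-learn estimators (the accuracy numbers above come from the video, not from this code; X_train, y_train, X_validation, and y_validation are assumed from the earlier split):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

models = {
    'logistic regression': LogisticRegression(max_iter=1000),
    'decision tree': DecisionTreeClassifier(random_state=1),
    'random forest': RandomForestClassifier(random_state=1),
    'neural network': MLPClassifier(max_iter=1000, random_state=1),
}

for name, model in models.items():
    model.fit(X_train, y_train)                    # train on the training data
    acc = model.score(X_validation, y_validation)  # accuracy on the validation data
    print(f'{name}: {acc:.0%}')
```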
The Multiple Comparisons Problem
A Cautionary Example
Consider a scenario where we use random coin flips to “predict” spam:
- Euro coin: 20% accuracy
- Dollar coin: 40% accuracy
- Polish zloty: 20% accuracy
- Russian ruble: 20% accuracy
- Ukrainian hryvnia: 100% accuracy
The Ukrainian hryvnia appears to be perfect, but this is purely by chance. It just happened to produce the exact sequence that matched our validation data.
This illustrates the multiple comparisons problem: when evaluating many models against the same validation dataset, one model might appear superior just by random chance, not because it truly captures the underlying patterns.
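A quick simulation (my own illustration, not from the video) makes this concrete: given enough random "models", one of them will almost certainly match a small validation set perfectly:

```python
import numpy as np

rng = np.random.default_rng(1)
y_validation = rng.integers(0, 2, size=6)     # 6 validation labels

best = 0.0
for _ in range(1000):                          # "evaluate" 1000 random coins
    guesses = rng.integers(0, 2, size=6)       # each coin predicts by flipping
    best = max(best, (guesses == y_validation).mean())

print(best)  # almost always 1.0: some coin matched all 6 labels by chance
```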
The Train-Validation-Test Split
To guard against the multiple comparisons problem, we use a three-way data split:
- Training data (60%): Used to train models
- Validation data (20%): Used to compare different models
- Test data (20%): Used only once to evaluate the final selected model
The process works as follows:
- Split the dataset into three non-overlapping subsets
- Hide the test data completely until the final step
- Train models on the training data
- Evaluate and compare models on the validation data
- Select the best performing model
- Evaluate the selected model on the test data to get an unbiased estimate of its performance
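One common way to produce this 60/20/20 split is with two calls to scikit-learn's train_test_split (a sketch; X and y are the full feature matrix and target):

```python
from sklearn.model_selection import train_test_split

# First split off 20% of the data as the test set (80/20).
X_full_train, X_test, y_full_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Then split the remaining 80% into training and validation:
# 25% of 80% = 20% of the original data.
X_train, X_validation, y_train, y_validation = train_test_split(
    X_full_train, y_full_train, test_size=0.25, random_state=1)
```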
In our example:
- Neural network achieved 80% accuracy on validation data
- When applied to the test data, it achieved 79% accuracy
- This confirms the model is genuinely performing well, not just getting lucky on the validation set
The Complete Model Selection Process
The model selection process can be summarized in six steps:
1. Split the dataset into training, validation, and test sets
2. Train different models on the training data
3. Validate models using the validation dataset
4. Repeat steps 2-3 for various model types
5. Select the best performing model based on validation results
6. Evaluate the best model on the test data for a final performance assessment
Optimizing the Final Model
After selecting the best model type, we can improve its performance by:
- Combining the training and validation datasets (80% of the original data)
- Training a new model of the selected type on this combined dataset
- Evaluating this final model on the test dataset
This approach allows us to use more data for the final model training while still maintaining an unbiased evaluation through the test set.
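A sketch of this final step, reusing the names from the earlier snippets (MLPClassifier stands in for whichever model type won on validation):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Merge training and validation data (together, 80% of the original dataset).
X_full_train = np.concatenate([X_train, X_validation])
y_full_train = np.concatenate([y_train, y_validation])

final_model = MLPClassifier(max_iter=1000, random_state=1)
final_model.fit(X_full_train, y_full_train)   # retrain on the combined data
print(final_model.score(X_test, y_test))      # one-time, unbiased test evaluation
```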