What is Supervised Machine Learning?
These notes are based on the video ML Zoomcamp 1.3 - Supervised Machine Learning
Supervised machine learning is a branch of machine learning where we teach algorithms by showing them examples. The term “supervised” comes from the fact that we act as teachers or supervisors, guiding the learning process by providing labeled examples.
The Core Concept
In supervised learning:
- We show the algorithm many examples with known outcomes (labels)
- The algorithm learns patterns from these examples
- The algorithm applies these patterns to make predictions on new, unseen examples
Examples We’ve Seen
In previous lessons, we explored:
- Car price prediction: We showed the model different cars with their known prices, allowing it to learn patterns that determine car values
- Spam detection: We showed the model examples of spam and non-spam messages, enabling it to identify patterns that distinguish between them
The Mathematics of Supervised Learning
Supervised learning uses concepts from mathematics and statistics to extract patterns from data.
Formal Notation
We represent our data using:
- Feature Matrix (X): A two-dimensional array where:
- Rows represent observations (examples)
- Columns represent features (characteristics)
- Target Vector (y): A one-dimensional array containing the values we want to predict
For example, in spam detection:
- X would contain features of emails (length, presence of certain words, etc.)
- y would contain labels (1 for spam, 0 for not spam)
The Goal of Supervised Learning
The goal is to find a function g (our model) such that:
g(X) ≈ y
In other words, when we apply our model g to the feature matrix X, it should produce predictions that are as close as possible to our target values y.
The Training Process
The process of finding this function g is called “training” and involves:
- Feeding the feature matrix X into the model
- Comparing the model’s predictions with the actual target values y
- Adjusting the model to minimize the difference between predictions and actual values
Types of Supervised Learning Problems
Based on the nature of the target variable and the output of our model, we can classify supervised learning into different types:
1. Regression Problems
- Output: A continuous numerical value
- Example: Car price prediction (predicting a price in dollars)
- Range: Can be any number within a range (often from -∞ to +∞)
- Other examples: House price prediction, temperature forecasting, stock price prediction
2. Classification Problems
- Output: A category or class label
- Example: Image classification (identifying objects in images)
Classification can be further divided into:
a. Multi-class Classification Problems
- Classifying into more than two categories
- Example: Identifying if an image contains a car, cat, or dog (3 classes)
b. Binary Classification Problems
- Classifying into exactly two categories
- Example: Spam detection (spam or not spam)
- The model often outputs a probability between 0 and 1
- The target variable is typically encoded as 0 or 1
3. Ranking Problems
- Output: An ordered list of items
- Examples:
- Recommender systems (showing products a user might like)
- Search engines (ordering results by relevance)
How Ranking Problems Work?
- The model assigns a score to each item (e.g., probability of user interest)
- Items are sorted by their scores
- Top N items are presented to the user
Examples include:
- E-commerce product recommendations
- Search engine results
- Content recommendation systems
Next Steps
In the next lesson, we’ll explore the bigger picture of organizing machine learning projects and discuss a methodology called CRISP-DM (Cross-Industry Standard Process for Data Mining), which provides a structured approach to planning and executing machine learning projects.