Introduction to Supervised Learning

By A.Redjil

Published:

Introduction

This blog post serves as a light introduction to supervised learning. It introduces the basic ideas behind the field. For a deeper discussion, check out the references section. It assumes knowledge of basic linear algebra, real analysis, and probability.

Workflow

At its core, supervised learning is teaching an algorithm patterns through examples. The examples by which the algorithm is trained are in the form of labeled data, where each sample has an observation and an associated value (target or label). The training process happens iteratively until convergence. After training, the performance of the model is assessed by checking its predictions on previously unseen data.

Data

The key difference between supervised learning and other fields of machine learning is that it relies on labeled data. Formally, a labeled dataset is a set of \(N\) tuples of independent observations \(X_i\) and dependent labels \(y_i\). Each observation vector \(X_i\) contains \(p\) independent features. A supervised learning algorithm solves regression if the target values are continuous or classification if they are discrete.

\[ \mathcal{D} = \{(\mathbf{X}_i, y_i)\}_{i=1}^{N} \]

\(\mathbf{X}_i\) is the observation vector for sample \(i\), each entry of the vector represents a unique feature. \(y_i\) is the measured target value of sample \(i\). For example, a student’s attendance record can be represented as features, and the target is whether the student passed or not.

To check generalization, the dataset is divided into two disjoint subsets: training and testing data, used at different stages of the workflow. The training set usually contains \(80\%\) of the data (\(|\mathcal{D}_{train}|=M\)), and the remaining \(20\%\) are assigned to the test set (\(|\mathcal{D}_{test}|=N-M\)). This way we can check whether the algorithm really learned the underlying patterns or just memorized the training examples (overfitting).

Ideal Model

The targets are assumed to be generated from an unknown generative process \(f(\mathbf{X})\). Supervised learning approximates it with a parameterized model \(g(\mathbf{X};\theta)\). The observations \(\{X_i, \dots, X_N\}\) can be arranged into a design matrix of \(N\) rows (samples) and \(p\) columns (features), \(\mathbf{X} \in \mathbb{R}^{N \times p}\), and the targets into a vector \(\mathbf{y}\) with \(N\) elements.

\[ \mathbf{y} = f(\mathbf{X}) \]

However, in real settings the targets will contain measurement noise. The model then aims to uncover the process \(f(\mathbf{X})\) despite noise, and makes predictions \(\mathbf{\hat{y}}\):

\[ \mathbf{\hat{y}} = g(\mathbf{X};\theta) \] The model can take any functional form depending on the assumed relationship of the data. In simple cases a linear model is enough to uncover the patterns within the data, in more complex cases artificial neural networks are employed to identify non-linear patterns.

A good model is that which approximates the observed data well, and predicts well outputs of new unseen data. It does so by identifying the optimal parameters \(\mathbf{\hat{\theta}}\) that make the model close enough to the true generative process \[ g(\mathbf{X}; \mathbf{\hat{\theta}}) \approx f(\mathbf{X}) \]

Cost Function

The cost function \(\mathcal{C}\) measures how well the model predicts the labels based on the observations. It computes the distance between predictions \(\mathbf{\hat{y}}\) and true values \(\mathbf{y}\). In other words, it measures how far is the model from the true generative process

\[ \mathcal{C}(\mathbf{\hat{y}}, \mathbf{y}_{true}) = \sum_{i=1}^{M'} \ell(\hat{\!y}, y_i) \]

Here, \(\ell(\hat{y}_i, y_i)\) is the cost of a single sample. The costs are additive because the dataset contains independent elements. During training this is called the in-sample error \(E_{in}\), and during testing the out-of-sample error \(E_{out}\).

\[ E_{in}(\mathbf{\hat{y}}, \mathbf{y}) = \frac{1}{M} \sum\limits_{i=1}^{M}\ell(\hat{y}_i, y_i), \quad E_{out}(\mathbf{\hat{y}}, \mathbf{y}) = \frac{1}{N-M} \sum\limits_{i=1}^{N-M}\ell(\hat{y}_i, y_i) \] \(E_{in}(\mathbf{\hat{y}}, \mathbf{y})\) is used during training to fit the parameters of the model to the data, whereas, \(E_{out}(\mathbf{\hat{y}}, \mathbf{y})\) is used after trainign to assess the generalization power of the model on new unseen data.

For regression, a widely used cost function is the mean squared error (MSE):

\[ E_{in}(\mathbf{\hat{y}}, \mathbf{\mathbf{\hat{y}}}) = \frac{1}{M} \sum_{i=1}^{M}\left(\hat{y}_i - y_i\right)^2 \] It measures the mean squared distance between the predicted and true label values. Squaring the differences penalizes predictions that deviate substantially while rewarding predictions that are closer to the true values. To identify the optimal set of parameters the in-sample error \(E_{in}(\mathbf{\hat{y}}, \mathbf{y})\) should be minimized with respect to the parameters. \[ \mathbf{\hat{\theta}} = \arg\min\limits_{\theta} E_{in}(\mathbf{\hat{y}}, \mathbf{y}) \]

Cost function surface and contour plot
Figure 1. Left: 3D visualization of the cost function for a linear model with two parameters. Right: contour plot of the same cost function.

From a physics perspective, the cost function can be thought of as the Hamiltonian of the system, where the ground state corresponds to the set of parameters that minimize the cost function and capture the underlying patterns in the data.

Optimization: Gradient Descent

Gradient descent is an iterative algorithm for minimizing the cost function. Starting from an initial guess \(\theta_0\), it updates the parameters in the opposite direction of the gradient until convergence:

\[ \theta_{t+1} = \theta_t - \eta \nabla_{\theta} \mathcal{C}(\mathbf{\hat{y}}, \mathbf{y}) \]

Here \(\eta\) is the learning rate controlling the step size, \(\mathcal{C}(\mathbf{\hat{y}}, \mathbf{y})\) is the gradient of the cost function with respect to the parameters. Parameters \(\theta_t\) are updated iteratively until convergence.

Gradient Descent path with learning rate 0.1
Figure 2. Gradient descent convergence path using a learning rate of \(\eta = 0.1\).

It is important to choose the appropriate learning rate \(\eta\). If the learning rate is too small, say \(\eta=0.001\) gradient descent algorithm will require many more iterations to converge

Gradient Descent path with learning rate 0.1
Figure 3. Gradient descent convergence path using a learning rate of \(\eta = 0.0001\).

On the other hand if the learning rate is too high \(\eta =1.00\), gradient descent will diverge

Gradient Descent path with learning rate 0.1
Figure 4. Gradient descent convergence path using a learning rate of \(\eta = 1.0\).

Linear Regression

Linear regression assumes a linear relationship between features and targets: \[ y_{i\ell} = \sum_{j=1}^{p} x_{ij}\, \omega_{j\ell} + b_{\ell} + \epsilon, \quad i \in \{1,\dots, N\}, \quad \ell \in \{1, \dots, k\} \] where \(x_{ij}\) is the \(j\)-th feature of sample \(i\), \(\omega_{j\ell}\) is the true weight, \(b_{\ell}\) is the bias, and \(\epsilon\) is noise.

A linear model estimates \(\hat{y}_{i\ell}\) by varying parameters \(\theta_{j\ell}\): \[ \hat{y}_{i\ell} = \sum_{j=1}^{p} x_{ij}\, \theta_{j\ell} + b_{\ell} \]

The optimal parameters solve: \[ \widehat{\theta} = \arg\min\limits_{\theta_{j\ell} \in \mathbb{R}} \mathcal{C}(\mathbf{\hat{y}}, \mathbf{y}) \]

For visualization, consider the simple case of one feature (\(p=1\)), one target (\(k=1\)), and MSE as the cost function.

Python Implementation

Python Implementation \[ f(x) = a x + b \] Adding measurement noise to the labels \[ y = f(x) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, 1) \] The goal is to find a linear model \(g(x; \theta) = \theta_0 x + \theta_1\) that approximates the generating process \(f(x)\). Below is a fully Python code for linear regression. We start first by generating \(1000\) synthetic continuous data points in the interval \([0, 2]\). Afterwards, we generate, target values using the following formula, with \(a=1.00\), and \(b= 2.00\) \[ y = a x + b \] These values correspond to the ideal model that generated the data. In the last step of setting up the synthetic data, to each target point we add standard Gaussian noise \(\epsilon \sim \mathcal{N}(0, 1)\).

          
          import numpy as np

          # We start first by fixing the seed by reproducing the results 
          seed = 127
          # We initialize a random number generator 
          rng = np.random.default_rng(seed=seed)

          # We fix the True parameters
          a, b = 2.0, 1.0

          # Then we generate synthetic data
          X = 2 * rng.random(1000)            # Random samples in [0, 2]
          epsilon = rng.normal(0, 1.0, size=1000)  # Gaussian noise
          y_noisy = a * X + b + epsilon
          
        

True model vs Noisy Model
Figure 5. True vs noisy model output.

Next step is to split the dataset into training and testing. We do that by shuffling the incidences of the data and we assign \(80\%\) of them to the training data, and the remaining \(20\%\) to the test data.

          
        # Split into train/test
        train_ratio = 0.8
        N = len(X) # Full size of the data 
        train_size = int(train_ratio * N) # The training portion 

        indices = np.arange(N) # We get the indecies of the original data
          rng.shuffle(indices) # We then randomly shuffle the indices 
        train_idx, test_idx = indices[:train_size], indices[train_size:] # Then we split 
        # the indeices into training/testing indices  
        # We slice the training/testoing potions 
        X_train, X_test = X[train_idx].reshape(-1, 1), X[test_idx].reshape(-1, 1)
        y_train, y_test = y_noisy[train_idx].reshape(-1, 1), y_noisy[test_idx].reshape(-1, 1)

        # Add bias column
        X_train_b = np.c_[np.ones((X_train.shape[0], 1)), X_train]
        X_test_b = np.c_[np.ones((X_test.shape[0], 1)), X_test]        

        
Then, we initialize the parameters to a random values, the learning rate to \(0.1\) and the number of iteration to \(100\).
          
          theta = rng.normal(size=(2, 1))  # Random initialization
          n_iter = 100 # Fix the number of iterations 
          lr = 0.1 # Fix the learning rate 
          m_train = len(y_train) # Get the training data size 
          cost_history = [] # Intilize an empty list to store iteration cost values 
          
        
Last step is to learn the parameters. At each step we compute the gradient of the cost with respect to the parameters, then we update the value of the parameters based on the value of the gradient. We repeat this process for the \(100\) steps. and then we print the final estimation of the parameters to the screen.
          
        # Gradient Descent
        for _ in range(n_iter): # We iterate 
          # Compute the gradient with respect to the parameters 
            gradients = (2/m_train) * X_train_b.T.dot(X_train_b.dot(theta) - y_train)
            # Update the paramters 
            theta -= lr * gradients
            # Compute the cost function 
            cost = (1/m_train) * np.sum((X_train_b.dot(theta) - y_train)**2)
            # Append the cost function 
            cost_history.append(cost)
        # Get the value of the paramters 
        theta_0, theta_1 = theta.ravel()
        print(f"Estimated parameters: theta_0 = {theta_0:.2f}, theta_1 = {theta_1:.2f}")          
      
        

Results

After \(100\) iterations, the estimated parameters are: \[ \hat{\theta_0} \approx 2.00, \quad \hat{\theta_1}\approx 1.00 \] These are very close to the true parameters \(b=1.0\), \(a=2.0\).

Learned model
Figure 6. Learned vs ground truth model.

The figure shows the outputs of the true model vs the predictions of the linear model, where the blue circles correspond to the training data points and the yellow ones are the testing data. The predictions closely match the ground truth, and the cost function converges within the first 100 iterations.

Cost function
Figure 7. Cost function per iteration.

Conclusion

This post touched on the basic concepts of supervised machine learning. From how data is structured, to how the cost function measures performance, to how gradient optimizes the parameters. Then they were applied to the simple case of linear regression, where the model learned the true generative process behind the noisy measurements. However, this only scratches the surface. Real world applications are more complicated, yet they all build on the concepts presented in this blog.