Linear Regression

2020-01-12
4 min read

The linear regression algorithm works by trying to predict the class label directly given an object. It tries to fit a (hyper)plane as good as possible through all the training examples, so that the output of the hyperplane $y$ predicts the class label. As you may have noticed the label $y$ will be predicted as a continuous value on the hyper plane. This makes it a pretty horrible fit for a (binary) classification problem, where the class label is discrete. However for regression problems, where the label is continuous it has the potential to be much better. For a dataset containing 2 features (feature space of $\Re^2$) this might look like:

\[\begin{aligned} y = h_{\theta} (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 \end{aligned}\]

Note how the hypothesis function for the predicted output in literature is conventionally denoted by $h$.

For an $n$ dimensional space this would equate to the following:

\[\begin{aligned} y = h_{\theta}(x) = \sum_{i=0}^n \theta_i x_i = \theta^T x \end{aligned}\]

Here $\theta_0$ is included in the statement by extending the feature vector $x$ for the bias term, so $x_0 = 1$.

Cost function

To learn a ‘good’ weight vector $\theta$ we first need a certain metric for what is considered ‘good’. This is where the cost function comes in. There are different choices possible to use as cost function, but we will be talking about the least squares cost here. When we assume the probability of outcome $y$ given feature vector $x$ to be of a normal distribution, the max log likelihood can be taken to arrive at the following least squares cost function formulation:

\[\begin{aligned} J(\theta) = \dfrac{1}{2} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^2 \end{aligned}\]

Derivation can be found at http://cs229.stanford.edu/notes/cs229-notes1.pdf

Gradient descend

From the cost function we can get a metric of how ‘good’ our linear classifier is, because we want the distance to our hyperplane from both classes to be as high as possible. To obtain the best linear classifier to we need to optimize our weight vector $\theta$ with respect to our cost function.Since this is usually unfeasible to do analytically (because of high dimensionality and data volume) we will be using the optimization algorithm gradient descent. Here we move in the direction of lowest cost by repeatedly performing the update:

\[\begin{aligned} \theta_j := \theta_j - \alpha \dfrac{\delta}{\delta \theta_j} J(\theta) \end{aligned}\]

Here $\alpha$ is also known as the learning rate. If $\alpha$ is too small gradient descent can be slow, if $\alpha$ is too large gradient descend may fail to converge.

LMS algorithm

The LMS algorithm uses the least squares cost function with the gradient descend optimizer. Let’s first work it out for the case of if we have only one training example (x,y), so that we can neglect the sum in the definition of J. We have:

\[\begin{aligned} \dfrac{\delta}{\delta \theta_j} J(\theta) &= \dfrac{\delta}{\delta \theta_j} \dfrac{1}{2} (h_{\theta}(x) - y)^2 \\ &= 2 \cdot \dfrac{1}{2} (h_{\theta}(x) - y) \cdot \dfrac{\delta}{\delta \theta_j} (h_{\theta}(x) - y) \\ &= (h_{\theta}(x) - y) \cdot \dfrac{\delta}{\delta \theta_j} ( \sum_{i=0}^{n} \theta_i x_i - y) \\ &= (h_{\theta}(x) - y) x_j \end{aligned}\]

For a single training example, this gives the update rule:

\[\begin{aligned} \theta_j := \theta_j + \alpha (y-h_{\theta}(x))x_j \end{aligned}\]

This rule is called the LMS (least mean squared) update rule, and is also known as the Widrow-Hoff learning rule. This update rule has some natural and intuitive properties. For instance the magnitude of the update is proportional to the error. The update rule using all samples of the training set can be written as follows:

This is called batch gradient descend. You can imagine that calculating the error for every sample in the dataset for every update step is very computationally expensive. Another version of gradient descend, called stochastic gradient descend aims to deal with this problem, by stochastically sampling a single training example. It is good to note that this update method will converge faster, however will never reach the optimum. Stochastic gradient descend is often used with large datasets, because you can get very close to the optimum faster.

Usage

Once $\theta$ has been ’trained’ properly, predictions can simple be made by the hypothesis function $h_{\theta}$.

Calling the hypothesis function is practice is sometimes called running inference.

Final notes: Gradient descend optimisation techniques can be susceptible to local minima in the cost function. This is however not the case in our example, because the least squares cost function is a convex quadratic function (it has no local minima).