Generalized Linear Models

2020-02-17
4 min read

In my previous posts I have talked about linear and logistic regression. Today I will talk about the broader family of models to which both methods belong: Generalized Linear Models (GLMs). To work our way up to GLMs, we will begin by defining the exponential family.

The exponential family

A class of distributions is in the exponential family if it can be written in the form:

$$ p(y;\eta) = b(y) \text{exp}(\eta^T T(y) - a(\eta)) $$

More information can be found here

Linear regression as a GLM

Maybe you remember that the underlying assumption behind the least squares cost function of linear regression was that the conditional distribution of $y$ given $x$ is a Gaussian (normal) distribution. It turns out the Gaussian distribution is part of the exponential family and can be written as follows:

\[\begin{aligned} p(y;\mu) =& \dfrac{1}{\sqrt{2\pi}} \text{exp} \left( -\dfrac{1}{2} (y-\mu)^2 \right) \\ =& \dfrac{1}{\sqrt{2\pi}} \text{exp} \left( -\dfrac{1}{2}y^2 \right) \cdot \text{exp} \left( \mu y - \dfrac{1}{2} \mu^2 \right) \\ \\ \eta =& \mu \\ T(y) =& y \\ a(\eta) =& \dfrac{\mu^2}{2} \\ =& \dfrac{\eta^2}{2} \\ b(y) =& \dfrac{1}{\sqrt{2\pi}} \text{exp}\left(-\dfrac{y^2}{2}\right) \end{aligned}\]

Note that the standard deviation $\sigma$ has been set to 1. This makes the derivation easier, and since $\sigma$ has no influence on our final choice of $\theta$ we are free to choose it arbitrarily.
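As a quick sanity check, here is a small NumPy/SciPy sketch (variable names and values are my own, not from the post) that evaluates the factorised form $b(y)\,\text{exp}(\eta T(y) - a(\eta))$ and compares it against the standard Gaussian density with $\sigma = 1$:

```python
import numpy as np
from scipy.stats import norm

# Check the exponential-family factorisation of the Gaussian (sigma = 1).
mu = 0.7                      # mean; for the Gaussian this equals eta
eta = mu
y = np.linspace(-3.0, 3.0, 7)

b_y = (1.0 / np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * y ** 2)  # b(y)
a_eta = eta ** 2 / 2.0                                       # a(eta)
p_expfam = b_y * np.exp(eta * y - a_eta)                     # b(y) exp(eta*T(y) - a(eta))

p_direct = norm.pdf(y, loc=mu, scale=1.0)                    # N(mu, 1) density
assert np.allclose(p_expfam, p_direct)
```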

One of the assumptions made when constructing a GLM is that the natural parameter $\eta$ and the inputs $x$ are related linearly: $\eta = \theta^T x$. This brings us back to the familiar hypothesis function of linear regression. To formulate the hypothesis function of a GLM we equate it to the expected value of the conditional distribution:

\[\begin{aligned} h_{\theta}(x) =& E[y|\eta] = E[y|x;\theta] \\ =& \mu \\ =& \eta \\ =& \theta^T x \\ \end{aligned}\]
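In code, the resulting hypothesis is nothing more than a dot product. A minimal sketch (the names and toy values below are my own, purely illustrative):

```python
import numpy as np

def h_linear(theta, x):
    """GLM hypothesis for linear regression: identity link, h(x) = eta = theta^T x."""
    return theta @ x

theta = np.array([0.5, -1.2, 2.0])   # illustrative parameter values
x = np.array([1.0, 0.3, 0.8])        # first entry acts as the intercept term
print(h_linear(theta, x))
```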

Logistic regression as a GLM

The same procedure can be used to derive the logistic regression classifier as a GLM. The classifier assumes a Bernoulli conditional distribution of $y$ given $x$. Rewriting the Bernoulli distribution as a member of the exponential family gives:

\[\begin{aligned} p(y;\phi) =& \phi^y (1-\phi)^{1-y} \\ =& \text{exp} (y \text{log} \phi + (1-y) \text{log}(1-\phi)) \\ =& \text{exp} \left( \left( \text{log} \left( \dfrac{\phi}{1-\phi} \right) \right) y + \text{log}(1-\phi) \right) \\ \\ \eta &= \text{log} \left(\dfrac{\phi}{1-\phi}\right) \\ T(y) &= y \\ a(\eta) &= -\text{log}(1-\phi) \\ &= \text{log}(1 + e^{\eta}) \\ b(y) &= 1 \end{aligned}\]
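A small numerical sketch (again my own check, not part of the original derivation) confirming that this factorisation, with $a(\eta) = \text{log}(1 + e^{\eta})$, reproduces the Bernoulli probabilities for both outcomes:

```python
import numpy as np

# With eta = log(phi / (1 - phi)) and a(eta) = log(1 + exp(eta)),
# b(y) * exp(eta * y - a(eta)) should equal phi^y (1 - phi)^(1 - y).
phi = 0.3
eta = np.log(phi / (1.0 - phi))
a_eta = np.log1p(np.exp(eta))

for y in (0, 1):
    p_expfam = np.exp(eta * y - a_eta)            # b(y) = 1
    p_direct = phi ** y * (1.0 - phi) ** (1 - y)
    assert np.isclose(p_expfam, p_direct)
```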

Again, to formulate the hypothesis function of the logistic regression classifier we equate it to the expected value of the conditional distribution:

\[\begin{aligned} h_{\theta}(x) =& E[y|\eta] = E[y|x;\theta] \\ &= \phi \\ &= \dfrac{1}{1+e^{-\eta}} \\ &= \dfrac{1}{1+e^{-\theta^T x}} \\ \end{aligned}\]
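As with linear regression, this hypothesis is a one-liner in code; a minimal sketch with my own illustrative values:

```python
import numpy as np

def h_logistic(theta, x):
    """GLM hypothesis for logistic regression: the sigmoid of theta^T x."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

theta = np.array([0.5, -1.2, 2.0])   # illustrative parameter values
x = np.array([1.0, 0.3, 0.8])
print(h_logistic(theta, x))          # a probability in (0, 1)
```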

Softmax regression

An example of a GLM that I have not mentioned in previous posts is the softmax regression algorithm. This class of GLM assumes a multinomial conditional distribution of $y$ given $x$, which is appropriate for classification with more than two classes.

To derive the hypothesis function for the softmax algorithm we will start by expressing the multinomial as an exponential family distribution. Up until now we have only seen $T(y) = y$, however for the multinomial distribution the sufficient statistic $T(y)$ is actually a vector of size equal to the number of classes:

\[\begin{aligned} p(y;\phi) =& \phi^{1\{y=1\}}_1 \phi^{1\{y=2\}}_2 ... \phi^{1\{y=k\}}_k \\ =& \phi^{1\{y=1\}}_1 \phi^{1\{y=2\}}_2 ... \phi^{1 - \Sigma^{k-1}_{i=1}1\{y=i\}}_k \\ =& \phi^{(T(y))_1}_1 \phi^{(T(y))_2}_2 ... \phi^{1 - \Sigma^{k-1}_{i=1}(T(y))_i}_k \\ =& \text{exp} ((T(y))_1 \text{log}(\phi_1) + (T(y))_2 \text{log}(\phi_2) + ... + (1-\Sigma^{k-1}_{i=1} (T(y))_i)\text{log}(\phi_k)) \\ =& \text{exp} ((T(y))_1 \text{log}(\dfrac{\phi_1}{\phi_k}) + (T(y))_2 \text{log}(\dfrac{\phi_2}{\phi_k}) + ... + (T(y))_{k-1}\text{log}(\dfrac{\phi_{k-1}}{\phi_k}) + \text{log}(\phi_k)) \\ \end{aligned}\]
\[\begin{aligned} \eta =& \begin{bmatrix} \log(\dfrac{\phi_1}{\phi_k}) \\ \log(\dfrac{\phi_2}{\phi_k}) \\ \vdots \\ \log(\dfrac{\phi_{k-1}}{\phi_k}) \end{bmatrix} \\ a(\eta) =& -\log(\phi_k) \\ b(y) =& 1 \end{aligned}\]

Here the notation $1\{\cdot\}$ takes on a value of 1 if its argument is true, and 0 otherwise. Also, $(T(y))_i$ denotes the $i$-th element of the vector $T(y)$.

Because $T(y)$ is a vector, the hypothesis function will also be a vector which contains a hypothesis for every class. The expectation of a single class is obtained from the natural parameter as follows:

\[\begin{aligned} \eta_i =& \log \dfrac{\phi_i}{\phi_k} \\ e^{\eta_i} =& \dfrac{\phi_i}{\phi_k} \\ \phi_k e^{\eta_i} =& \phi_i \\ \\ \phi_k \Sigma^k_{i=1} e^{\eta_i} =& \Sigma^k_{i=1} \phi_i = 1 \\ \\ \phi_i =& \dfrac{e^{\eta_i}}{\Sigma^k_{j=1} e^{\eta_j}} \end{aligned}\]
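Note that the last step uses the convention $\eta_k = \log(\phi_k/\phi_k) = 0$, so the sum in the denominator runs over all $k$ classes. A small sketch (illustrative values of my own) that recovers the $\phi_i$ from the natural parameters this way:

```python
import numpy as np

# Recover the class probabilities phi_i from the natural parameters.
# eta_k = log(phi_k / phi_k) = 0 by convention, so append a zero before normalising.
eta = np.array([1.0, -0.5, 0.3])     # eta_1 ... eta_{k-1}, illustrative values
eta_full = np.append(eta, 0.0)       # eta_k = 0
phi = np.exp(eta_full) / np.sum(np.exp(eta_full))
print(phi, phi.sum())                # probabilities summing to 1
```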

Finally, the entire hypothesis function will look as follows:

\[\begin{aligned} h_{\theta}(x) =& E[T(y)|x;\theta] \\ =& \begin{bmatrix} \dfrac{e^{\theta^T_1 x}}{\Sigma^k_{j=1} e^{\theta^T_j x}} \\ \dfrac{e^{\theta^T_2 x}}{\Sigma^k_{j=1} e^{\theta^T_j x}} \\ \vdots \\ \dfrac{e^{\theta^T_{k-1} x}}{\Sigma^k_{j=1} e^{\theta^T_j x}} \\ \end{bmatrix} \end{aligned}\]

Note that to use the softmax algorithm as a classifier, all you have to do is output the class for which the hypothesis is highest.
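Putting the pieces together, here is a minimal sketch of the softmax hypothesis and the resulting classification rule. The parameterisation is my own: I stack the $\theta_i$ as rows of a matrix and fix $\theta_k$ at zero, matching the convention $\eta_k = 0$ used above; the values are purely illustrative.

```python
import numpy as np

def h_softmax(Theta, x):
    """GLM hypothesis for softmax regression.

    Theta is a (k, n) matrix whose rows are theta_1 ... theta_k
    (theta_k fixed to zeros); returns the k class probabilities.
    """
    scores = Theta @ x
    scores -= scores.max()            # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

Theta = np.array([[0.5, -1.2, 2.0],
                  [1.0,  0.1, -0.3],
                  [0.0,  0.0,  0.0]])   # theta_k fixed at zero
x = np.array([1.0, 0.3, 0.8])

probs = h_softmax(Theta, x)
prediction = np.argmax(probs)           # output the class with the highest hypothesis value
```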

We have not talked about the cost function and the update rule to actually train this classifier. The appropriate update rule can be derived using maximum likelihood estimation. I will not go into detail about this here.

See … for more details.