Non-Parametric Algorithms
As we saw with the QDA and LDA classifiers, the posterior probabilities can be obtained by estimating the class-conditional probabilities with a (multivariate) Gaussian distribution, whose parameters $\mu$ and $\Sigma$ we estimated from a set of examples using (unbiased) estimators.
Instead of this parametric approach, it is also possible to estimate the class-conditional probabilities using kernel density estimation.
Kernel density estimation: the main idea is to approximate the distribution by a mixture of continuous distributions called kernels.
Histogram Density Estimation
In histogram density estimation the feature space is split into bins of width $h$. Within each bin the number of objects of each class is counted. The class-conditional probabilities are then calculated per bin, normalized by the bin width and the total number of samples.
The downside of the histogram method is that the feature space is partitioned into discrete bins, so the estimated density is not continuous across the feature space. Decreasing the bin width to mitigate this causes problems in sparse datasets, since many bins will then contain few or no samples.
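A minimal sketch of the idea for a one-dimensional feature, using NumPy (the function name and the handling of out-of-range query points are illustrative assumptions, not part of the original notes):

```python
import numpy as np

def histogram_density(x_train, x_query, h):
    """Histogram estimate of p(x) for 1-D data with bin width h (illustrative sketch)."""
    x_query = np.asarray(x_query, dtype=float)
    # Bin edges spanning the training data, each bin h wide.
    edges = np.arange(x_train.min(), x_train.max() + h, h)
    counts, _ = np.histogram(x_train, bins=edges)
    # Normalize by the number of samples and the bin width.
    density = counts / (len(x_train) * h)
    # Look up the bin each query point falls into; density is 0 outside the range.
    idx = np.searchsorted(edges, x_query, side="right") - 1
    inside = (idx >= 0) & (idx < len(density))
    p = np.zeros_like(x_query)
    p[inside] = density[idx[inside]]
    return p
```

To obtain the class-conditional estimates $p(x \mid \omega_i)$, the same function is applied per class, using only the samples of that class.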
Parzen Density Estimation
In Parzen density estimation, a kernel is centered at every data point $x_i$. The class-conditional probability is the normalized sum of these kernels.
Provided the kernel is wide enough and continuous, this method avoids the problems of histogram density estimation.
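A minimal sketch with a Gaussian kernel of width $h$, again for a 1-D feature (the Gaussian choice and the function name are illustrative assumptions):

```python
import numpy as np

def parzen_density(x_train, x_query, h):
    """Parzen estimate of p(x): the normalized sum of one Gaussian kernel per sample."""
    x_query = np.asarray(x_query, dtype=float)
    # One kernel centered at every training point x_i, evaluated at each query point.
    u = (x_query[:, None] - x_train[None, :]) / h
    kernels = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))
    # Average over the n kernels, so the estimate remains a normalized density.
    return kernels.mean(axis=1)
```

Because the Gaussian kernel is smooth, the resulting density estimate is continuous everywhere, unlike the histogram estimate.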
K-NN
Another method estimates the density at a point $x$ from its $k$ nearest neighbors. The estimate is based on the volume $V_k^{(i)}(x)$ of a sphere centered at $x$, with radius $r$ equal to the distance from $x$ to its $k^{th}$ nearest neighbor of class $\omega_i$. Dividing by the number of class samples $n_i$ normalizes the density:

$$p(x \mid \omega_i) = \frac{k}{n_i \, V_k^{(i)}(x)}$$
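A minimal sketch of this class-conditional estimate (illustrative names; it uses the fact that the volume of a $d$-dimensional sphere of radius $r$ is $\pi^{d/2} r^d / \Gamma(d/2 + 1)$):

```python
import numpy as np
from math import gamma, pi

def knn_density(X_class, x, k):
    """k-NN estimate of p(x | class i) = k / (n_i * V_k) (illustrative sketch)."""
    n_i, d = X_class.shape
    # Radius: Euclidean distance from x to its k-th nearest neighbor of this class.
    r = np.sort(np.linalg.norm(X_class - x, axis=1))[k - 1]
    # Volume of the d-dimensional sphere with that radius.
    V_k = pi ** (d / 2) / gamma(d / 2 + 1) * r ** d
    return k / (n_i * V_k)
```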
The general hypothesis function for the two-class case is as follows:

$$h(x) = \mathbb{1}\left[\, p(\omega_1 \mid x) > p(\omega_2 \mid x) \,\right]$$
The data distribution has the same form as the class-conditional distribution, only without the class dependence; here $n$ is the total number of samples and the radius of $V_k(x)$ is the distance to the $k^{th}$ nearest neighbor among all classes:

$$p(x) = \frac{k}{n \, V_k(x)}$$
Class priors are given by:

$$p(\omega_i) = \frac{n_i}{n}$$
The class posteriors follow from Bayes' rule and can be simplified by removing the constants and multipliers that do not depend on the class:

$$p(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, p(\omega_i)}{p(x)} = \frac{\frac{k}{n_i V_k^{(i)}(x)} \cdot \frac{n_i}{n}}{\frac{k}{n V_k(x)}} = \frac{V_k(x)}{V_k^{(i)}(x)} \propto \frac{1}{V_k^{(i)}(x)} \propto \frac{1}{r_i^{\,d}}$$

where $r_i$ is the distance from $x$ to its $k^{th}$ nearest neighbor of class $\omega_i$ and $d$ is the dimensionality of the feature space.
This says that $x$ is assigned to the class for which the Euclidean distance to its $k^{th}$ nearest neighbor is smallest. Plugging this result into the (Bernoulli) hypothesis function yields:

$$h(x) = \mathbb{1}\left[\, r_1 < r_2 \,\right]$$

i.e. predict $\omega_1$ when its $k^{th}$ nearest neighbor is closer than that of $\omega_2$.
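A minimal sketch of the resulting decision rule (illustrative names; it assumes each class has at least $k$ training samples):

```python
import numpy as np

def knn_rule(X_train, y_train, x, k):
    """Assign x to the class whose k-th nearest neighbor is closest (illustrative sketch)."""
    best_class, best_r = None, np.inf
    for c in np.unique(y_train):
        # Euclidean distances from x to all training samples of class c.
        dists = np.linalg.norm(X_train[y_train == c] - x, axis=1)
        r_k = np.sort(dists)[k - 1]  # distance to the k-th nearest neighbor of class c
        if r_k < best_r:
            best_class, best_r = c, r_k
    return best_class
```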
It is good to note that, unlike the Gaussian densities used in the parametric estimators, the k-NN density estimate is improper: it does not integrate to one over $(-\infty, +\infty)$. (The histogram and Parzen estimates do integrate to one, provided the kernel is properly normalized.) The same holds for the simplified posterior scores above, since the class-independent constants were dropped. Since we are only interested in the ordering of the class scores in the hypothesis function, this does not matter.