Naive Bayes

2019-12-25
1 min read

All generative algorithms mentioned above try to model the multivariate class-conditional distribution over all features. However, when the sheer number of features becomes too large, the number of parameters needed to model the distribution becomes infeasible in practice. A multivariate Gaussian, for example, scales as $O(n + n^2) \simeq O(n^2)$ with the number of features: $n$ parameters for the mean vector and $O(n^2)$ for the covariance matrix.
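
As a rough illustration (the numbers here are mine, not from the original notes): with $n = 10{,}000$ features, a full-covariance Gaussian needs

\[\begin{aligned} \underbrace{n}_{\mu} + \underbrace{\frac{n(n+1)}{2}}_{\Sigma} = 10{,}000 + 50{,}005{,}000 \approx 5 \times 10^7 \end{aligned}\]

parameters per class, far more than a typical dataset can support.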

The Naive Bayes classifier therefore makes a very strong assumption: all features $x_j$ are considered conditionally independent given $y$. This is called the Naive Bayes (NB) assumption.

\[\begin{aligned} p(x|y) = \prod_{j=1}^{n} p(x_j|y) \end{aligned}\]

What this means in practice for GDA is that every feature is modelled independently with a 1D Gaussian, and the final class-conditional probability is the product of these per-feature probabilities. The number of parameters needed for the Naive Bayes GDA variant then scales as $O(2n)$ (one $\mu$ and one $\sigma$ per feature, per class).
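
A minimal sketch of this Gaussian Naive Bayes variant (the function names and NumPy implementation are mine, not from the original notes; probabilities are evaluated in log space to avoid underflow when $n$ is large):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Fit one 1D Gaussian per feature, per class (the NB GDA variant).

    X: (m, n) feature matrix; y: (m,) integer class labels.
    """
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])           # p(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])     # (k, n)
    variances = np.array([X[y == c].var(axis=0) for c in classes])  # (k, n)
    return classes, priors, means, variances

def predict(X, classes, priors, means, variances):
    """Return argmax_y p(y) * prod_j p(x_j | y), computed in log space.

    Assumes every per-feature variance is nonzero.
    """
    # Sum over features of log N(x_j; mu_j, sigma_j^2), per class: shape (m, k)
    log_likelihood = -0.5 * (
        np.log(2 * np.pi * variances[None, :, :])
        + (X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
    ).sum(axis=2)
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]
```

Note that the fitted model stores only a mean and a variance per feature per class, which is exactly the $O(2n)$ parameter count mentioned above.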

Naive Bayes classifiers are very popular for use cases with very large feature sets (for example, text classification). More information about using Naive Bayes for text classification can be found in the CS229 lecture notes: http://cs229.stanford.edu/notes/cs229-notes2.pdf