Naive Bayes classifier

License: CC 4.0 BY-SA

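The naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem under the assumption that features are mutually independent. The classifier treats each feature as contributing independently to the class probability. Although this assumption rarely holds in practice, naive Bayes classifiers perform surprisingly well in many complex real-world situations.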

A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem (from Bayesian statistics) with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model".

In simple terms, a naive Bayes classifier assumes that, given the class, the presence (or absence) of a particular feature is unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 4 inches in diameter. Even if these features depend on one another, a naive Bayes classifier treats each of them as contributing independently to the probability that the fruit is an apple.

Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood; in other words, one can work with the naive Bayes model without believing in Bayesian probability or using any Bayesian methods.

In spite of their naive design and apparently over-simplified assumptions, naive Bayes classifiers often work much better in complex real-world situations than one might expect. Careful analysis of the Bayesian classification problem has shown that there are some theoretical reasons for the apparently unreasonable efficacy of naive Bayes classifiers.^[1] An advantage of the naive Bayes classifier is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Because the variables are assumed to be independent, only the variances of the variables for each class need to be determined, not the entire covariance matrix.
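To make the point about means and variances concrete, here is a minimal sketch of parameter estimation for continuous features, assuming each feature is modeled as a Gaussian within each class (an assumption made for illustration; the text does not fix a particular distribution). The function name `fit_gaussian_params` and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fit_gaussian_params(samples, labels):
    """Estimate class priors and per-feature means/variances for each class.

    samples: list of equal-length feature tuples
    labels:  list of class labels, one per sample
    """
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    params = {}
    for y, rows in by_class.items():
        columns = list(zip(*rows))  # one tuple of observed values per feature
        params[y] = {
            "prior": len(rows) / len(samples),
            "means": [fmean(col) for col in columns],
            # Maximum-likelihood (population) variance per feature.
            # No covariances between features are estimated or stored.
            "variances": [pvariance(col) for col in columns],
        }
    return params


# Hypothetical toy data: (diameter in inches, redness score in [0, 1]).
samples = [(4.0, 0.9), (4.2, 0.8), (3.8, 0.7), (2.5, 0.2), (2.7, 0.3)]
labels = ["apple", "apple", "apple", "other", "other"]
params = fit_gaussian_params(samples, labels)
```

With $n$ features, this stores $n$ means and $n$ variances per class, whereas a full covariance matrix would need $n(n+1)/2$ entries per class.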

The naive Bayes probabilistic model

Abstractly, the probability model for a classifier is a conditional model

$p(C \vert F_1,\dots,F_n)$

over a dependent class variable $C$ with a small number of outcomes or classes, conditional on several feature variables $F_1$ through $F_n$. The problem is that if the number of features $n$ is large or when a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable.
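A quick calculation shows why a table-based model becomes infeasible. The sketch below counts the entries in a full table with one probability per combination of class and joint feature configuration, assuming for simplicity that every feature takes the same number of values. The helper name `full_table_entries` is hypothetical.

```python
def full_table_entries(n_features, values_per_feature, n_classes):
    """Number of entries in a table indexed by class and every joint
    configuration of the features (all features share one value count)."""
    return n_classes * values_per_feature ** n_features


# 30 binary features and 2 classes already need over two billion entries.
entries = full_table_entries(n_features=30, values_per_feature=2, n_classes=2)
print(entries)  # 2147483648
```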

Using Bayes' theorem, we write

$p(C \vert F_1,\dots,F_n) = \frac{p(C) \, p(F_1,\dots,F_n \vert C)}{p(F_1,\dots,F_n)}.$

In plain English the above equation can be written as

$\mbox{posterior} = \frac{\mbox{prior} \times \mbox{likelihood}}{\mbox{evidence}}.$
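As a small worked instance of this relation, the sketch below computes a posterior for two classes from a prior and a likelihood. The class names and all numbers are made up for illustration.

```python
# Hypothetical prior p(C) and likelihood p(observed features | C).
prior = {"spam": 0.2, "ham": 0.8}
likelihood = {"spam": 0.5, "ham": 0.05}

# Numerator of Bayes' theorem for each class: prior x likelihood.
numerator = {c: prior[c] * likelihood[c] for c in prior}

# The evidence is the same for every class: it sums the numerators.
evidence = sum(numerator.values())

posterior = {c: numerator[c] / evidence for c in prior}
```

Here the numerators are 0.1 and 0.04, the evidence is 0.14, and the posterior for "spam" is roughly 0.714. Ranking the classes by numerator alone gives the same ordering as ranking them by posterior.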

In practice we are only interested in the numerator of that fraction, since the denominator does not depend on $C$ and the values of the features $F_i$ are given, so that the denominator is effectively constant. The numerator is equivalent to the joint probability model

$p(C, F_1, \dots, F_n)$

which can be rewritten as follows, using repeated applications of the definition of conditional probability:

$p(C, F_1, \dots, F_n)$

$= p(C) \, p(F_1,\dots,F_n \vert C)$

$= p(C) \, p(F_1 \vert C) \, p(F_2,\dots,F_n \vert C, F_1)$

$= p(C) \, p(F_1 \vert C) \, p(F_2 \vert C, F_1) \, p(F_3,\dots,F_n \vert C, F_1, F_2)$

$= p(C) \, p(F_1 \vert C) \, p(F_2 \vert C, F_1) \, p(F_3 \vert C, F_1, F_2) \, p(F_4,\dots,F_n \vert C, F_1, F_2, F_3)$

$= p(C) \, p(F_1 \vert C) \, p(F_2 \vert C, F_1) \, p(F_3 \vert C, F_1, F_2) \cdots p(F_n \vert C, F_1, F_2, F_3,\dots,F_{n-1}).$
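The chain-rule expansion holds for any joint distribution, before any independence assumption is made. The sketch below checks this numerically on an arbitrary joint distribution over a binary class and three binary features. The helper names `marginal` and `chain_rule_product` are hypothetical.

```python
import itertools
import random

random.seed(0)

# Arbitrary joint distribution over (c, f1, f2, f3), all binary.
outcomes = list(itertools.product([0, 1], repeat=4))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}


def marginal(prefix):
    """Probability that the leading variables take the values in prefix."""
    return sum(p for o, p in joint.items() if o[:len(prefix)] == prefix)


def chain_rule_product(outcome):
    """Multiply p(x_1) p(x_2 | x_1) ... p(x_m | x_1, ..., x_{m-1})."""
    product = 1.0
    for i in range(1, len(outcome) + 1):
        # Each conditional is a ratio of two marginals.
        product *= marginal(outcome[:i]) / marginal(outcome[:i - 1])
    return product


max_error = max(abs(chain_rule_product(o) - joint[o]) for o in outcomes)
```

The value of `max_error` should be at the level of floating-point rounding, since the product of conditionals telescopes back to the joint probability.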

Now the "naive" conditional independence assumptions come into play: assume that, given the class $C$, each feature $F_i$ is conditionally independent of every other feature $F_j$ for $j \neq i$. This means that

$p(F_i \vert C, F_j) = p(F_i \vert C)$

and so the joint model can be expressed as

$p(C, F_1, \dots, F_n) = p(C) \, p(F_1 \vert C) \, p(F_2 \vert C) \, p(F_3 \vert C) \cdots$

$= p(C) \prod_{i=1}^n p(F_i \vert C).$

This means that under the above independence assumptions, the conditional distribution over the class variable $C$ can be expressed like this:

$p(C \vert F_1,\dots,F_n) = \frac{1}{Z} \, p(C) \prod_{i=1}^n p(F_i \vert C)$

where $Z$ is a scaling factor dependent only on $F_1,\dots,F_n$, i.e., a constant if the values of the feature variables are known.
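The factored form translates directly into code. The following is a minimal sketch for binary (Bernoulli) features, assuming the prior and the per-class feature probabilities are already known. The function name `naive_bayes_posterior` and the numbers are hypothetical.

```python
def naive_bayes_posterior(priors, feature_probs, features):
    """Return p(C | F_1, ..., F_n) for binary features.

    priors:        {class: p(C)}
    feature_probs: {class: [p(F_i = 1 | C) for each feature i]}
    features:      observed feature values, each 0 or 1
    """
    unnormalized = {}
    for c, prior in priors.items():
        score = prior
        for p_one, f in zip(feature_probs[c], features):
            # p(F_i = f | C): p_one if the feature is present, else 1 - p_one.
            score *= p_one if f == 1 else 1.0 - p_one
        unnormalized[c] = score

    # Z depends only on the observed features, not on the class.
    z = sum(unnormalized.values())
    return {c: s / z for c, s in unnormalized.items()}


# Hypothetical parameters for two features: "red" and "round".
priors = {"apple": 0.5, "other": 0.5}
feature_probs = {"apple": [0.8, 0.9], "other": [0.3, 0.4]}
posterior = naive_bayes_posterior(priors, feature_probs, [1, 1])
```

For a red, round fruit the unnormalized scores are 0.36 and 0.06, so $Z = 0.42$ and the posterior for "apple" is roughly 0.857. With many features the product can underflow, so practical implementations often sum logarithms instead of multiplying probabilities.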

Models of this form are much more manageable, since they factor into a so-called class prior $p(C)$ and independent probability distributions $p(F_i \vert C)$. If there are $k$ classes and if a model for each $p(F_i \vert C=c)$ can be expressed in terms of $r$ parameters, then the corresponding naive Bayes model has $(k - 1) + n r k$ parameters. In practice, $k = 2$ (binary classification) and $r = 1$ (Bernoulli variables as features) are common, and so the total number of parameters of the naive Bayes model is $2n + 1$, where $n$ is the number of binary features used for prediction.
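The parameter count can be checked with a one-line function. The name `naive_bayes_param_count` is hypothetical; the formula is the one stated above.

```python
def naive_bayes_param_count(k, n, r):
    """(k - 1) free class-prior parameters plus r parameters for each
    of the n features in each of the k classes."""
    return (k - 1) + n * r * k


# Binary classification with 10 Bernoulli features: 2 * 10 + 1 parameters.
count = naive_bayes_param_count(k=2, n=10, r=1)
print(count)  # 21
```

The count grows linearly in the number of features $n$.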