http://blog.youkuaiyun.com/u011239443/article/details/76176743
For the basic principles of Naive Bayes and simple Python and Scala implementations, see: http://blog.youkuaiyun.com/u011239443/article/details/68061124
Bayesian Estimation
If a given class and a feature value never occur together in the training set, the frequency-based estimate of that conditional probability is 0. This is a problem: when this zero is multiplied with the other probabilities, it wipes out all the information they carry. So the probability estimates are usually corrected to guarantee that no estimate is ever 0. The most common smoothing is add-one smoothing (also called Laplace smoothing):
$$P(X_j = a_{jl} \mid Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl},\ y_i = c_k) + \lambda}{\sum_{i=1}^{N} I(y_i = c_k) + S_j \lambda}$$
Here $\lambda \ge 0$; taking $\lambda > 0$ is equivalent to adding a positive count $\lambda$ to the observed frequency of each possible value of the random variable.
$S_j$ is the number of distinct values that feature $X_j$ can take, so the smoothed estimate above still satisfies:
$$\sum_{l=1}^{S_j} P(X_j = a_{jl} \mid Y = c_k) = 1$$
Similarly:
$$P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K\lambda}$$
where $N$ is the number of training samples and $K$ is the number of label classes.
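To make the smoothing concrete, here is a minimal sketch in plain Scala (the function names and data layout are illustrative, not from Spark) that computes the two smoothed estimates above:

// Laplace-smoothed class prior: (count(y = c) + lambda) / (N + K * lambda).
def smoothedPrior(labels: Seq[Double], c: Double, numClasses: Int,
                  lambda: Double = 1.0): Double =
  (labels.count(_ == c) + lambda) / (labels.size + numClasses * lambda)

// Laplace-smoothed conditional for a categorical feature:
// (count(x_j = a, y = c) + lambda) / (count(y = c) + S_j * lambda).
def smoothedConditional(featureAndLabel: Seq[(Double, Double)], // (feature value, label)
                        a: Double, c: Double, numValues: Int,
                        lambda: Double = 1.0): Double = {
  val inClass = featureAndLabel.filter(_._2 == c)
  (inClass.count(_._1 == a) + lambda) / (inClass.size + numValues * lambda)
}

With $\lambda > 0$, a (feature value, class) pair that never appears in the training set now gets probability $\lambda / (\text{count}(y = c_k) + S_j\lambda)$ instead of 0, so it no longer zeroes out the product.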
Multinomial Naive Bayes
Multinomial Naive Bayes differs from the model above in what a feature's value means. In the model above, different values of a feature denote different categories; in multinomial Naive Bayes, a feature's value measures how strongly that feature speaks for a label. For example, whether the word Chinese appears once or ten times in a document does not change the category of the feature "Chinese"; it changes how much weight the feature "Chinese" carries in deciding the document's label.
$$\log P(Y = c_k) = \log\Big(\sum_{i=1}^{N} I(y_i = c_k) + \lambda\Big) - \log(N + K\lambda)$$
$$\log P(a_j \mid Y = c_k) = \log\Big(\sum_{i:\, y_i = c_k} a_j^{(i)} + \lambda\Big) - \log\Big(\sum_{j=1}^{n} \sum_{i:\, y_i = c_k} a_j^{(i)} + n\lambda\Big)$$
where $a_j^{(i)}$ is the value of feature $j$ in sample $i$ (e.g. a term frequency) and $n$ is the feature dimension.
Let's work through an example:

We set $\lambda = 1$; there are 6 distinct words in total, so the feature dimension is 6.
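The original figures carried the training data and the arithmetic. Assuming the classic text-classification training set that fits this setup (d1 = "Chinese Beijing Chinese", d2 = "Chinese Chinese Shanghai", d3 = "Chinese Macao", all labeled yes; d4 = "Tokyo Japan Chinese", labeled no; test document d5 = "Chinese Chinese Chinese Tokyo Japan"), the calculation runs as follows. Priors, with $N = 4$ and $K = 2$:
$$P(yes) = \frac{3+1}{4+2\cdot 1} = \frac{2}{3}, \qquad P(no) = \frac{1+1}{4+2\cdot 1} = \frac{1}{3}$$
Conditionals, with 8 word tokens in class yes and 3 in class no:
$$P(\text{Chinese} \mid yes) = \frac{5+1}{8+6} = \frac{3}{7}, \qquad P(\text{Tokyo} \mid yes) = P(\text{Japan} \mid yes) = \frac{0+1}{8+6} = \frac{1}{14}$$
$$P(\text{Chinese} \mid no) = \frac{1+1}{3+6} = \frac{2}{9}, \qquad P(\text{Tokyo} \mid no) = P(\text{Japan} \mid no) = \frac{1+1}{3+6} = \frac{2}{9}$$
Scoring d5:
$$P(yes \mid d5) \propto \frac{2}{3}\cdot\Big(\frac{3}{7}\Big)^3\cdot\frac{1}{14}\cdot\frac{1}{14} \approx 2.7\times 10^{-4}, \qquad P(no \mid d5) \propto \frac{1}{3}\cdot\Big(\frac{2}{9}\Big)^3\cdot\frac{2}{9}\cdot\frac{2}{9} \approx 1.8\times 10^{-4}$$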
So we classify d5 as yes.
API Usage
Below is an example of using Spark's Naive Bayes:
import org.apache.spark.ml.classification.NaiveBayes

// `spark` is the SparkSession (created automatically in spark-shell).
// Load data stored in LIBSVM format.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Hold out 30% of the data for testing.
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
// Train a Naive Bayes model (modelType defaults to "multinomial").
val model = new NaiveBayes()
  .fit(trainingData)
// Append prediction columns to the test set and display them.
val predictions = model.transform(testData)
predictions.show()
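To put a number on the test-set performance, the predictions can be scored with Spark ML's MulticlassClassificationEvaluator (the column names below are the Spark ML defaults):

import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Compare the predicted label column against the true label column.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
println(s"Test set accuracy = ${evaluator.evaluate(predictions)}")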
Source Code Analysis
Next, let's walk through the source code.
NaiveBayes
train
NaiveBayes().fit calls fit in NaiveBayes's parent class Predictor, which casts label and weight to Double, keeps the original label and weight metadata, and finally calls NaiveBayes's train:
override protected def train(dataset: Dataset[_]): NaiveBayesModel = {
  trainWithLabelCheck(dataset, positiveLabel = true)
}
trainWithLabelCheck:
ml assumes the input labels lie in [0, numClasses). But this implementation is also called by mllib's NaiveBayes, which allows other label encodings such as {-1, +1}; positiveLabel controls whether the labels need to be checked.
private[spark] def trainWithLabelCheck(
    dataset: Dataset[_],
    positiveLabel: Boolean): NaiveBayesModel = {
  if (positiveLabel && isDefined(thresholds)) {
    val numClasses = getNumClasses(dataset)
    require($(thresholds).length == numClasses, this.getClass.getSimpleName +
      ".train() called with non-matching numClasses and thresholds.length." +
      s" numClasses=$numClasses, but thresholds has length ${$(thresholds).length}")
  }

  // Multinomial NB requires every feature value to be a non-negative count.
  val modelTypeValue = $(modelType)
  val requireValues: Vector => Unit = {
    modelTypeValue match {
      case Multinomial =>
        requireNonnegativeValues
      ......
    }
  }

  val instr = Instrumentation.create(this, dataset)
  instr.logParams(labelCol, featuresCol, weightCol, predictionCol, rawPredictionCol,
    probabilityCol, modelType, smoothing, thresholds)

  val numFeatures = dataset.select(col($(featuresCol))).head().getAs[Vector](0).size
  instr.logNumFeatures(numFeatures)

  // If no weight column is configured, every instance gets weight 1.0.
  val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))

  // Aggregate per label: (sum of instance weights, element-wise sum of feature vectors).
  val aggregated = dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd
    .map { row => (row.getDouble(0), (row.getDouble(1), row.getAs[Vector](2)))
    }.aggregateByKey[(Double, DenseVector)]((0.0, Vectors.zeros(numFeatures).toDense))(
    seqOp = {
      case ((weightSum: Double, featureSum: DenseVector), (weight, features)) =>
        requireValues(features)
        BLAS.axpy(weight, features, featureSum)
        (weightSum + weight, featureSum)
    },
    combOp = {
      case ((weightSum1, featureSum1), (weightSum2, featureSum2)) =>
        BLAS.axpy(1.0, featureSum2, featureSum1)
        (weightSum1 + weightSum2, featureSum1)
    }).collect().sortBy(_._1)

  val numLabels = aggregated.length
  instr.logNumClasses(numLabels)
  val numDocuments = aggregated.map(_._2._1).sum

  val labelArray = new Array[Double](numLabels)
  val piArray = new Array[Double](numLabels)
  val thetaArray = new Array[Double](numLabels * numFeatures)

  val lambda = $(smoothing)
  // log(N + K * lambda): the shared denominator of the smoothed class priors.
  val piLogDenom = math.log(numDocuments + numLabels * lambda)
  var i = 0
  aggregated.foreach { case (label, (n, sumTermFreqs)) =>
    labelArray(i) = label
    // log P(Y = c_k) = log(n_k + lambda) - log(N + K * lambda)
    piArray(i) = math.log(n + lambda) - piLogDenom
    val thetaLogDenom = $(modelType) match {
      case Multinomial => math.log(sumTermFreqs.values.sum + numFeatures * lambda)
      ......
    }
    // log P(a_j | c_k) = log(count_jk + lambda) - thetaLogDenom
    var j = 0
    while (j < numFeatures) {
      thetaArray(i * numFeatures + j) = math.log(sumTermFreqs(j) + lambda) - thetaLogDenom
      j += 1
    }
    i += 1
  }

  val pi = Vectors.dense(piArray)
  val theta = new DenseMatrix(numLabels, numFeatures, thetaArray, true)
  val model = new NaiveBayesModel(uid, pi, theta).setOldLabels(labelArray)
  instr.logSuccess(model)
  model
}
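To see concretely what this produces, here is a minimal standalone sketch (plain Scala with toy per-label aggregates of my own; no Spark required) that mirrors the multinomial branch of the pi/theta computation above:

// Toy aggregates: label -> (weight sum n_k, per-feature count sums),
// playing the role of `aggregated` above (6 features, 2 labels).
val aggregated = Seq(
  (0.0, (1.0, Array(1.0, 0.0, 0.0, 0.0, 1.0, 1.0))),
  (1.0, (3.0, Array(5.0, 1.0, 1.0, 1.0, 0.0, 0.0))))
val lambda = 1.0
val numFeatures = 6
val numDocuments = aggregated.map(_._2._1).sum
val piLogDenom = math.log(numDocuments + aggregated.size * lambda)

// pi(k) = log P(c_k); theta(k)(j) = log P(a_j | c_k), exactly as in the loop above.
val pi = aggregated.map { case (_, (n, _)) => math.log(n + lambda) - piLogDenom }
val theta = aggregated.map { case (_, (_, counts)) =>
  val thetaLogDenom = math.log(counts.sum + numFeatures * lambda)
  counts.map(c => math.log(c + lambda) - thetaLogDenom)
}

On the worked example from earlier (label 1.0 = yes, label 0.0 = no), this reproduces $\log(3/7)$ for Chinese given yes, $\log(1/14)$ for Tokyo and Japan given yes, and so on.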
NaiveBayesModel
model.transform calls transform in NaiveBayesModel's parent class ProbabilisticClassificationModel which, depending on which output columns are configured, optionally predicts and appends the following three columns:
- predicted labels: Double, the predicted label
- raw predictions: Vector, whose values may be negative; the larger a value, the more plausible that class
- probability of each class: Vector, the probability of each class
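For example, with the model trained earlier, the three columns can be inspected directly (these column names are the Spark ML defaults):

predictions.select("prediction", "rawPrediction", "probability").show(5)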
Here we will only trace the predicted labels path. transform eventually calls predict:
override protected def predict(features: FeaturesType): Double = {
  raw2prediction(predictRaw(features))
}
predictRaw is what actually computes the raw predictions, while raw2prediction simply picks the most plausible class from them:
protected def raw2prediction(rawPrediction: Vector): Double = rawPrediction.argmax
predictRaw
Now let's look at NaiveBayesModel's implementation of predictRaw:
override protected def predictRaw(features: Vector): Vector = {
  $(modelType) match {
    case Multinomial =>
      multinomialCalculation(features)
    ......
  }
}
multinomialCalculation:
private def multinomialCalculation(features: Vector) = {
  // prob = theta * features, then prob += pi (axpy adds pi in place).
  val prob = theta.multiply(features)
  BLAS.axpy(1.0, pi, prob)
  prob
}
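In other words, for each class $c_k$ the raw prediction is exactly the unnormalized log posterior from the multinomial model above:
$$\text{rawPrediction}_k = \pi_k + (\theta x)_k = \log P(Y = c_k) + \sum_{j=1}^{n} x_j \log P(a_j \mid Y = c_k)$$
and raw2prediction takes the argmax over $k$.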