Source Code Analysis of Mahout's Parallel Bayes Classifier

Latest recommended article published on 2016-05-24 16:44:39

License: CC 4.0 BY-SA

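This article describes how the Bayes classifier is implemented in Apache Mahout, covering the function and workflow of the trainer, the model, and the classifier.

1 Bayes Trainer

Package: org.apache.mahout.classifier.bayes

Implementation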

The implementation is divided up into three parts:

The Trainer -- responsible for doing the counting of the words and the labels
The Model -- responsible for holding the training data in a useful way
The Classifier -- responsible for using the trainer's output to determine the category of previously unseen documents

1 Trainer

The trainer is manifested in several classes:

BayesDriver

Creates the Hadoop Bayes job and outputs the model. This class wraps four map/reduce classes:
common.BayesFeatureDriver
common.BayesTfIdfDriver
common.BayesWeightSummerDriver
BayesThetaNormalizerDriver

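The trainer's input is in KeyValueTextInputFormat: the first token on each line is the class label and the remaining tokens are the features (words), for example: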

hockey puck stick goalie forward defenseman referee ice checking slapshot helmet
football field football pigskin referee helmet turf tackle

hockey and football are the class labels; the rest are features.
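
To make the line format concrete, here is a minimal sketch of splitting one training line into a label and its features. LabeledLine is a hypothetical helper, not a Mahout class, and it assumes the tokens are separated by whitespace. Hadoop's KeyValueTextInputFormat itself splits key from value at the first tab by default, so real input would need a tab after the label.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: splits one training line into
// its class label (first token) and its features (remaining tokens).
// Assumption: tokens are separated by whitespace.
final class LabeledLine {
    final String label;
    final List<String> features;

    private LabeledLine(String label, List<String> features) {
        this.label = label;
        this.features = features;
    }

    static LabeledLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        // An empty line yields a single empty token, so reject it explicitly.
        if (tokens.length == 0 || tokens[0].isEmpty()) {
            throw new IllegalArgumentException("line has no label: '" + line + "'");
        }
        List<String> features =
            new ArrayList<>(Arrays.asList(tokens).subList(1, tokens.length));
        return new LabeledLine(tokens[0], features);
    }

    public static void main(String[] args) {
        LabeledLine doc = parse("football field football pigskin referee helmet turf tackle");
        System.out.println(doc.label + " -> " + doc.features);
    }
}
```

Note that a feature may share its name with the label, as "football" does in the second sample line; only the first token is treated as the label.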

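2 Model

Package: org.apache.mahout.classifier.bayes

Responsible for training the Bayes classifier. Input format: each line is one document; the first token is the class label and the rest are features (words).

Depending on the command-line arguments, this class calls one of two trainers: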

`static void trainCNaiveBayes(org.apache.hadoop.fs.Path dir, org.apache.hadoop.fs.Path outputDir, BayesParameters params)`
`static void trainNaiveBayes(org.apache.hadoop.fs.Path dir, org.apache.hadoop.fs.Path outputDir, BayesParameters params)`

trainCNaiveBayes calls the CBayesDriver class; trainNaiveBayes calls the BayesDriver class.

The CBayesDriver and BayesDriver classes are analyzed in turn below.

BayesDriver package: org.apache.mahout.classifier.bayes.mapreduce.bayes

public class BayesDriver extends Object implements BayesJob

It implements the BayesJob interface.

Its runJob function calls four map/reduce job classes:

First: BayesFeatureDriver reads the features in each document, normalized by the length of each document.

Second: BayesTfIdfDriver calculates the TfIdf for each word in each label.

Third: BayesWeightSummerDriver calculates the sums of weights for each label and for each feature.

Fourth: BayesThetaNormalizerDriver calculates the normalization factor Sigma_W_ij for each complement class.
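
As an illustration of the first step, the sketch below turns a document's features into length-normalized weights. It assumes that "length" means the Euclidean norm of the raw count vector; that is an assumption for illustration only, and the exact scaling used by BayesFeatureMapper should be checked against the Mahout source. LengthNormalizedFeatures is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Mahout code: counts each feature in one document
// and divides the counts by the Euclidean norm of the count vector.
// Assumption: "document length" is taken to be that norm.
final class LengthNormalizedFeatures {
    static Map<String, Double> weigh(List<String> features) {
        // Raw term counts, in first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String feature : features) {
            counts.merge(feature, 1, Integer::sum);
        }
        double sumOfSquares = 0.0;
        for (int count : counts.values()) {
            sumOfSquares += (double) count * count;
        }
        double norm = Math.sqrt(sumOfSquares);
        // For an empty document the loop body never runs, so no division by zero.
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            weights.put(entry.getKey(), entry.getValue() / norm);
        }
        return weights;
    }
}
```

With this choice the squared weights of every non-empty document sum to 1, so long and short documents contribute on the same scale.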

These four classes are analyzed in turn below:

The first map/reduce class: BayesFeatureDriver

Package: org.apache.mahout.classifier.bayes.mapreduce.common

Output key type: StringTuple.class

Output value type: DoubleWritable.class

Input format: KeyValueTextInputFormat.class

Output format: BayesFeatureOutputFormat.class

Mapper: BayesFeatureMapper.class

Reducer: BayesFeatureReducer.class

Note: BayesFeatureDriver can be run standalone, with the following default input and output:

input = new Path("/home/drew/mahout/bayes/20news-input");

output = new Path("/home/drew/mahout/bayes/20-news-features");

p = new BayesParameters(1); // gram size defaults to 1

The output consists of three files:

$OUTPUT/trainer-termDocCount

$OUTPUT/trainer-wordFreq

$OUTPUT/trainer-featureCount

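The second map/reduce class, BayesTfIdfDriver, computes TF-IDF values from the output files of the first job. When it finishes, it deletes these three intermediate files and generates the file trainer-tfIdf, which stores the tf-idf values of the text features.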
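
For readers unfamiliar with TF-IDF, the following sketch shows the textbook form of the weighting, tf * ln(N / df), treating each label as one "document". This is not taken from BayesTfIdfDriver; the class name TfIdfSketch, the input layout, and the exact formula are assumptions chosen for illustration, and Mahout's own computation may differ in its details.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not Mahout's BayesTfIdfDriver: textbook tf * ln(N / df),
// where N is the number of labels and df is the number of labels that
// contain the term.
final class TfIdfSketch {
    static Map<String, Map<String, Double>> tfIdf(Map<String, Map<String, Double>> tfByLabel) {
        int labelCount = tfByLabel.size();
        // df: in how many labels does each term occur?
        Map<String, Integer> labelsWithTerm = new HashMap<>();
        for (Map<String, Double> terms : tfByLabel.values()) {
            for (String term : terms.keySet()) {
                labelsWithTerm.merge(term, 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Double>> labelEntry : tfByLabel.entrySet()) {
            Map<String, Double> weighted = new LinkedHashMap<>();
            for (Map.Entry<String, Double> termEntry : labelEntry.getValue().entrySet()) {
                double idf = Math.log((double) labelCount / labelsWithTerm.get(termEntry.getKey()));
                weighted.put(termEntry.getKey(), termEntry.getValue() * idf);
            }
            result.put(labelEntry.getKey(), weighted);
        }
        return result;
    }
}
```

Under this formula a term that occurs under every label, such as "referee" in the sample input, gets idf = ln(1) = 0 and therefore carries no weight, while a term unique to one label is weighted up.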

Third: BayesWeightSummerDriver

Output key: StringTuple.class

Output value: DoubleWritable.class

Input path: the trainer-tfIdf file generated by the second map/reduce job

Output: the trainer-weights file

Input file format: SequenceFileInputFormat.class

Output file format: BayesWeightSummerOutputFormat.class
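
The kind of aggregation this job performs can be sketched in plain Java as follows. WeightSums is a hypothetical class, not Mahout code, and the nested-map input (label, then feature, then weight) is an assumed stand-in for the trainer-tfIdf records; the sketch only shows summing weights per label, per feature, and overall.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch, not Mahout's BayesWeightSummerDriver: sums the
// per-(label, feature) weights along each axis.
final class WeightSums {
    final Map<String, Double> perLabel = new TreeMap<>();
    final Map<String, Double> perFeature = new TreeMap<>();
    double total;

    // weights: label -> (feature -> weight), an assumed in-memory layout.
    static WeightSums of(Map<String, Map<String, Double>> weights) {
        WeightSums sums = new WeightSums();
        for (Map.Entry<String, Map<String, Double>> labelEntry : weights.entrySet()) {
            for (Map.Entry<String, Double> featureEntry : labelEntry.getValue().entrySet()) {
                double weight = featureEntry.getValue();
                sums.perLabel.merge(labelEntry.getKey(), weight, Double::sum);
                sums.perFeature.merge(featureEntry.getKey(), weight, Double::sum);
                sums.total += weight;
            }
        }
        return sums;
    }
}
```

In a map/reduce setting the same sums would be produced by emitting one key per label and one per feature from the mapper and adding the values in the reducer.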

The fourth job: BayesThetaNormalizerDriver

Output key: StringTuple.class

Output value: DoubleWritable.class

Input path: FileInputFormat.addInputPath(conf, new Path(output, "trainer-tfIdf/trainer-tfIdf")); that is, it uses the output of the second job, the trainer-tfIdf file

Output path: Path outPath = new Path(output, "trainer-thetaNormalizer");

It generates the file trainer-thetaNormalizer.

Output file format: SequenceFileOutputFormat.class

Once these four jobs have finished, the Bayes model is complete. In total, three directories are generated and saved:

trainer-tfIdf

trainer-thetaNormalizer

trainer-weights

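With the model built, the next step is to test how well the classifier performs.

Class invoked: TestClassifier

Package: org.apache.mahout.classifier.bayes

Depending on the command-line arguments, it runs either sequentially or in parallel as map/reduce.

Parallel execution calls the BayesClassifierDriver class.

The BayesClassifierDriver class is analyzed below.

Package: org.apache.mahout.classifier.bayes.mapreduce.bayes

Input format: KeyValueTextInputFormat.class

Output format: SequenceFileOutputFormat.class

When execution finishes, a confusion matrix (ConfusionMatrix) is used to display the results.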
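
For reference, a confusion matrix simply counts, for each actual label, how often each label was predicted. The sketch below is a minimal stand-alone version; SimpleConfusionMatrix and its methods are hypothetical and do not reproduce the API of Mahout's ConfusionMatrix class.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Mahout's ConfusionMatrix: rows are actual labels,
// columns are predicted labels.
final class SimpleConfusionMatrix {
    private final List<String> labels;
    private final int[][] counts;

    SimpleConfusionMatrix(List<String> labels) {
        this.labels = new ArrayList<>(labels);
        this.counts = new int[labels.size()][labels.size()];
    }

    void add(String actual, String predicted) {
        counts[index(actual)][index(predicted)]++;
    }

    int count(String actual, String predicted) {
        return counts[index(actual)][index(predicted)];
    }

    // Fraction of documents on the diagonal; 0.0 if nothing was added yet.
    double accuracy() {
        int correct = 0;
        int total = 0;
        for (int row = 0; row < counts.length; row++) {
            for (int col = 0; col < counts[row].length; col++) {
                total += counts[row][col];
                if (row == col) {
                    correct += counts[row][col];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    private int index(String label) {
        int i = labels.indexOf(label);
        if (i < 0) {
            throw new IllegalArgumentException("unknown label: " + label);
        }
        return i;
    }
}
```

Each classified test document contributes one call to add(actual, predicted); the off-diagonal cells then show which classes are confused with which.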