1 Bayes Trainer
Package: org.apache.mahout.classifier.bayes
Implementation
The implementation is divided up into three parts:
- The Trainer -- responsible for counting the words and the labels
- The Model -- responsible for holding the training data in a useful way
- The Classifier -- responsible for using the trainer's output to determine the category of previously unseen documents
1 Trainer
The trainer is manifested in several classes:
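- Creates the Hadoop Bayes job and outputs the model. This class wraps four map/reduce classes.
The trainer reads its input with KeyValueTextInputFormat. On each line the first token is the class label and the remaining tokens are the features (words), as in the following lines: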
hockey puck stick goalie forward defenseman referee ice checking slapshot helmet
football field football pigskin referee helmet turf tackle
Here hockey and football are the class labels, and the remaining words are the features.
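The line layout described above can be sketched in a few lines of Java. TrainingLineParser is a hypothetical helper written for illustration, not a Mahout class, and it assumes that the label and the features are separated by whitespace (tab or space).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: splits one training line into
// its class label (first token) and its features (remaining tokens).
public class TrainingLineParser {

    // Returns the first whitespace-separated token, i.e. the class label.
    public static String label(String line) {
        return tokens(line).get(0);
    }

    // Returns every token after the label, in their original order.
    public static List<String> features(String line) {
        List<String> all = tokens(line);
        return new ArrayList<>(all.subList(1, all.size()));
    }

    // Assumption: tokens are separated by any run of whitespace.
    private static List<String> tokens(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            throw new IllegalArgumentException("empty training line");
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    public static void main(String[] args) {
        String line = "football field football pigskin referee helmet turf tackle";
        System.out.println(label(line));
        System.out.println(features(line));
    }
}
```

Note that a feature may repeat the label word, as football does in the second sample line; only the first token is treated as the label.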
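2 Model
Package: org.apache.mahout.classifier.bayes
This part is responsible for training the Bayes classifier. Input format: each line is one document; the first token is the class label and the remaining tokens are the features (words).
Depending on the command-line arguments, this class invokes one of two trainers: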
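The trainCNaiveBayes function calls the CBayesDriver class; trainNaiveBayes calls the BayesDriver class.
The CBayesDriver and BayesDriver classes are analyzed below.
BayesDriver package: org.apache.mahout.classifier.bayes.mapreduce.bayes
public class BayesDriver extends Object implements BayesJob
It implements the BayesJob interface.
The runJob function of this class calls four map/reduce job classes: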
First: BayesFeatureDriver reads the features in each document, normalized by the length of each document.
Second: BayesTfIdfDriver calculates the TF-IDF for each word in each label.
Third: BayesWeightSummerDriver calculates the sums of weights for each label and for each feature.
Fourth: BayesThetaNormalizerDriver calculates the normalization factor Sigma_W_ij for each complement class.
Each of the four classes is examined below.
The first map/reduce class: BayesFeatureDriver
Package: org.apache.mahout.classifier.bayes.mapreduce.common
Output key type: StringTuple.class
Output value type: DoubleWritable.class
Input format: KeyValueTextInputFormat.class
Output format: BayesFeatureOutputFormat.class
MAP: BayesFeatureMapper.class
REDUCE: BayesFeatureReducer.class
Note: BayesFeatureDriver can be run on its own, with these default input and output settings:
input = new Path("/home/drew/mahout/bayes/20news-input");
output = new Path("/home/drew/mahout/bayes/20-news-features");
p = new BayesParameters(1); // gramSize defaults to 1
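To make the gram size parameter concrete, here is a minimal sketch of how word n-grams of a given size can be built from a token list. NGramSketch is a hypothetical helper, not Mahout code; whether the real job also keeps shorter grams alongside the longest ones is not stated in this post, so the sketch emits only grams of exactly the requested size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not Mahout code: builds word n-grams of a fixed size.
public class NGramSketch {

    public static List<String> ngrams(List<String> tokens, int gramSize) {
        if (gramSize < 1) {
            throw new IllegalArgumentException("gramSize must be >= 1");
        }
        List<String> grams = new ArrayList<>();
        // Slide a window of gramSize tokens across the document.
        for (int start = 0; start + gramSize <= tokens.size(); start++) {
            grams.add(String.join(" ", tokens.subList(start, start + gramSize)));
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("puck", "stick", "goalie");
        System.out.println(ngrams(doc, 1)); // [puck, stick, goalie]
        System.out.println(ngrams(doc, 2)); // [puck stick, stick goalie]
    }
}
```

With a gram size of 1, the default mentioned above, every feature is simply a single word.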
BayesFeatureDriver writes three output files:
$OUTPUT/trainer-termDocCount
$OUTPUT/trainer-wordFreq
$OUTPUT/trainer-featureCount
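The second map/reduce class, BayesTfIdfDriver, computes the TF-IDF values from the output files of the first job. When it finishes, it deletes these three intermediate files and writes the file trainer-tfIdf, which stores the TF-IDF values of the text features.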
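As a reminder of what a TF-IDF value is, the sketch below uses the textbook form, term frequency multiplied by the natural log of (total documents / documents containing the term). This is an assumption made for illustration: the post does not state the exact weighting BayesTfIdfDriver applies, and TfIdfSketch and its parameter names are hypothetical.

```java
// Hypothetical illustration of textbook TF-IDF; not necessarily the
// exact formula used inside Mahout's BayesTfIdfDriver.
public class TfIdfSketch {

    public static double tfIdf(double termFrequency, long docsWithTerm, long totalDocs) {
        if (docsWithTerm <= 0 || totalDocs < docsWithTerm) {
            throw new IllegalArgumentException("need 0 < docsWithTerm <= totalDocs");
        }
        // Inverse document frequency: rarer terms get a larger factor.
        double idf = Math.log((double) totalDocs / docsWithTerm);
        return termFrequency * idf;
    }

    public static void main(String[] args) {
        // A term that occurs in every document carries no weight.
        System.out.println(tfIdf(2.0, 10, 10)); // 0.0
        // A term that occurs in 2 of 8 documents.
        System.out.println(tfIdf(0.25, 2, 8));
    }
}
```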
The third: BayesWeightSummerDriver
Output key: StringTuple.class
Output value: DoubleWritable.class
Input path: the trainer-tfIdf file produced by the second map/reduce job
Output: the trainer-weights file
Input file format: SequenceFileInputFormat.class
Output file format: BayesWeightSummerOutputFormat.class
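The kind of aggregation this job performs, summing weights per label and per feature, can be sketched in memory as follows. WeightSumSketch and the nested-map layout (label to feature to weight) are assumptions for illustration; the real job runs as map/reduce over the trainer-tfIdf sequence file.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory illustration of summing weights by label and by
// feature; the nested map stands in for the (label, feature) -> weight data.
public class WeightSumSketch {

    // Total weight of all features under each label.
    public static Map<String, Double> sumPerLabel(Map<String, Map<String, Double>> weights) {
        Map<String, Double> sums = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> byLabel : weights.entrySet()) {
            for (double w : byLabel.getValue().values()) {
                sums.merge(byLabel.getKey(), w, Double::sum);
            }
        }
        return sums;
    }

    // Total weight of each feature across all labels.
    public static Map<String, Double> sumPerFeature(Map<String, Map<String, Double>> weights) {
        Map<String, Double> sums = new HashMap<>();
        for (Map<String, Double> features : weights.values()) {
            for (Map.Entry<String, Double> e : features.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> weights = new HashMap<>();
        Map<String, Double> hockey = new HashMap<>();
        hockey.put("puck", 1.5);
        hockey.put("referee", 0.5);
        Map<String, Double> football = new HashMap<>();
        football.put("referee", 2.5);
        weights.put("hockey", hockey);
        weights.put("football", football);

        System.out.println(sumPerLabel(weights).get("hockey"));    // 2.0
        System.out.println(sumPerFeature(weights).get("referee")); // 3.0
    }
}
```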
The fourth job: BayesThetaNormalizerDriver
Output key: StringTuple.class
Output value: DoubleWritable.class
Input path: FileInputFormat.addInputPath(conf, new Path(output, "trainer-tfIdf/trainer-tfIdf")); that is, it consumes the output of the second job, the trainer-tfIdf file
Output path: Path outPath = new Path(output, "trainer-thetaNormalizer");
This produces the file trainer-thetaNormalizer
Output file format: SequenceFileOutputFormat.class
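The post does not spell out how the normalization factor is computed. One common way to normalize naive Bayes weights is to sum the absolute log of the smoothed weights for a label. The sketch below shows that form only; the smoothing constant alpha, the use of the feature count as vocabulary size, and the class name ThetaNormalizerSketch are all assumptions, and the formula may differ from what BayesThetaNormalizerDriver actually does.

```java
// Hypothetical illustration of a per-label normalization factor:
// the sum over features of |log((w + alpha) / (sum of w + alpha * V))|.
// This is an assumed form, not taken from the Mahout source.
public class ThetaNormalizerSketch {

    public static double normalizer(double[] featureWeights, double alpha) {
        if (featureWeights.length == 0 || alpha <= 0.0) {
            throw new IllegalArgumentException("need at least one feature and alpha > 0");
        }
        double total = 0.0;
        for (double w : featureWeights) {
            total += w;
        }
        // Smoothed denominator; V is taken to be the number of features.
        double denominator = total + alpha * featureWeights.length;
        double sum = 0.0;
        for (double w : featureWeights) {
            sum += Math.abs(Math.log((w + alpha) / denominator));
        }
        return sum;
    }

    public static void main(String[] args) {
        // Two features with weights 1.0 and 3.0 under one label, alpha = 1.
        System.out.println(normalizer(new double[] {1.0, 3.0}, 1.0));
    }
}
```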
Once these four jobs have finished, the Bayes model is complete. In total, three directories are generated and saved:
trainer-tfIdf
trainer-thetaNormalizer
trainer-weights
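With the model built, the next step is to test how well the classifier performs.
Class invoked: TestClassifier
Package: org.apache.mahout.classifier.bayes
Depending on the command-line arguments, it runs either sequentially or as a parallel map/reduce job.
The parallel run calls the BayesClassifierDriver class.
BayesClassifierDriver is analyzed below.
Package: org.apache.mahout.classifier.bayes.mapreduce.bayes
Input format: KeyValueTextInputFormat.class
Output format: SequenceFileOutputFormat.class
When the run finishes, the confusion matrix class ConfusionMatrix is used to display the results.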
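For readers unfamiliar with confusion matrices, the sketch below shows the idea: a table of counts indexed by actual label and predicted label, from which accuracy follows directly. SimpleConfusionMatrix is a hypothetical stand-in written for this example; it is not Mahout's ConfusionMatrix class and does not mirror its API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for illustration; not Mahout's ConfusionMatrix.
public class SimpleConfusionMatrix {
    private final List<String> labels;
    private final int[][] counts; // counts[actual][predicted]

    public SimpleConfusionMatrix(List<String> labels) {
        this.labels = new ArrayList<>(labels);
        this.counts = new int[labels.size()][labels.size()];
    }

    // Records one classified document.
    public void add(String actual, String predicted) {
        counts[index(actual)][index(predicted)]++;
    }

    public int count(String actual, String predicted) {
        return counts[index(actual)][index(predicted)];
    }

    // Fraction of documents on the diagonal (actual == predicted).
    public double accuracy() {
        int correct = 0;
        int total = 0;
        for (int i = 0; i < counts.length; i++) {
            for (int j = 0; j < counts[i].length; j++) {
                total += counts[i][j];
                if (i == j) {
                    correct += counts[i][j];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    private int index(String label) {
        int i = labels.indexOf(label);
        if (i < 0) {
            throw new IllegalArgumentException("unknown label: " + label);
        }
        return i;
    }

    public static void main(String[] args) {
        SimpleConfusionMatrix m = new SimpleConfusionMatrix(Arrays.asList("hockey", "football"));
        m.add("hockey", "hockey");
        m.add("hockey", "football");
        m.add("football", "football");
        m.add("football", "football");
        System.out.println(m.count("hockey", "football")); // 1
        System.out.println(m.accuracy());                  // 0.75
    }
}
```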