1. Download the 20 Newsgroups dataset
The dataset is available at http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz. Unpacking the archive yields two directories, 20news-bydate-test and 20news-bydate-train; merge the contents of both into a single directory named 20news-all.
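A minimal shell sketch of this step; ${WORK_DIR}, a local working directory, is an assumption here (it is the same variable the later commands rely on):
$ cd ${WORK_DIR}
$ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
$ tar -xzf 20news-bydate.tar.gz
$ mkdir -p 20news-all
$ cp -R 20news-bydate-train/* 20news-all/ //copy the training messages
$ cp -R 20news-bydate-test/* 20news-all/ //merge in the test messages; the category folders are combined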
2. Convert the dataset to a SequenceFile so Mahout can work with it
Running this step locally requires configuring Mahout for local mode:
export MAHOUT_HOME=MAHOUT_DIR //MAHOUT_DIR is your local Mahout installation directory
export MAHOUT_LOCAL=$MAHOUT_HOME //any non-empty value makes the mahout script run locally instead of on Hadoop
This step should be done locally: Hadoop is good at processing large files, whereas this step serializes a large number of small files into one big SequenceFile.
$ mahout seqdirectory
-i ${WORK_DIR}/20news-all //input directory
-o ${WORK_DIR}/20news-seq //output directory
-ow //overwrite existing output
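While still in local mode, the resulting SequenceFile can be spot-checked with Mahout's seqdumper utility (a sketch; the -n flag, which limits how many key/value pairs are printed, is present in the Mahout 0.x utilities but worth verifying against your version):
$ mahout seqdumper -i ${WORK_DIR}/20news-seq -n 2 //print the first two records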
When the conversion finishes, remove the local MAHOUT_LOCAL setting:
export -n MAHOUT_LOCAL //un-export MAHOUT_LOCAL so subsequent commands run on the Hadoop cluster

3. Upload 20news-seq to HDFS (${HDFS_DIR} below is the target working directory on HDFS)
hadoop fs -put ${WORK_DIR}/20news-seq ${HDFS_DIR}/
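Optionally confirm the upload before moving on (hadoop fs -ls is the standard HDFS listing command):
$ hadoop fs -ls ${HDFS_DIR}/20news-seq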
4. Convert and preprocess the dataset into <Text,VectorWritable> format
$ mahout seq2sparse
-i ${HDFS_DIR}/20news-seq
-o ${HDFS_DIR}/20news-vectors
-lnorm //(Optional) log-normalize the output vectors
-nv //(Optional) output NamedVectors, which keep each document's ID attached to its vector
-wt tfidf //the weighting scheme to use: TF or TFIDF (default: TFIDF)
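seq2sparse writes several subdirectories under the output path; in Mahout 0.x these include tfidf-vectors, tf-vectors, dictionary.file-0 and tokenized-documents (names may vary by version). The tfidf-vectors subdirectory is what the next step consumes:
$ hadoop fs -ls ${HDFS_DIR}/20news-vectors //list the generated subdirectories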
5. Split the dataset into a training set and a test set
$ mahout split
-i ${HDFS_DIR}/20news-vectors/tfidf-vectors
--trainingOutput ${HDFS_DIR}/20news-train-vectors //The training data output directory
--testOutput ${HDFS_DIR}/20news-test-vectors //The test data output directory
--randomSelectionPct 20 //percentage of items randomly selected as test data (here 20%)
--overwrite
--sequenceFiles //Set if the input files are sequence files. Default is false
-xm sequential //execution method: run sequentially rather than as a MapReduce job
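With --randomSelectionPct 20 the split is roughly 80% training / 20% test. A quick way to sanity-check the two sides, assuming your Mahout build's seqdumper supports the -c/--count flag (it does in 0.x), is:
$ mahout seqdumper -i ${HDFS_DIR}/20news-train-vectors -c //count the training vectors
$ mahout seqdumper -i ${HDFS_DIR}/20news-test-vectors -c //count the test vectors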
6. Train the naive Bayes classification model
$ mahout trainnb
-i ${HDFS_DIR}/20news-train-vectors
-el //extract the labels from the input
-o ${HDFS_DIR}/model
-li ${HDFS_DIR}/labelindex //where to store the label index
-ow //overwrite
-c //train a complementary naive Bayes model
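The label index written by -li maps each category name to a numeric ID; it is a small SequenceFile, so it can be printed the same way as above (a sketch under the same assumptions):
$ mahout seqdumper -i ${HDFS_DIR}/labelindex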
7. Test the naive Bayes classification model
$ mahout testnb
-i ${HDFS_DIR}/20news-test-vectors
-m ${HDFS_DIR}/model //the model built during training
-l ${HDFS_DIR}/labelindex //the location of the label index
-ow //overwrite
-o ${HDFS_DIR}/20news-testing //output directory for the test results
-c //test the complementary model
8. Results
=======================================================
Summary
-------------------------------------------------------
Correctly Classified Instances : 2107 91.2121%
Incorrectly Classified Instances : 203 8.7879%
Total Classified Instances : 2310
=======================================================
Confusion Matrix
-------------------------------------------------------
a b c d e f g h i j k l m n o p q r s t <--Classified as
91 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 4 0 | 97 a = alt.atheism
0 104 0 2 1 3 3 0 0 0 0 0 1 0 1 0 0 0 0 0 | 115 b = comp.graphics
0 14 84 14 4 8 2 0 0 0 0 0 1 0 0 0 0 0 1 1 | 129 c = comp.os.ms-windows.misc
0 2 2 121 4 0 0 1 0 0 0 0 3 0 0 0 0 0 0 0 | 133 d = comp.sys.ibm.pc.hardware
0 4 0 2 122 0 4 0 0 0 0 0 1 0 0 0 0 0 0 0 | 133 e = comp.sys.mac.hardware
0 6 1 3 2 109 1 0 0 0 0 0 0 0 1 0 0 0 0 0 | 123 f = comp.windows.x
0 1 0 4 1 1 93 4 2 0 0 1 3 0 0 0 0 0 0 0 | 110 g = misc.forsale
0 0 0 0 0 0 1 101 5 0 0 0 3 0 1 0 0 0 0 0 | 111 h = rec.autos
0 0 0 0 1 0 0 3 132 0 0 0 0 0 0 0 0 0 0 0 | 136 i = rec.motorcycles
0 0 0 1 0 0 0 0 1 119 1 0 0 0 0 0 0 0 0 0 | 122 j = rec.sport.baseball
0 0 0 0 0 0 1 1 0 0 125 0 0 0 0 0 0 0 0 0 | 127 k = rec.sport.hockey
0 2 0 0 0 2 0 0 0 0 0 117 0 0 1 0 0 0 0 1 | 123 l = sci.crypt
0 2 0 1 3 0 2 1 0 0 0 0 112 1 0 0 0 0 0 0 | 122 m = sci.electronics
0 0 0 1 0 0 1 0 1 0 0 1 2 100 2 0 0 0 0 2 | 110 n = sci.med
0 1 0 0 0 0 0 0 0 0 0 0 0 0 112 0 0 0 0 0 | 113 o = sci.space
1 1 0 0 0 0 1 0 0 0 0 0 0 0 2 96 0 0 2 1 | 104 p = soc.religion.christian
1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 110 0 0 2 | 115 q = talk.politics.mideast
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 111 0 0 | 113 r = talk.politics.guns
9 0 0 0 0 0 0 0 1 1 1 0 0 0 0 4 1 0 66 0 | 83 s = talk.religion.misc
0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 3 2 3 82 | 91 t = talk.politics.misc
=======================================================
Statistics
-------------------------------------------------------
Kappa 0.8648
Accuracy 91.2121%
Reliability 86.7628%
Reliability (standard deviation) 0.2128
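For reference, the Accuracy line is just the fraction of correctly classified instances, and Kappa is Cohen's chance-corrected agreement (standard definitions, not Mahout-specific):

\[ \text{Accuracy} = \frac{2107}{2310} \approx 91.2121\%, \qquad \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed accuracy and \(p_e\) is the agreement expected by chance, estimated from the confusion matrix's row and column totals.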
This article walked through building a text classifier with Apache Mahout, covering the key steps: preparing the dataset, preprocessing the data, training a naive Bayes classification model, and evaluating the classification results.