The following is the text classification process based on the CBayes algorithm, as provided on the official website:
End-to-end commands to build a CBayes model for 20 Newsgroups:
The 20 newsgroup example script issues the following commands as outlined above. We can build a CBayes classifier from the command line by following the process in the script:
Be sure that MAHOUT_HOME/bin and HADOOP_HOME/bin are in your $PATH
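The PATH requirement above can be checked before any step is run. The following is a minimal sketch, not part of the official script; `check_tools` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: report any required command that is not on $PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing from PATH: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Typical use before starting the walkthrough:
#   check_tools mahout hadoop || echo "add MAHOUT_HOME/bin and HADOOP_HOME/bin to PATH"
```

Failing early here is cheaper than discovering a missing binary after the dataset has already been downloaded and extracted.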
-
Create a working directory for the dataset and all input/output.
$ export WORK_DIR=/tmp/mahout-work-${USER}
$ mkdir -p ${WORK_DIR}
-
Download and extract the 20news-bydate.tar.gz from the 20newsgroups dataset to the working directory.
$ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
$ mkdir -p ${WORK_DIR}/20news-bydate
$ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
$ mkdir ${WORK_DIR}/20news-all
$ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
-
If you're running on a Hadoop cluster, copy the dataset to HDFS:
$ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
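The HDFS copy only applies when the jobs actually run on a cluster. The sketch below makes that condition explicit. It assumes that cluster mode means `HADOOP_HOME` is set and `MAHOUT_LOCAL` is empty; that rule and the helper name `should_copy_to_hdfs` are assumptions made for illustration, so adjust them to match your setup.

```shell
# Hypothetical helper. Assumption: jobs run on the cluster when HADOOP_HOME
# is set and MAHOUT_LOCAL is empty; otherwise the local filesystem is used.
should_copy_to_hdfs() {
  [ -n "${HADOOP_HOME:-}" ] && [ -z "${MAHOUT_LOCAL:-}" ]
}

# Typical use:
#   if should_copy_to_hdfs; then
#     hadoop dfs -put "${WORK_DIR}/20news-all" "${WORK_DIR}/20news-all"
#   fi
```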
-
Convert the full 20newsgroups dataset into a < Text, Text > sequence file.
$ mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow
-
Convert and preprocess the dataset into a < Text, VectorWritable > sequence file containing term frequencies for each document.
$ mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. -ng 2 for bi-grams or -n 2 for L2 length normalization. See the "Creating vectors from text" page for a list of all seq2sparse options.
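To make the two variations mentioned above concrete, the sketch below prints the corresponding command lines without running them. The helper `seq2sparse_variant` and the suffixed output directories (`20news-vectors-bigram`, `20news-vectors-l2`) are hypothetical, chosen so that a variant does not overwrite the vectors from the main walkthrough. Only the `-ng 2` and `-n 2` options come from the text.

```shell
WORK_DIR="${WORK_DIR:-/tmp/mahout-work-${USER:-user}}"

# Hypothetical helper: print a seq2sparse command line with extra options.
# Nothing is executed; the first argument names the output directory suffix.
seq2sparse_variant() {
  suffix="$1"; shift
  echo "mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors-${suffix} -lnorm -nv -wt tfidf $*"
}

seq2sparse_variant bigram -ng 2   # bi-grams
seq2sparse_variant l2 -n 2        # L2 length normalization
```

If a variant is used, the split step would need to read from the matching `tfidf-vectors` subdirectory of that output directory.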
-
Split the preprocessed dataset into training and testing sets.
$ mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
-
Train the classifier.
$ mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow -c
-
Test the classifier.
$ mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing -c
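Testing only makes sense once training has succeeded, so the last two steps can be chained. The following is a rough sketch under that assumption; `run`, `train_and_test`, and the `DRY_RUN` variable are hypothetical names introduced here and are not part of Mahout. With `DRY_RUN=1` the commands are printed instead of executed, which allows paths to be reviewed before starting a long job.

```shell
# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Train the classifier, then test it only if training succeeded.
train_and_test() {
  run mahout trainnb -i "${WORK_DIR}/20news-train-vectors" -el \
      -o "${WORK_DIR}/model" -li "${WORK_DIR}/labelindex" -ow -c &&
  run mahout testnb -i "${WORK_DIR}/20news-test-vectors" \
      -m "${WORK_DIR}/model" -l "${WORK_DIR}/labelindex" \
      -ow -o "${WORK_DIR}/20news-testing" -c
}

# Typical use:
#   DRY_RUN=1 train_and_test   # print both commands
#   train_and_test             # actually train and test
```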