Mahout Text Classification Process

猿_area

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/zmc_happy_blog/article/details/25705695

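The following is the text classification process based on the CBayes (Complement Naive Bayes) algorithm, as provided on the official Mahout website: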

End to end commands to build a CBayes model for 20 Newsgroups:

The 20 newsgroup example script issues the following commands as outlined above. We can build a CBayes classifier from the command line by following the process in the script:

Be sure that $MAHOUT_HOME/bin and $HADOOP_HOME/bin are in your $PATH.
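
A quick check along the following lines can confirm this before starting. It is only a minimal sketch: `require_cmd` is a hypothetical helper name, not part of Mahout or Hadoop, and it reports what is missing rather than changing the PATH.

```shell
# Hypothetical helper: report whether a command can be found on $PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1 (add its bin directory to PATH)"
        return 1
    fi
}

# Check the two launchers used in the steps that follow.
for cmd in mahout hadoop; do
    require_cmd "$cmd" || true
done
```

If either command is reported as missing, the later `mahout` and `hadoop` invocations will fail with "command not found".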

Create a working directory for the dataset and all input/output.

    $ export WORK_DIR=/tmp/mahout-work-${USER}
    $ mkdir -p ${WORK_DIR}

Download and extract the 20news-bydate.tar.gz from the 20newsgroups dataset to the working directory.

    $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
        -o ${WORK_DIR}/20news-bydate.tar.gz
    $ mkdir -p ${WORK_DIR}/20news-bydate
    $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
    $ mkdir ${WORK_DIR}/20news-all
    $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
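
After the copy, `20news-all` should contain one subdirectory per newsgroup, so 20 in total for this dataset. The sketch below is one way to sanity-check that. `count_subdirs` and `count_files` are hypothetical helper names, and the fallback to the current directory when `WORK_DIR` is unset is an assumption made only so the snippet runs on its own.

```shell
# Hypothetical helper: count the immediate subdirectories of a directory.
count_subdirs() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Hypothetical helper: count regular files anywhere below a directory.
count_files() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Assumption: fall back to the current directory if WORK_DIR is not set.
ALL_DIR="${WORK_DIR:-.}/20news-all"
if [ -d "$ALL_DIR" ]; then
    echo "newsgroup directories: $(count_subdirs "$ALL_DIR")"
    echo "documents: $(count_files "$ALL_DIR")"
else
    echo "not found: $ALL_DIR"
fi
```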

If you're running on a Hadoop cluster:

    $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
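
Whether the upload is needed depends on how Mahout is being run. The sketch below assumes a common convention: a cluster run is intended when `HADOOP_HOME` is set and `MAHOUT_LOCAL` is empty (`MAHOUT_LOCAL` is the environment variable that makes the `mahout` launcher run locally). `on_cluster` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed only when a Hadoop cluster run is intended.
# Assumption: HADOOP_HOME set and MAHOUT_LOCAL empty means "use the cluster".
on_cluster() {
    [ -n "${HADOOP_HOME:-}" ] && [ -z "${MAHOUT_LOCAL:-}" ]
}

if on_cluster; then
    # Same upload command as in the step above.
    hadoop dfs -put "${WORK_DIR}/20news-all" "${WORK_DIR}/20news-all"
else
    echo "local run: skipping the HDFS upload"
fi
```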

Convert the full 20newsgroups dataset into a <Text, Text> sequence file.

    $ mahout seqdirectory \
        -i ${WORK_DIR}/20news-all \
        -o ${WORK_DIR}/20news-seq -ow

Convert and preprocess the dataset into a <Text, VectorWritable> sequence file containing term frequencies for each document.
```
    $ mahout seq2sparse \
        -i ${WORK_DIR}/20news-seq \
        -o ${WORK_DIR}/20news-vectors \
        -lnorm \
        -nv \
        -wt tfidf
```
To use different parsing methods or transformations on the term frequency vectors, supply different options here, e.g. -ng 2 for bi-grams or -n 2 for L2 length normalization. See the "Creating vectors from text" page for a list of all seq2sparse options.
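
As an illustration, the sketch below prints (without running) a seq2sparse command line that adds the two options mentioned above. `seq2sparse_cmd` is a hypothetical helper name, and passing `-ng` and `-n` together with the original flags is an assumption made for the example, not something the official page prescribes.

```shell
# Hypothetical helper: print (not run) a seq2sparse command with extra options.
# -ng and -n are the options named in the text; the paths reuse the ones above.
seq2sparse_cmd() {
    ngram_size="$1"   # value for -ng (2 gives bi-grams)
    norm_power="$2"   # value for -n (2 selects L2 length normalization)
    printf 'mahout seq2sparse -i %s -o %s -lnorm -nv -wt tfidf -ng %s -n %s\n' \
        "${WORK_DIR}/20news-seq" "${WORK_DIR}/20news-vectors" \
        "$ngram_size" "$norm_power"
}

seq2sparse_cmd 2 2
```

Printing the command first makes it easy to review the flags before committing to a long vectorization run.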

Split the preprocessed dataset into training and testing sets.

    $ mahout split \
        -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
        --trainingOutput ${WORK_DIR}/20news-train-vectors \
        --testOutput ${WORK_DIR}/20news-test-vectors \
        --randomSelectionPct 40 \
        --overwrite --sequenceFiles -xm sequential
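
With `--randomSelectionPct 40`, roughly 40% of the vectors are held out as the test set and the rest are used for training. The arithmetic can be sketched as follows; `split_sizes` is a hypothetical helper, and the document count used is illustrative rather than the real dataset size.

```shell
# Hypothetical helper: estimate test/train sizes for a given hold-out percentage.
split_sizes() {
    total="$1"
    pct="$2"
    test_n=$(( total * pct / 100 ))   # integer arithmetic, rounds down
    train_n=$(( total - test_n ))
    echo "test=$test_n train=$train_n"
}

split_sizes 1000 40   # illustrative document count, not the real dataset size
```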

Train the classifier.

    $ mahout trainnb \
        -i ${WORK_DIR}/20news-train-vectors -el \
        -o ${WORK_DIR}/model \
        -li ${WORK_DIR}/labelindex \
        -ow \
        -c

Test the classifier.

    $ mahout testnb \
        -i ${WORK_DIR}/20news-test-vectors \
        -m ${WORK_DIR}/model \
        -l ${WORK_DIR}/labelindex \
        -ow \
        -o ${WORK_DIR}/20news-testing \
        -c