Mahout Text Classification Process

The following is the CBayes-based text classification process provided on the official Mahout site:

End-to-end commands to build a CBayes model for 20 Newsgroups:

The 20 newsgroups example script issues the commands outlined below. We can build a CBayes classifier from the command line by following the process in the script:

Be sure that MAHOUT_HOME/bin and HADOOP_HOME/bin are in your $PATH.

  1. Create a working directory for the dataset and all input/output.

        $ export WORK_DIR=/tmp/mahout-work-${USER}
        $ mkdir -p ${WORK_DIR}
    
  2. Download and extract the 20news-bydate.tar.gz from the 20newsgroups dataset to the working directory.

        $ curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz \
            -o ${WORK_DIR}/20news-bydate.tar.gz
        $ mkdir -p ${WORK_DIR}/20news-bydate
        $ cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
        $ mkdir ${WORK_DIR}/20news-all
        $ cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
    
    • If you're running on a Hadoop cluster:

      $ hadoop dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
      
  3. Convert the full 20newsgroups dataset into a < Text, Text > sequence file.

        $ mahout seqdirectory \
            -i ${WORK_DIR}/20news-all \
            -o ${WORK_DIR}/20news-seq -ow
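Conceptually, `seqdirectory` walks the input directory and writes one < Text, Text > record per file: the key is the document's path and the value is its raw contents. A minimal Python sketch of that mapping (plain tuples standing in for Hadoop sequence-file records; `to_pairs` is a hypothetical helper, not a Mahout API):

```python
import os
import tempfile

def to_pairs(root):
    """Collect (relative path, file contents) pairs, mimicking the
    <Text, Text> records that `mahout seqdirectory` writes: key is the
    document's path, value is its raw text."""
    pairs = []
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                key = "/" + os.path.relpath(path, root)
                pairs.append((key, f.read()))
    return pairs

# Toy demo: one "newsgroup post" in a temporary working directory.
with tempfile.TemporaryDirectory() as work:
    os.makedirs(os.path.join(work, "sci.space"))
    with open(os.path.join(work, "sci.space", "001"), "w") as f:
        f.write("the shuttle launch was delayed")
    pairs = to_pairs(work)

print(pairs[0][0])  # /sci.space/001
```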
    
  4. Convert and preprocess the dataset into a < Text, VectorWritable > sequence file containing term frequencies for each document.

        $ mahout seq2sparse \
            -i ${WORK_DIR}/20news-seq \
            -o ${WORK_DIR}/20news-vectors \
            -lnorm \
            -nv \
            -wt tfidf
    

    If we wanted to use different parsing methods or transformations on the term frequency vectors, we could supply different options here, e.g. -ng 2 for bi-grams or -n 2 for L2 length normalization. See the Creating Vectors from Text page for a list of all seq2sparse options.
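The weighting that `-wt tfidf` selects can be sketched with the classic formula weight(t, d) = tf(t, d) · log(N / df(t)). This is a simplified toy version — Mahout's actual implementation adds smoothing and the optional normalization controlled by `-n` / `-lnorm`:

```python
import math
from collections import Counter

def tfidf(docs):
    """Simplified TF-IDF in the spirit of `seq2sparse -wt tfidf`:
    weight(t, d) = tf(t, d) * log(N / df(t)). A sketch only; Mahout's
    exact smoothing and normalization differ."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc.split()))
    vectors = []
    for doc in docs:
        tf = Counter(doc.split())       # raw term frequency
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = ["nasa launch shuttle", "hockey game tonight", "nasa mars probe"]
vecs = tfidf(docs)
# "nasa" appears in 2 of 3 docs, so it gets a lower weight than "shuttle".
```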

  5. Split the preprocessed dataset into training and testing sets.

        $ mahout split \
            -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
            --trainingOutput ${WORK_DIR}/20news-train-vectors \
            --testOutput ${WORK_DIR}/20news-test-vectors \
            --randomSelectionPct 40 \
            --overwrite --sequenceFiles -xm sequential
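The effect of `--randomSelectionPct 40` is a random holdout: roughly 40% of the vectors go to the test set and the rest to training. A minimal sketch of that idea (Mahout decides per record; here we simply shuffle and slice, and `split` is a hypothetical helper name):

```python
import random

def split(items, test_pct, seed=42):
    """Random holdout split like `mahout split --randomSelectionPct 40`:
    test_pct percent of the items land in the test set, the rest in
    training. Shuffle-and-slice sketch, not Mahout's per-record draw."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * test_pct / 100)
    return shuffled[cut:], shuffled[:cut]   # (train, test)

train, test = split(range(100), 40)
print(len(train), len(test))  # 60 40
```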
    
  6. Train the classifier.

        $ mahout trainnb \
            -i ${WORK_DIR}/20news-train-vectors -el \
            -o ${WORK_DIR}/model \
            -li ${WORK_DIR}/labelindex \
            -ow \
            -c
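The `-c` flag selects the complement variant of naive Bayes: each class's term weights are estimated from the documents *not* in that class, which tends to be more robust on skewed class distributions. A toy sketch of that idea (raw token counts instead of TF-IDF vectors, no class prior; `train_cnb` and `classify` are hypothetical helpers, not Mahout's implementation):

```python
import math
from collections import Counter, defaultdict

def train_cnb(docs, labels, alpha=1.0):
    """Complement naive Bayes sketch: for each class, weight terms by the
    smoothed log-probability of the term in the COMPLEMENT of the class."""
    vocab = set()
    counts = defaultdict(Counter)          # per-class term counts
    for doc, y in zip(docs, labels):
        tf = Counter(doc.split())
        counts[y].update(tf)
        vocab.update(tf)
    total = Counter()
    for c in counts:
        total.update(counts[c])
    weights = {}
    for c in counts:
        comp = total - counts[c]           # term counts of all OTHER classes
        denom = sum(comp.values()) + alpha * len(vocab)
        weights[c] = {t: math.log((comp[t] + alpha) / denom) for t in vocab}
    return weights

def classify(weights, doc):
    tf = Counter(doc.split())
    # Pick the class whose complement fits the document WORST,
    # i.e. the class with the smallest complement score.
    return min(weights, key=lambda c: sum(n * weights[c].get(t, 0.0)
                                          for t, n in tf.items()))

docs = ["puck ice goal", "goal ice skate", "orbit rocket nasa", "nasa rocket launch"]
labels = ["hockey", "hockey", "space", "space"]
weights = train_cnb(docs, labels)
print(classify(weights, "rocket orbit nasa"))  # space
```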
    
  7. Test the classifier.

        $ mahout testnb \
            -i ${WORK_DIR}/20news-test-vectors \
            -m ${WORK_DIR}/model \
            -l ${WORK_DIR}/labelindex \
            -ow \
            -o ${WORK_DIR}/20news-testing \
            -c
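When `testnb` finishes it prints a summary: overall accuracy plus a confusion matrix of (actual, predicted) label pairs. The bookkeeping behind that summary can be sketched as follows (`summarize` is a hypothetical helper, and the output format is not Mahout's):

```python
from collections import Counter

def summarize(actual, predicted):
    """Tally accuracy and a confusion matrix of (actual, predicted)
    label pairs, in the spirit of the summary `testnb` prints."""
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), confusion

actual    = ["hockey", "hockey", "space", "space", "space"]
predicted = ["hockey", "space",  "space", "space", "hockey"]
acc, conf = summarize(actual, predicted)
print(f"accuracy = {acc:.0%}")  # accuracy = 60%
```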