Learning the Bayes Algorithm in Mahout (Part 1)

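The formal definition of naive Bayes classification is as follows:

      1. Let $x = \{a_1, a_2, \ldots, a_m\}$ be an item to be classified, where each $a_j$ is a feature attribute of $x$.

      2. There is a set of categories $C = \{y_1, y_2, \ldots, y_n\}$.

      3. Compute $P(y_1 \mid x), P(y_2 \mid x), \ldots, P(y_n \mid x)$.

      4. If $P(y_k \mid x) = \max\{P(y_1 \mid x), P(y_2 \mid x), \ldots, P(y_n \mid x)\}$, then $x \in y_k$.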

The key question, then, is how to compute each of the conditional probabilities in step 3. For the theory in detail, see this article:

http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html
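
For reference, step 3 is usually handled with Bayes' theorem plus the conditional-independence assumption that gives the method its name. The derivation below is the standard textbook form, not something specific to Mahout; it writes the item to classify as $x = \{a_1, \ldots, a_m\}$ and the categories as $y_1, \ldots, y_n$.

```latex
P(y_i \mid x) = \frac{P(x \mid y_i)\,P(y_i)}{P(x)},
\qquad
P(x \mid y_i)\,P(y_i) = P(y_i) \prod_{j=1}^{m} P(a_j \mid y_i)
```

Because $P(x)$ is the same for every category, it is enough to compare the numerators. $P(y_i)$ and each $P(a_j \mid y_i)$ are estimated from frequencies in the training set.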

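With the idea behind the algorithm understood, we can start calling the Bayes implementation that Mahout provides. There are two main ways to call it: running it from the Linux command line, and invoking it from Java code in Eclipse (this article uses Mahout 0.7).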

Let's start with the Linux command line. Here we use an example that ships with Mahout, classify-20newsgroups.sh, which classifies newsgroup posts.

The bin folder that accompanies mahout-examples.jar (examples/bin in the Mahout distribution) contains classify-20newsgroups.sh; its contents are as follows:

if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
  echo "This script runs SGD and Bayes classifiers over the classic 20 News Groups."
  exit
fi


SCRIPT_PATH=${0%/*}
if [ "$0" != "$SCRIPT_PATH" ] && [ "$SCRIPT_PATH" != "" ]; then
  cd $SCRIPT_PATH
fi
START_PATH=`pwd`


if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
  HADOOP="$HADOOP_HOME/bin/hadoop"
  if [ ! -e $HADOOP ]; then
    echo "Can't find hadoop in $HADOOP, exiting"
    exit 1
  fi
fi

# Select the task to run: cnaivebayes (Complement Naive Bayes), naivebayes (Naive Bayes), sgd (stochastic gradient descent), or clean (clear the work directory)
WORK_DIR=/tmp/mahout-work-${USER}
algorithm=( cnaivebayes naivebayes sgd clean)
if [ -n "$1" ]; then
  choice=$1
else
  echo "Please select a number to choose the corresponding task to run"
  echo "1. ${algorithm[0]}"
  echo "2. ${algorithm[1]}"
  echo "3. ${algorithm[2]}"
  echo "4. ${algorithm[3]} -- cleans up the work area in $WORK_DIR"
  read -p "Enter your choice : " choice
fi


echo "ok. You chose $choice and we'll use ${algorithm[$choice-1]}"
alg=${algorithm[$choice-1]}


if [ "x$alg" != "xclean" ]; then
  echo "creating work directory at ${WORK_DIR}"

  # Build the directory layout for the test data
  mkdir -p ${WORK_DIR}
  if [ ! -e ${WORK_DIR}/20news-bayesinput ]; then
    if [ ! -e ${WORK_DIR}/20news-bydate ]; then
      if [ ! -f ${WORK_DIR}/20news-bydate.tar.gz ]; then

        # If the data is not available locally, download it
        echo "Downloading 20news-bydate"
        curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
      fi
      mkdir -p ${WORK_DIR}/20news-bydate

      # Extract the data
      echo "Extracting..."
      cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
    fi
  fi
fi
#echo $START_PATH
cd $START_PATH
cd ../..


set -e


if [ "x$alg" == "xnaivebayes"  -o  "x$alg" == "xcnaivebayes" ]; then
  c=""


  if [ "x$alg" == "xcnaivebayes" ]; then
    c=" -c"
  fi


  set -x
  echo "Preparing 20newsgroups data"
  rm -rf ${WORK_DIR}/20news-all
  mkdir ${WORK_DIR}/20news-all
  cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all

  # Upload the local data to HDFS
  if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
    echo "Copying 20newsgroups data to HDFS"
    set +e
    $HADOOP dfs -rmr ${WORK_DIR}/20news-all
    set -e
    $HADOOP dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
  fi

  # Convert the data to sequence files
  echo "Creating sequence files from 20newsgroups data"
  ./bin/mahout seqdirectory \
    -i ${WORK_DIR}/20news-all \
    -o ${WORK_DIR}/20news-seq -ow

  # Convert the sequence files to vectors
  echo "Converting sequence files to vectors"
  ./bin/mahout seq2sparse \
    -i ${WORK_DIR}/20news-seq \
    -o ${WORK_DIR}/20news-vectors  -lnorm -nv  -wt tfidf

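  # Randomly split the vector dataset into training data and test data.
  # The message below says 80-20, but the command passes --randomSelectionPct 40.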
  echo "Creating training and holdout set with a random 80-20 split of the generated vector dataset"
  ./bin/mahout split \
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors  \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

  # Train the model
  echo "Training Naive Bayes model"
  ./bin/mahout trainnb \
    -i ${WORK_DIR}/20news-train-vectors -el \
    -o ${WORK_DIR}/model \
    -li ${WORK_DIR}/labelindex \
    -ow $c


  echo "Self testing on training set"


  ./bin/mahout testnb \
    -i ${WORK_DIR}/20news-train-vectors\
    -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex \
    -ow -o ${WORK_DIR}/20news-testing $c


  echo "Testing on holdout set"


  ./bin/mahout testnb \
    -i ${WORK_DIR}/20news-test-vectors\
    -m ${WORK_DIR}/model \
    -l ${WORK_DIR}/labelindex \
    -ow -o ${WORK_DIR}/20news-testing $c


elif [ "x$alg" == "xsgd" ]; then
  if [ ! -e "/tmp/news-group.model" ]; then
    echo "Training on ${WORK_DIR}/20news-bydate/20news-bydate-train/"
    ./bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
  fi
  echo "Testing on ${WORK_DIR}/20news-bydate/20news-bydate-test/ with model: /tmp/news-group.model"
  ./bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input ${WORK_DIR}/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model
elif [ "x$alg" == "xclean" ]; then
  rm -rf ${WORK_DIR}
  rm -rf /tmp/news-group.model
fi
# Remove the work directory
#
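
The menu handling and the per-task pipeline in the script above can be summarized in a small dry-run sketch. It only prints the names of the steps and never calls Mahout or Hadoop. `resolve_task` and `plan_steps` are hypothetical helper names introduced here, and the range check on the choice is an addition that the original script does not perform.

```bash
#!/usr/bin/env bash
# Dry-run sketch: mirrors the menu and branch logic of classify-20newsgroups.sh.
# resolve_task and plan_steps are hypothetical helpers, not part of Mahout.

algorithm=(cnaivebayes naivebayes sgd clean)

# Map a 1-based menu choice to a task name.
# Assumption: choices outside 1..4 are rejected (the original does not check).
resolve_task() {
  local choice=$1
  case "$choice" in
    1|2|3|4) echo "${algorithm[$((choice - 1))]}" ;;
    *) echo "invalid choice: $choice" >&2; return 1 ;;
  esac
}

# Print, in order, the steps the script runs for a given task name.
plan_steps() {
  local alg=$1
  local c=""
  case "$alg" in
    naivebayes|cnaivebayes)
      # The complement variant adds -c to both training and testing.
      if [ "$alg" = "cnaivebayes" ]; then
        c=" -c"
      fi
      echo "seqdirectory"
      echo "seq2sparse"
      echo "split"
      echo "trainnb$c"
      echo "testnb$c (training set)"
      echo "testnb$c (holdout set)"
      ;;
    sgd)
      echo "TrainNewsGroups"
      echo "TestNewsGroups"
      ;;
    clean)
      echo "remove work directory and SGD model"
      ;;
  esac
}

plan_steps "$(resolve_task 1)"
```

Run as written, this lists the six steps for choice 1 (cnaivebayes), with -c appended to trainnb and testnb. The real script reads the same number from its first argument, so passing 2 selects naivebayes without showing the menu.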


