The formal definition of Naive Bayes classification is as follows:
1. Let $x = \{a_1, a_2, \ldots, a_m\}$ be an item to be classified, where each $a_i$ is a feature attribute of $x$.
2. Let $C = \{y_1, y_2, \ldots, y_n\}$ be the set of classes.
3. Compute $P(y_1|x), P(y_2|x), \ldots, P(y_n|x)$.
4. If $P(y_k|x) = \max\{P(y_1|x), P(y_2|x), \ldots, P(y_n|x)\}$, then $x \in y_k$.
The key question is then how to compute the conditional probabilities in step 3. For the underlying theory in detail, see this article:
http://www.cnblogs.com/leoo2sk/archive/2010/09/17/naive-bayesian-classifier.html
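To spell out how step 3 is computed (this is the standard textbook derivation, not anything Mahout-specific): by Bayes' theorem,

$$P(y_i|x) = \frac{P(x|y_i)\,P(y_i)}{P(x)}$$

and since $P(x)$ is the same for every class, it suffices to maximize the numerator. Naive Bayes additionally assumes that the feature attributes are conditionally independent given the class, so

$$P(x|y_i)\,P(y_i) = P(y_i)\,\prod_{j=1}^{m} P(a_j|y_i)$$

Both $P(y_i)$ and each $P(a_j|y_i)$ can be estimated from frequencies in the training set, which is exactly what the training step in the script below performs.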
With the idea of the algorithm in mind, we can now call the Bayes implementation that Mahout provides. There are two main ways to run it: from the Linux command line, and from Java code in Eclipse (this article uses Mahout 0.7).
Let's start with the Linux command-line approach, using the example script that Mahout ships, classify-20newsgroups.sh, which classifies the 20 Newsgroups dataset.
The script lives in the examples/bin directory of the Mahout distribution; its content is as follows:
if [ "$1" = "--help" ] || [ "$1" = "--?" ]; then
echo "This script runs SGD and Bayes classifiers over the classic 20 News Groups."
exit
fi
SCRIPT_PATH=${0%/*}
if [ "$0" != "$SCRIPT_PATH" ] && [ "$SCRIPT_PATH" != "" ]; then
cd $SCRIPT_PATH
fi
START_PATH=`pwd`
if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
HADOOP="$HADOOP_HOME/bin/hadoop"
if [ ! -e $HADOOP ]; then
echo "Can't find hadoop in $HADOOP, exiting"
exit 1
fi
fi
# Choose the task to run: cnaivebayes (complementary naive Bayes), naivebayes (standard naive Bayes), sgd (stochastic gradient descent), or clean (clear the work directory)
WORK_DIR=/tmp/mahout-work-${USER}
algorithm=( cnaivebayes naivebayes sgd clean)
if [ -n "$1" ]; then
  choice=$1
else
  echo "Please select a number to choose the corresponding task to run"
  echo "1. ${algorithm[0]}"
  echo "2. ${algorithm[1]}"
  echo "3. ${algorithm[2]}"
  echo "4. ${algorithm[3]} -- cleans up the work area in $WORK_DIR"
  read -p "Enter your choice : " choice
fi

echo "ok. You chose $choice and we'll use ${algorithm[$choice-1]}"
alg=${algorithm[$choice-1]}

if [ "x$alg" != "xclean" ]; then
  echo "creating work directory at ${WORK_DIR}"
  # Build the working directory that will hold the test data
  mkdir -p ${WORK_DIR}
  if [ ! -e ${WORK_DIR}/20news-bayesinput ]; then
    if [ ! -e ${WORK_DIR}/20news-bydate ]; then
      if [ ! -f ${WORK_DIR}/20news-bydate.tar.gz ]; then
        # If the dataset is not present locally, download it
echo "Downloading 20news-bydate"
curl http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz -o ${WORK_DIR}/20news-bydate.tar.gz
fi
mkdir -p ${WORK_DIR}/20news-bydate
      # Extract the archive
echo "Extracting..."
cd ${WORK_DIR}/20news-bydate && tar xzf ../20news-bydate.tar.gz && cd .. && cd ..
fi
fi
fi
#echo $START_PATH
cd $START_PATH
cd ../..
set -e
if [ "x$alg" == "xnaivebayes" -o "x$alg" == "xcnaivebayes" ]; then
c=""
if [ "x$alg" == "xcnaivebayes" ]; then
c=" -c"
fi
set -x
echo "Preparing 20newsgroups data"
rm -rf ${WORK_DIR}/20news-all
mkdir ${WORK_DIR}/20news-all
cp -R ${WORK_DIR}/20news-bydate/*/* ${WORK_DIR}/20news-all
  # Upload the local data to HDFS
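  # Note: this copy only happens when running against a Hadoop cluster; if MAHOUT_LOCAL is set,
  # Mahout works on the local filesystem and the whole block below is skipped.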
if [ "$HADOOP_HOME" != "" ] && [ "$MAHOUT_LOCAL" == "" ] ; then
echo "Copying 20newsgroups data to HDFS"
set +e
$HADOOP dfs -rmr ${WORK_DIR}/20news-all
set -e
$HADOOP dfs -put ${WORK_DIR}/20news-all ${WORK_DIR}/20news-all
fi
  # Serialize the data
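  # seqdirectory walks the input directory tree and writes one SequenceFile entry per document,
  # keyed by its relative path; the path contains the newsgroup name, which later serves as the label.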
echo "Creating sequence files from 20newsgroups data"
./bin/mahout seqdirectory \
-i ${WORK_DIR}/20news-all \
-o ${WORK_DIR}/20news-seq -ow
  # Convert the sequence files to vectors
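  # -wt tfidf weights terms by TF-IDF, -lnorm applies log normalization,
  # and -nv emits named vectors so results can be traced back to documents.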
echo "Converting sequence files to vectors"
./bin/mahout seq2sparse \
-i ${WORK_DIR}/20news-seq \
-o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf
  # Split the vector dataset into a training set and a holdout (test) set
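  # Note: despite the "80-20" message below, --randomSelectionPct 40 actually holds out 40% of the
  # vectors for testing; -xm sequential runs the split locally rather than as a MapReduce job.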
echo "Creating training and holdout set with a random 80-20 split of the generated vector dataset"
./bin/mahout split \
-i ${WORK_DIR}/20news-vectors/tfidf-vectors \
--trainingOutput ${WORK_DIR}/20news-train-vectors \
--testOutput ${WORK_DIR}/20news-test-vectors \
--randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
  # Train the model
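  # -el extracts the class labels from the input keys, -li says where to write the label index,
  # and $c appends -c when the complementary naive Bayes variant was chosen.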
echo "Training Naive Bayes model"
./bin/mahout trainnb \
-i ${WORK_DIR}/20news-train-vectors -el \
-o ${WORK_DIR}/model \
-li ${WORK_DIR}/labelindex \
-ow $c
echo "Self testing on training set"
./bin/mahout testnb \
-i ${WORK_DIR}/20news-train-vectors\
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c
echo "Testing on holdout set"
./bin/mahout testnb \
-i ${WORK_DIR}/20news-test-vectors\
-m ${WORK_DIR}/model \
-l ${WORK_DIR}/labelindex \
-ow -o ${WORK_DIR}/20news-testing $c
elif [ "x$alg" == "xsgd" ]; then
if [ ! -e "/tmp/news-group.model" ]; then
echo "Training on ${WORK_DIR}/20news-bydate/20news-bydate-train/"
./bin/mahout org.apache.mahout.classifier.sgd.TrainNewsGroups ${WORK_DIR}/20news-bydate/20news-bydate-train/
fi
echo "Testing on ${WORK_DIR}/20news-bydate/20news-bydate-test/ with model: /tmp/news-group.model"
./bin/mahout org.apache.mahout.classifier.sgd.TestNewsGroups --input ${WORK_DIR}/20news-bydate/20news-bydate-test/ --model /tmp/news-group.model
elif [ "x$alg" == "xclean" ]; then
rm -rf ${WORK_DIR}
rm -rf /tmp/news-group.model
fi
# Remove the work directory
#
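With the script in place, a typical run of the naive Bayes path looks like the following sketch; the install location is an assumption, so substitute wherever you unpacked the Mahout 0.7 distribution:

# Run from the root of the Mahout 0.7 distribution (assumed location)
cd /usr/local/mahout-distribution-0.7
./examples/bin/classify-20newsgroups.sh
# At the prompt, enter 2 for naivebayes; or skip the prompt by passing the choice as an argument:
./examples/bin/classify-20newsgroups.sh 2

When testnb finishes, it prints a summary with the overall accuracy and a per-newsgroup confusion matrix; the holdout-set numbers, not the self-test on the training set, are the ones that reflect real generalization.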