Step 1:
Download the 20 Newsgroups dataset from:
http://qwone.com/~jason/20Newsgroups/
Dataset file: 20news-19997.tar.gz
Step 2:
On the node where Mahout is installed, create a working directory and change into it:
mkdir -p /opt/apps/mahout/apache-mahout-distribution-0.10.2/test
cd /opt/apps/mahout/apache-mahout-distribution-0.10.2/test
Step 3:
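Use Xftp to upload the downloaded archive to the directory above, then extract it there. Step 8 expects the extracted 20_newsgroups directory.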
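Step 8 uploads a directory named 20_newsgroups, so the archive has to be unpacked after the transfer. A minimal sketch of that step, assuming the archive sits in the working directory from Step 2. The helper names unpack_dataset and count_categories are made up for illustration. A complete extraction should report 20 category directories, one per newsgroup.

```shell
#!/usr/bin/env bash
# Assumption: the archive was uploaded to the working directory from Step 2.
TEST_DIR=/opt/apps/mahout/apache-mahout-distribution-0.10.2/test

# Count the immediate subdirectories of a directory (one per newsgroup).
count_categories() {
  find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# Unpack the archive in place and report how many category directories exist.
unpack_dataset() {
  tar -xzf "$1/20news-19997.tar.gz" -C "$1" &&
    echo "categories: $(count_categories "$1/20_newsgroups")"
}

# Usage (hypothetical invocation):
#   unpack_dataset "$TEST_DIR"
```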
Step 4:
On all three nodes, start ZooKeeper and check its status:
zkServer.sh start
zkServer.sh status
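In a three-node ensemble, zkServer.sh status should report one leader and two followers once a quorum has formed. The sketch below pulls the role out of the "Mode:" line of that output. zk_mode is a hypothetical helper name, not part of ZooKeeper.

```shell
#!/usr/bin/env bash
# Print the role reported by `zkServer.sh status`
# (leader, follower, or standalone), read from stdin.
zk_mode() {
  sed -n 's/^Mode:[[:space:]]*//p'
}

# Usage on each node:
#   zkServer.sh status 2>&1 | zk_mode
```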
Step 5:
On node11, start HDFS and YARN:
start-all.sh
On node12, start the YARN ResourceManager:
yarn-daemon.sh start resourcemanager
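After the start scripts return, jps on each node shows which Java daemons are actually running. A small sketch that lists expected daemons missing from jps output follows. missing_daemons is a hypothetical helper, and the daemon names passed to it are assumptions about how this cluster is laid out.

```shell
#!/usr/bin/env bash
# Read jps-style lines ("<pid> <name>") on stdin and print every
# expected daemon name (given as arguments) that is not running.
missing_daemons() {
  local running name
  running=$(awk '{print $2}')
  for name in "$@"; do
    printf '%s\n' "$running" | grep -qx "$name" || echo "$name"
  done
}

# Usage on node11 (daemon list is an assumption, adjust per node):
#   jps | missing_daemons NameNode ResourceManager QuorumPeerMain
```

Empty output means every listed daemon was found.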
Step 6:
Open a browser and go to the following addresses to view the HDFS web UI:
192.168.80.11:50070
192.168.80.12:50070
Open a browser and go to the following addresses to view the YARN web UI:
192.168.80.11:8088
192.168.80.12:8088
Step 7:
On node11, create the HDFS input directory and confirm that it exists:
hadoop fs -mkdir -p /ml/classification/bayesian/input
hadoop fs -ls /ml/classification/bayesian/input
Step 8:
On node11, copy the dataset directory that was transferred to Linux with Xftp into HDFS, then list the target directory:
hadoop fs -put /opt/apps/mahout/apache-mahout-distribution-0.10.2/test/20_newsgroups /ml/classification/bayesian/input
hadoop fs -ls /ml/classification/bayesian/input
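Listing the target directory only shows that 20_newsgroups arrived. To check that every file was copied, the local file count can be compared with the FILE_COUNT column of hadoop fs -count, whose output columns are DIR_COUNT, FILE_COUNT, CONTENT_SIZE, and PATHNAME. The helper names below are hypothetical.

```shell
#!/usr/bin/env bash
LOCAL_DIR=/opt/apps/mahout/apache-mahout-distribution-0.10.2/test/20_newsgroups
HDFS_DIR=/ml/classification/bayesian/input/20_newsgroups

# Count regular files under a local directory, recursively.
local_file_count() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Extract FILE_COUNT (second column) from `hadoop fs -count` output on stdin.
hdfs_file_count() {
  awk '{print $2}'
}

# Usage:
#   [ "$(local_file_count "$LOCAL_DIR")" = "$(hadoop fs -count "$HDFS_DIR" | hdfs_file_count)" ] \
#     && echo "upload complete" || echo "file counts differ"
```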
Step 9:
On node11, convert the 20 Newsgroups data into sequence files:
mahout seqdirectory -i /ml/classification/bayesian/input/20_newsgroups -o /ml/classification/bayesian/input/20_newsgroups_seq
Step 10:
On node11, convert the sequence files into TF-IDF vectors:
mahout seq2sparse -i /ml/classification/bayesian/input/20_newsgroups_seq -o /ml/classification/bayesian/input/20_newsgroups_vectors -lnorm -nv -wt tfidf
Step 11:
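On node11, split the vectors into a training set and a test set (--randomSelectionPct 40 holds out roughly 40% of the vectors for testing):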
mahout split -i /ml/classification/bayesian/input/20_newsgroups_vectors/tfidf-vectors --trainingOutput /ml/classification/bayesian/input/20_train_vectors --testOutput /ml/classification/bayesian/input/20_test_vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
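The --randomSelectionPct option gives the percentage of vectors held out as test data, so a value of 40 leaves about 60% for training. Because the selection is random, the actual counts vary slightly from run to run. Below is a quick shell calculation of the expected sizes, assuming all 19997 documents named in the archive survive vectorization. split_sizes is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print the expected test and training set sizes
# for a total item count and a test percentage.
split_sizes() {
  local total=$1 pct=$2
  local test_n=$(( total * pct / 100 ))   # integer division rounds down
  echo "test=$test_n train=$(( total - test_n ))"
}

split_sizes 19997 40   # prints: test=7998 train=11999
```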
Step 12:
On node11, train the naive Bayes model:
mahout trainnb -i /ml/classification/bayesian/input/20_train_vectors -o /ml/classification/bayesian/input/model -li /ml/classification/bayesian/input/labelindex
Step 13:
On node11, test and evaluate the naive Bayes model on the held-out test vectors from Step 11:
mahout testnb -i /ml/classification/bayesian/input/20_test_vectors -m /ml/classification/bayesian/input/model -l /ml/classification/bayesian/input/labelindex -ow -o /ml/classification/bayesian/input/20news-testing
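Steps 9 through 13 repeat the same HDFS prefix many times. One way to keep them consistent is a small wrapper script. This is a sketch, not a tested job script: BASE, run, pipeline, and the DRY_RUN switch are names introduced here. The mahout options are those of Steps 9 to 13, with evaluation reading 20_test_vectors, the held-out split produced in Step 11.

```shell
#!/usr/bin/env bash
BASE=/ml/classification/bayesian/input

# Echo the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Steps 9 to 13 in order; the chain stops at the first failing command.
pipeline() {
  run mahout seqdirectory -i "$BASE/20_newsgroups" -o "$BASE/20_newsgroups_seq" &&
  run mahout seq2sparse -i "$BASE/20_newsgroups_seq" -o "$BASE/20_newsgroups_vectors" -lnorm -nv -wt tfidf &&
  run mahout split -i "$BASE/20_newsgroups_vectors/tfidf-vectors" \
      --trainingOutput "$BASE/20_train_vectors" --testOutput "$BASE/20_test_vectors" \
      --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential &&
  run mahout trainnb -i "$BASE/20_train_vectors" -o "$BASE/model" -li "$BASE/labelindex" &&
  run mahout testnb -i "$BASE/20_test_vectors" -m "$BASE/model" -l "$BASE/labelindex" \
      -ow -o "$BASE/20news-testing"
}

# Usage:
#   DRY_RUN=1 pipeline   # print the five commands without running them
#   pipeline             # run them on the cluster
```

Running with DRY_RUN=1 first makes it easy to compare the generated commands against the steps before anything touches the cluster.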