1. mahout seqdirectory
$ mahout seqdirectory
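--input (-i) input Path to job input directory (the raw text files).
--output (-o) output The directory pathname for output (a <Text, Text> SequenceFile).
--overwrite (-ow) If set, overwrite the output directory.
Purpose: Convert the raw text dataset into a <Text, Text> SequenceFile.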
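As a rough illustration of the options above, the sketch below assembles a seqdirectory call and prints it. It only executes the command when RUN=1 is set and mahout is on the PATH. The WORK_DIR default and the 20news-all and 20news-seq directory names are assumptions made for this example, not paths taken from these notes.

```shell
# Hypothetical working directory; override it by exporting WORK_DIR.
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"
# Assumed locations of the raw text files and of the SequenceFile output.
RAW_DIR="${WORK_DIR}/20news-all"
SEQ_DIR="${WORK_DIR}/20news-seq"

# Build the command as positional parameters so each argument stays one word.
set -- mahout seqdirectory -i "${RAW_DIR}" -o "${SEQ_DIR}" -ow
SEQDIR_CMD="$*"
echo "${SEQDIR_CMD}"

# Dry run by default: execute only when RUN=1 and mahout is installed.
if [ "${RUN:-0}" = "1" ] && command -v mahout >/dev/null 2>&1; then
  "$@"
fi
```

Printing the command first makes it easy to check the paths before a long-running job is started.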
2. mahout seq2sparse
Purpose: Convert and preprocess the dataset (<Text, Text> SequenceFile) into a <Text, VectorWritable> SequenceFile containing term frequencies for each document.
That is, convert the SequenceFile into TF-IDF vector files.
Note: To use different parsing methods or transformations on the term frequency vectors, supply different options here, e.g. -ng 2 for bigrams or -n 2 for L2 length normalization.
$ mahout seq2sparse
--output (-o) output The directory pathname for output.
--input (-i) input Path to job input directory.
--weight (-wt) weight The kind of weight to use. Currently TF or TFIDF. Default: TFIDF
--norm (-n) norm The norm to use, expressed as either a float or "INF" if you want to use the Infinite norm. Must be greater or equal to 0. The default is not to normalize
--overwrite (-ow) If set, overwrite the output directory
--sequentialAccessVector (-seq) (Optional) Whether output vectors should be SequentialAccessVectors. If set true else false
--namedVector (-nv) (Optional) Whether output vectors should be NamedVectors. If set true else false
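-i Directory containing the SequenceFile input
-o Output directory for the vector files
-wt Weight type; TF and TFIDF are supported, default TFIDF
-n Normalization to use, given as a float or "INF"; no normalization by default
-ow If specified, overwrite the existing output directory
-seq If specified, the output vectors are SequentialAccessVectors
-nv If specified, the output vectors are NamedVectors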
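The options above can be combined as in the following sketch, which requests TF-IDF weights, L2 normalization and named vectors, and optionally adds bigrams. It uses only flags that appear in these notes. The WORK_DIR default, the 20news-seq input directory and the BIGRAMS and RUN switches are assumptions introduced for the example; the 20news-vectors output name is chosen so that it lines up with the -i path used by mahout split.

```shell
# Hypothetical working directory; override it by exporting WORK_DIR.
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"
SEQ_DIR="${WORK_DIR}/20news-seq"      # assumed <Text, Text> SequenceFile input
VEC_DIR="${WORK_DIR}/20news-vectors"  # output; split reads its tfidf-vectors subdirectory

# TF-IDF weights (-wt), L2 norm (-n 2), named vectors (-nv), overwrite (-ow).
set -- mahout seq2sparse -i "${SEQ_DIR}" -o "${VEC_DIR}" -wt tfidf -n 2 -nv -ow

# Optional bigrams: BIGRAMS is a switch made up for this sketch.
if [ "${BIGRAMS:-0}" = "1" ]; then
  set -- "$@" -ng 2
fi

SEQ2SPARSE_CMD="$*"
echo "${SEQ2SPARSE_CMD}"

# Dry run by default: execute only when RUN=1 and mahout is installed.
if [ "${RUN:-0}" = "1" ] && command -v mahout >/dev/null 2>&1; then
  "$@"
fi
```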
3. mahout split
Purpose: Split the preprocessed dataset (the TF-IDF vectors) into training and testing sets.
$ mahout split \
    -i ${WORK_DIR}/20news-vectors/tfidf-vectors \
    --trainingOutput ${WORK_DIR}/20news-train-vectors \
    --testOutput ${WORK_DIR}/20news-test-vectors \
    --randomSelectionPct 40 \
    --overwrite --sequenceFiles -xm sequential
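Note: The command above splits the vector dataset into training and test data with a random 40/60 split: --randomSelectionPct 40 sends about 40% of the vectors to the test set and leaves the rest for training.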
4. mahout trainnb
Purpose: Train the classifier.
$ mahout trainnb
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--alphaI (-a) alphaI Smoothing parameter. Default is 1.0
--trainComplementary (-c) Train complementary? Default is false.
--labelIndex (-li) labelIndex The path to store the label index in
--overwrite (-ow) If present, overwrite the output directory before running job
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
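-i Input path
-o Output path
-a Smoothing parameter, default 1.0
-c Train the complementary model (default: false)
-li Path where the label index is stored
-ow If specified, overwrite the output directory before running the job
--tempDir Intermediate output directory of the MapReduce job
--startPhase First phase to run
--endPhase Last phase to run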
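A training call built from the options above might look like the sketch below. It reads the training vectors written by mahout split and uses only flags listed in these notes. The WORK_DIR default, the model and labelindex locations, and the COMPLEMENTARY and RUN switches are assumptions introduced for illustration.

```shell
# Hypothetical working directory; override it by exporting WORK_DIR.
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"
TRAIN_DIR="${WORK_DIR}/20news-train-vectors"  # the --trainingOutput of mahout split
MODEL_DIR="${WORK_DIR}/model"                 # assumed model output directory
LABEL_INDEX="${WORK_DIR}/labelindex"          # assumed label index path

set -- mahout trainnb -i "${TRAIN_DIR}" -o "${MODEL_DIR}" -li "${LABEL_INDEX}" -ow

# Optional complementary training: COMPLEMENTARY is a switch made up for this sketch.
if [ "${COMPLEMENTARY:-0}" = "1" ]; then
  set -- "$@" -c
fi

TRAINNB_CMD="$*"
echo "${TRAINNB_CMD}"

# Dry run by default: execute only when RUN=1 and mahout is installed.
if [ "${RUN:-0}" = "1" ] && command -v mahout >/dev/null 2>&1; then
  "$@"
fi
```

If the model is trained with -c, the test step would presumably need its own -c flag as well, so that the model and the test agree.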
5. mahout testnb
Purpose: Test the Bayes classifier.
$ mahout testnb
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--overwrite (-ow) If present, overwrite the output directory before running job
--model (-m) model The path to the model built during training
--testComplementary (-c) Test complementary? Default is false.
--runSequential (-seq) Run sequential?
--labelIndex (-l) labelIndex The path to the location of the label index
--help (-h) Print out help
--tempDir tempDir Intermediate output directory
--startPhase startPhase First phase to run
--endPhase endPhase Last phase to run
-i Input path
-o Output path
-ow Overwrite the output directory
-c Test with the complementary model (default: false)