After getting Mahout's bundled examples to run on Hadoop, I tried running Mahout's k-means algorithm and hit a couple of problems along the way.
First, the familiar ClassNotFoundException again.
This can be fixed with the same method as in the earlier post (一): copy the .jar files under Mahout's lib directory into Hadoop's common directory.
Another approach that seems like it ought to work is to add Mahout's lib directory (and related paths) to the PATH variable, but in my test that did not appear to help; this still needs more investigation, or an explanation from someone more experienced.
Second, the input data for Mahout's algorithms has to be preprocessed: the plain-text data must be serialized, i.e. converted into a sequence file of vectors. The following job does that conversion.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.VectorWritable;

public class Text2VectorWritable extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new Text2VectorWritable(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(arg0) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        Configuration conf = getConf();
        // Set job information: plain text in, SequenceFile of <LongWritable, VectorWritable> out.
        Job job = new Job(conf, "text2vectorwritable with input " + input.getName());
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(Text2VectorWritableMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(VectorWritable.class);
        job.setReducerClass(Text2VectorWritableReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);
        job.setJarByClass(Text2VectorWritable.class);
        FileInputFormat.addInputPath(job, input);
        SequenceFileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true)) {
            throw new InterruptedException("text2vectorwritable job failed processing " + input);
        }
        return 0;
    }

    /**
     * Mapper: main procedure; parses one whitespace-separated line into a Mahout vector.
     */
    public static class Text2VectorWritableMapper
            extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] str = value.toString().trim().split("\\s+");
            RandomAccessSparseVector vector = new RandomAccessSparseVector(str.length);
            // Put the parsed values into the vector.
            for (int i = 0; i < str.length; i++) {
                vector.set(i, Double.parseDouble(str[i]));
            }
            VectorWritable va = new VectorWritable(vector);
            context.write(key, va);
        }
    }

    /**
     * Reducer: does nothing but write the vectors out to the SequenceFile.
     */
    public static class Text2VectorWritableReducer
            extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {
        @Override
        public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context)
                throws IOException, InterruptedException {
            for (VectorWritable v : values) {
                context.write(key, v);
            }
        }
    }
}
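Incidentally, to spot-check the vectorized output without writing another MapReduce job, the sequence file can be read back directly with Hadoop's SequenceFile.Reader. The sketch below is only a minimal local example: the class name DumpVectors and the path vector-output/part-r-00000 are my own placeholders, not part of the original job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public class DumpVectors {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder path: one part file produced by the Text2VectorWritable job above.
        Path path = new Path("vector-output/part-r-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        LongWritable key = new LongWritable();
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
            // asFormatString() renders the Mahout vector in a readable text form.
            System.out.println(key.get() + "\t" + value.get().asFormatString());
        }
        reader.close();
    }
}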
Note that the output of the k-means run itself is likewise a sequence file; to view it as plain text it has to be deserialized.
When I ran the deserialization MapReduce program from the reference material, there was a problem with the map parameters: the book declares the map input key type as Text, but at runtime this throws a class-cast error, presumably because the actual key type in the clusters file is not Text. Declaring the key as Object, and writing new Text() (effectively an empty string) as the output key, sidesteps the mismatch.
With that change the result, the coordinates of the cluster centers, can be viewed normally.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.iterator.ClusterWritable;
import org.apache.mahout.common.AbstractJob;

public class ReadClusterWritable extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new ReadClusterWritable(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(arg0) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        Configuration conf = getConf();
        // Set job information: map-only job that reads the ClusterWritable SequenceFile
        // and emits the cluster centers as text.
        Job job = new Job(conf, "readclusterwritable " + getInputPath().toString());
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(RM.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0);
        job.setJarByClass(ReadClusterWritable.class);
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true)) {
            throw new InterruptedException("readclusterwritable job failed processing " + input);
        }
        return 0;
    }

    /**
     * Mapper: the input key is declared as Object (see the note above);
     * the value is the ClusterWritable whose center is written out.
     */
    public static class RM extends Mapper<Object, ClusterWritable, Text, Text> {
        // private org.slf4j.Logger logger = LoggerFactory.getLogger(RM.class);
        @Override
        public void map(Object key, ClusterWritable value, Context context)
                throws IOException, InterruptedException {
            String str = value.getValue().getCenter().asFormatString();
            // System.out.println("center********:" + str);
            // logger.info("center**********:" + str);
            context.write(new Text(), new Text(str));
        }
    }
}
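For reference, both helper jobs accept Mahout's standard -i/--input and -o/--output arguments, since addInputOption() and addOutputOption() come from AbstractJob. The rough driver sketch below shows how the pieces fit together; the class name KMeansHelperDriver and all HDFS paths are placeholders of mine, and the actual k-means step in the middle (step 2) is run separately, for example with Mahout's own k-means driver.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class KMeansHelperDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 1. Convert the whitespace-separated text data into VectorWritable sequence files.
        ToolRunner.run(conf, new Text2VectorWritable(),
                new String[] {"-i", "/kmeans/textdata", "-o", "/kmeans/vectors"});
        // 2. Run Mahout's k-means on /kmeans/vectors (not shown here); assume its final
        //    clusters end up under /kmeans/clusters/clusters-N-final, where N is the
        //    last iteration.
        // 3. Dump the cluster centers from that final clusters directory as plain text.
        ToolRunner.run(conf, new ReadClusterWritable(),
                new String[] {"-i", "/kmeans/clusters/clusters-N-final", "-o", "/kmeans/centers"});
    }
}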