A Chinese word count with IK Analyzer, implemented in Java on the Hadoop MapReduce framework:
1. Create a ChineseWordCount class.
2. Inside it, define a private static class CWCMapper that extends Mapper and overrides its map method.
PS: Mapper's four type parameters are: the input key type (usually LongWritable, the byte offset of the line), the input value type, the output key type, and the output value type.
// requires imports: java.io.*, org.apache.hadoop.io.{LongWritable, Text, IntWritable},
// org.apache.hadoop.mapreduce.Mapper, org.wltea.analyzer.core.{IKSegmenter, Lexeme}
private static class CWCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        /*
         * Mind the character encoding: the Dream of the Red Chamber txt files
         * used here are GBK-encoded, so the bytes must be re-decoded.
         * Do NOT convert the Text to a String first -- this fails:
         *   String str = value.toString(); str = new String(str.getBytes(), "charset");
         * Instead, decode the raw bytes from value.getBytes() directly.
         */
        // getBytes() returns the backing array, which may hold stale trailing
        // bytes, so limit the decode to value.getLength()
        byte[] bt = value.getBytes();
        // all of the source txt files are GBK-encoded
        String str = new String(bt, 0, value.getLength(), "gbk");
        Reader read = new BufferedReader(new StringReader(str));
        // true enables IK Analyzer's smart (coarse-grained) segmentation mode
        IKSegmenter iks = new IKSegmenter(read, true);
        // emit (token, 1) for every word the segmenter produces
        Lexeme lexeme;
        while ((lexeme = iks.next()) != null) {
            word.set(lexeme.getLexemeText());
            context.write(word, one);
        }
    }
}
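The comment in map above warns against round-tripping through toString(). A minimal standalone sketch of why decoding the raw bytes works (plain JDK, no Hadoop; the sample string is just illustrative):

```java
import java.nio.charset.StandardCharsets;

public class GbkDecodeDemo {
    public static void main(String[] args) throws Exception {
        String original = "红楼梦";                    // sample text standing in for a line of the file
        byte[] gbkBytes = original.getBytes("GBK");    // the bytes as they sit in a GBK-encoded file

        // Wrong: interpreting GBK bytes as UTF-8 garbles the text, and the
        // damage cannot be undone by re-encoding the garbled String afterwards.
        String garbled = new String(gbkBytes, StandardCharsets.UTF_8);

        // Right: decode the raw bytes with the charset the file was written in.
        String decoded = new String(gbkBytes, "GBK");

        System.out.println(original.equals(garbled)); // false
        System.out.println(original.equals(decoded)); // true
    }
}
```

This is why the mapper decodes value.getBytes() directly: once the bytes have been mis-decoded into a String, the original GBK bytes are gone.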