Implementing the K-means Algorithm on the MapReduce Framework

日拱一卒的Alex

Categories: MapReduce, Big Data, Top 10 Data Mining Algorithms. Tags: JAVA, Hadoop, mapreduce, k-means, parallel processing

Article link: https://blog.youkuaiyun.com/u012808902/article/details/77396905

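1. An informal description of the K-means algorithm

Given a set of N objects to be grouped into K clusters, the k-means algorithm performs the following steps:

1) Partition the N objects into K non-empty subsets.

2) Compute the centroid of each cluster in the current partition (the centroid is the center, i.e. the mean point, of the cluster).

3) Assign each object to the cluster with the nearest centroid.

4) If no object was reassigned, stop. Otherwise return to step 2.

The algorithm iterates until the centroids no longer change, at which point the K clusters we want have been found.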
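The steps above can be sketched on a single machine before moving to MapReduce. This is a minimal illustration, not the article's implementation: the class name KMeansSketch, its method names, the maxIterations cap, and the sample points are all assumptions, and step 1 is replaced by starting from given initial centroids.

```java
import java.util.Arrays;

// Single-machine sketch of the k-means loop (hypothetical class, not from the original post).
class KMeansSketch {

    // Squared Euclidean distance; sufficient for deciding which centroid is closest.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the centroid closest to the given point (ties go to the lower index).
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    // Repeats assign/update until no assignment changes or maxIterations is reached.
    static double[][] cluster(double[][] points, double[][] initial, int maxIterations) {
        int k = initial.length, d = initial[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) centroids[c] = initial[c].clone();
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean moved = false;
            for (int p = 0; p < points.length; p++) {       // step 3: assign to nearest centroid
                int c = nearest(points[p], centroids);
                if (c != assignment[p]) { assignment[p] = c; moved = true; }
            }
            if (!moved) break;                               // step 4: nothing was reassigned
            double[][] sum = new double[k][d];               // step 2: recompute the means
            int[] count = new int[k];
            for (int p = 0; p < points.length; p++) {
                count[assignment[p]]++;
                for (int j = 0; j < d; j++) sum[assignment[p]][j] += points[p][j];
            }
            for (int c = 0; c < k; c++) {
                if (count[c] == 0) continue;                 // empty cluster keeps its old centroid
                for (int j = 0; j < d; j++) centroids[c][j] = sum[c][j] / count[c];
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1, 2}, {9, 9}, {10, 9}};
        double[][] result = cluster(points, new double[][]{{0, 0}, {10, 10}}, 100);
        System.out.println(Arrays.deepToString(result));
    }
}
```

With these four made-up points and the two starting centroids, the loop settles after one update, with one centroid per pair of nearby points.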

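2. The K-means distance function

Euclidean distance is used. For two points a(x1, y1) and b(x2, y2) in the two-dimensional plane, the Euclidean distance is:

$$d(a, b) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$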
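The distance helper in section 3.5 works on vectors of any dimension, so it may help to state the general form. The notation here is an assumption for illustration: a = (a_1, ..., a_n) and b = (b_1, ..., b_n) are two points with n coordinates each.

```latex
d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}
```

For n = 2 this is the planar distance between a(x1, y1) and b(x2, y2).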

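3. The MapReduce solution

1. The main function reads the centroid file.

2. The centroids are put into the Configuration as a string.

3. The mapper class overrides the setup method, reads the centroid string from the Configuration, and parses it into a two-dimensional array that represents the centroids.

4. The map method of the mapper class reads the sample file, compares each sample against all centroids to find the nearest one, and emits <centroid, sample>.

5. The reducer class recomputes the centroids. If a recomputed centroid matches the centroid that came in, a custom counter is incremented by 1.

6. main reads the counter value and checks whether it equals the number of centroids. If it does not, another iteration runs; otherwise the program exits.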
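Steps 2 and 3 depend on a string format for the centroids: coordinates joined by "," and centroids joined by a tab. The round trip can be sketched in plain Java as follows. CentroidCodec and its two methods are hypothetical names used only for this illustration.

```java
import java.util.Arrays;

// Hypothetical helper showing the centroid string format passed through the Configuration:
// coordinates separated by "," and centroids separated by a tab.
class CentroidCodec {

    // Turns a centroid matrix into one string, e.g. two 2-D centroids become "x,y<TAB>x,y".
    static String encode(double[][] centroids) {
        StringBuilder sb = new StringBuilder();
        for (int c = 0; c < centroids.length; c++) {
            if (c > 0) sb.append('\t');
            for (int j = 0; j < centroids[c].length; j++) {
                if (j > 0) sb.append(',');
                sb.append(centroids[c][j]);
            }
        }
        return sb.toString();
    }

    // Parses the string back into a matrix, one row per centroid.
    static double[][] decode(String encoded) {
        String[] rows = encoded.split("\t");
        double[][] centroids = new double[rows.length][];
        for (int c = 0; c < rows.length; c++) {
            String[] fields = rows[c].split(",");
            centroids[c] = new double[fields.length];
            for (int j = 0; j < fields.length; j++) {
                centroids[c][j] = Double.parseDouble(fields[j]);
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        String s = encode(new double[][]{{1.0, 1.5}, {9.5, 9.0}});
        System.out.println(s.replace("\t", "<TAB>"));
        System.out.println(Arrays.deepToString(decode(s)));
    }
}
```

Passing the centroids as one string keeps the job simple, but it only suits a small number of centroids, since the whole string is copied to every task with the job configuration.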

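3.1 Preparation stage (reading the centroid file)

There are two kinds of reads: the first reads the centroid file supplied by the user, and later ones read the centroid files written by the reduce phase.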

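import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class Center {
	protected static int k = 2; // number of centroids; every round outputs two centroids

	// Load the initial centroids from a file stored on HDFS
	public String loadInitCenter(Path path) throws Exception {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(new URI("hdfs://example:9000"), conf, "hadoop");
		StringBuilder sb = new StringBuilder();
		appendCenters(fs, path, conf, sb);
		return sb.toString().trim();
	}

	// Load the centroids regenerated by the previous reduce phase
	public String loadCenter(Path path) throws Exception {
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(conf); // default file system (fs.defaultFS)
		// sb collects the k centroids from all output files, separated by \t
		StringBuilder sb = new StringBuilder();

		// list every file in the reduce output directory
		RemoteIterator<LocatedFileStatus> files = fs.listFiles(path, false);
		while (files.hasNext()) {
			LocatedFileStatus lfs = files.next();
			// skip files that do not hold centroids (e.g. _SUCCESS)
			if (!lfs.getPath().getName().contains("part"))
				continue;
			appendCenters(fs, lfs.getPath(), conf, sb);
		}
		return sb.toString().trim();
	}

	// Append every non-blank line of the file to sb, each followed by \t
	private static void appendCenters(FileSystem fs, Path path, Configuration conf, StringBuilder sb)
			throws Exception {
		// closing the LineReader also closes the underlying input stream
		try (LineReader in = new LineReader(fs.open(path), conf)) {
			Text line = new Text();
			// readLine returns the number of bytes consumed; 0 means end of file
			while (in.readLine(line) > 0) {
				String center = line.toString().trim();
				if (!center.isEmpty()) {
					sb.append(center).append("\t");
				}
			}
		}
	}
}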

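3.2 Mapper stage

1) In the preprocessing step (the setup method), fetch the current centroids.

2) For each "vector" line read from the file, compute its distance to every centroid.

3) Keep the centroid with the smallest distance to the input point.

4) Emit the centroid nearest to the input point as the key and the vector as the value.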

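	// Nested inside the K_means driver class, which also defines the FLAG configuration key (not shown).
	// Needs: java.io.IOException, org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text,
	// org.apache.hadoop.mapreduce.Mapper
	static class K_meansMapper extends Mapper<LongWritable, Text, Text, Text>{
		String centerStrArray[] = null; // each element is one centroid: its d coordinates joined by ","
		double centers[][] = null;      // centers[i][j] is coordinate j of centroid i

		// preprocessing: collect the current centroids
		@Override
		protected void setup(Context context)
				throws IOException, InterruptedException {
			// centroids produced by the previous clustering round
			String centerSource = context.getConfiguration().get(FLAG);
			System.out.println(centerSource);
			centerStrArray = centerSource.split("\t"); // one string per centroid
			// size the array from the data instead of assuming Center.k entries
			centers = new double[centerStrArray.length][];

			for(int i=0;i<centerStrArray.length;i++){
				String centerStr[] = centerStrArray[i].split(","); // coordinates of one centroid
				centers[i] = new double[centerStr.length];
				for(int j=0;j<centerStr.length;j++){
					centers[i][j] = Double.parseDouble(centerStr[j]);
				}
			}
		}

		// assign one sample to its nearest centroid
		@Override
		protected void map(LongWritable key, Text value,Context context)
				throws IOException, InterruptedException {
			String line = value.toString();
			if(line.trim().isEmpty())
				return; // ignore blank lines in the sample file
			String vector[] = line.split(",");
			double sample[] = new double[vector.length];
			for(int i=0;i<vector.length;i++){
				sample[i] = Double.parseDouble(vector[i]);
			}

			double min = Double.MAX_VALUE; // smallest distance seen so far
			int index = 0;                 // index of the centroid at that distance
			// compute the distance from the input point to each centroid and keep the nearest
			for(int i=0;i<centers.length;i++){
				double d = distance(sample,centers[i]);
				if(min > d){
					min = d;
					index = i;
				}
			}
			// emit <centroid, vector>
			context.write(new Text(centerStrArray[index]), value);
		}
	}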
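To make the data flow concrete, the sketch below imitates the map step and the shuffle on a single machine, without Hadoop. MapStepSimulation, keyFor, and shuffle are hypothetical names, and the sample lines and centroid strings are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local imitation of map() plus the shuffle; not part of the original job.
class MapStepSimulation {

    // Parses "x1,x2,...,xd" into a vector.
    static double[] parse(String csv) {
        String[] fields = csv.split(",");
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) v[i] = Double.parseDouble(fields[i]);
        return v;
    }

    // Returns the centroid string that map() would emit as the key for this sample line.
    static String keyFor(String line, String[] centerStrs) {
        double[] sample = parse(line);
        String best = null;
        double min = Double.MAX_VALUE;
        for (String centerStr : centerStrs) {
            double[] center = parse(centerStr);
            double sum = 0;
            for (int i = 0; i < center.length; i++) {
                sum += (center[i] - sample[i]) * (center[i] - sample[i]);
            }
            double d = Math.sqrt(sum);
            if (d < min) { min = d; best = centerStr; }
        }
        return best;
    }

    // Groups the sample lines by emitted key, as the shuffle does before reduce() is called.
    static Map<String, List<String>> shuffle(String[] lines, String[] centerStrs) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String line : lines) {
            groups.computeIfAbsent(keyFor(line, centerStrs), k -> new ArrayList<>()).add(line);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] lines = {"1,1", "1,2", "9,9", "10,9"};
        String[] centers = {"0,0", "10,10"};
        shuffle(lines, centers).forEach((center, samples) ->
                System.out.println("key=" + center + " values=" + samples));
    }
}
```

Each group printed here corresponds to one reduce() call: the key is the old centroid and the values are the samples assigned to it.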

3.3 Combiner stage

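After each map task, a combiner is applied to merge that task's intermediate data. For every centroid key, the combiner adds up the vectors dimension by dimension and records how many vectors it added. It forwards this partial sum and count rather than a local average, because the reducer cannot recover the true mean from averages of groups of different sizes. The combine function runs on map output when the in-memory buffer spills, so the merge happens locally and only the merged values travel over the network to reduce. This cuts network traffic considerably and makes the algorithm run faster. Hadoop may run the combiner zero, one, or several times, so its input may be raw samples or its own earlier output, and the code below accepts both.

	// Needs: java.io.IOException, org.apache.hadoop.io.Text, org.apache.hadoop.mapreduce.Reducer
	static class K_meansCombiner extends Reducer<Text, Text, Text, Text>{
		@Override
		protected void reduce(Text key, Iterable<Text> values,Context context)
				throws IOException, InterruptedException {
			int len = key.toString().split(",").length;
			double sum[] = new double[len]; // per-dimension partial sum
			long count = accumulate(values, sum);

			// emit "sum1,...,sumd;count" so that the reducer can compute an exact mean
			StringBuilder sb = new StringBuilder();
			for(int i=0;i<len;i++){
				if(i > 0)
					sb.append(",");
				sb.append(sum[i]);
			}
			sb.append(";").append(count);

			context.write(key, new Text(sb.toString()));
		}

		// Adds every value into sum[] and returns how many samples were added.
		// A value is either a raw sample "x1,...,xd" (counts as 1)
		// or a partial sum "s1,...,sd;n" written by an earlier combiner run.
		static long accumulate(Iterable<Text> values, double sum[]){
			long count = 0;
			for(Text value : values){
				String parts[] = value.toString().split(";");
				String coords[] = parts[0].split(",");
				for(int i=0;i<sum.length;i++){
					sum[i] += Double.parseDouble(coords[i]);
				}
				count += parts.length > 1 ? Long.parseLong(parts[1].trim()) : 1;
			}
			return count;
		}
	}

3.4 Reducer stage

1) Recompute the cluster centers.

2) Each reducer iterates over the values for its key, computes their mean, and outputs the mean as the next cluster center.

3) Compare the new centroid with the old one. If the difference is below the threshold, increment the custom counter by 1.

	// Needs: java.io.IOException, org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.Text,
	// org.apache.hadoop.mapreduce.Reducer
	static class K_meansReducer extends Reducer<Text, Text, Text, NullWritable>{
		@Override
		protected void reduce(Text key, Iterable<Text> values,Context context)
				throws IOException, InterruptedException {
			// the key is the centroid from the previous round, as emitted by map
			String oldCenterStr[] = key.toString().split(",");
			int len = oldCenterStr.length;
			double oldCenter[] = new double[len];
			for(int i=0;i<len;i++){
				oldCenter[i] = Double.parseDouble(oldCenterStr[i]);
			}

			// sum the coordinates of every point in the cluster and count the points
			double newCenter[] = new double[len]; // becomes the new cluster center
			long size = K_meansCombiner.accumulate(values, newCenter);

			// build the coordinate string of the new centroid
			StringBuilder sb = new StringBuilder();
			for(int i=0;i<len;i++){
				newCenter[i] /= size; // mean of this dimension
				if(i > 0)
					sb.append(",");
				sb.append(newCenter[i]);
			}

			// changed() returns true when the new centroid is within the threshold of the old one
			boolean converged = changed(oldCenter,newCenter);

			// a converged centroid adds 1 to the counter: one more final centroid is settled
			if(converged){
				// first argument is the counter group name, second is the counter name
				context.getCounter("myCounter", "kmenasCounter").increment(1L);
			}
			context.write(new Text(sb.toString()), NullWritable.get());
		}
	}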
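A combiner for an averaging job needs care, because the mean of several local means is generally not the mean of all the samples. The sketch below shows the difference on made-up one-dimensional data. PartialMean is a hypothetical class, and the split of samples between two map tasks is assumed for illustration.

```java
// Partial aggregate that can be merged safely: a running sum plus a sample count.
class PartialMean {
    final double sum;
    final long count;

    PartialMean(double sum, long count) {
        this.sum = sum;
        this.count = count;
    }

    // Builds the partial aggregate for the samples seen by one map task.
    static PartialMean of(double... values) {
        double s = 0;
        for (double v : values) s += v;
        return new PartialMean(s, values.length);
    }

    // Merging two partials adds sums and counts, so no information is lost.
    PartialMean merge(PartialMean other) {
        return new PartialMean(sum + other.sum, count + other.count);
    }

    double mean() {
        return sum / count;
    }

    public static void main(String[] args) {
        PartialMean taskA = PartialMean.of(1, 2, 3); // three samples on one map task
        PartialMean taskB = PartialMean.of(10);      // one sample on another
        double exact = taskA.merge(taskB).mean();
        double meanOfMeans = (taskA.mean() + taskB.mean()) / 2;
        System.out.println("sum and count: " + exact);       // 16 / 4
        System.out.println("mean of means: " + meanOfMeans); // (2 + 10) / 2
    }
}
```

The two results differ because the second task holds only one sample yet gets the same weight as the first when local averages are averaged again.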

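3.5 Helper methods

There are two: one decides whether the change between the old and new centroid has converged, the other is the Euclidean distance function.

	// Returns true when the new centroid is within the threshold of the old one in every
	// dimension, i.e. the centroid has converged (despite the method name)
	public static boolean changed(double oldCenter[],double newCenter[]){
		for(int i=0;i<oldCenter.length;i++){
			// compare the absolute difference so that moves in either direction are detected
			if(Math.abs(oldCenter[i] - newCenter[i]) > 0.0000001){
				return false;
			}
		}
		return true;
	}
	
	// Euclidean distance
	public static double distance(double center[],double data[]){
		double sum = 0;
		for(int i=0;i<center.length;i++){
			sum += Math.pow(center[i]-data[i], 2);
		}
		return Math.sqrt(sum);
	}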
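A threshold test on centroid movement has to compare the absolute difference in each dimension, since a signed difference is negative whenever a coordinate grows and would then never exceed the threshold. A minimal sketch of such a test follows. ConvergenceCheck, converged, and the epsilon parameter are hypothetical names for illustration.

```java
// Hypothetical stand-alone convergence test with an explicit tolerance.
class ConvergenceCheck {

    // True when every coordinate moved by at most epsilon, in either direction.
    static boolean converged(double[] oldCenter, double[] newCenter, double epsilon) {
        for (int i = 0; i < oldCenter.length; i++) {
            if (Math.abs(oldCenter[i] - newCenter[i]) > epsilon) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        double[] before = {1.0, 1.0};
        double[] after = {5.0, 1.0}; // first coordinate grew by 4
        // The signed difference 1.0 - 5.0 is negative, so only the absolute value reveals the move.
        System.out.println(converged(before, after, 1e-7));
        System.out.println(converged(before, before, 1e-7));
    }
}
```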

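3.6 The main method

1) Decide the input and output directories.

2) Read the centroid information from the file, convert it to a string, and put it into the Configuration.

3) Use the custom counter to control the number of iterations.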

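	// Needs: org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.Path,
	// org.apache.hadoop.io.NullWritable, org.apache.hadoop.io.Text,
	// org.apache.hadoop.mapreduce.Counter, org.apache.hadoop.mapreduce.Job,
	// org.apache.hadoop.mapreduce.lib.input.FileInputFormat,
	// org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
	public static void main(String[] args) throws Exception {
		Path inputPath = new Path("/kmeans/input");
		Path centerPath = new Path("/kmeans/output/center.txt");
		Center center = new Center();
		
		String centerStr = center.loadInitCenter(centerPath); // load the initial centroids
		
		int index = 0;
		while(true){
			Configuration conf = new Configuration();
			conf.set(FLAG, centerStr); // pass the centroids to the tasks through the Configuration
			
			// switch from the initial centroid file to this round's reduce output directory,
			// which is also where the next round's centroids are read from
			centerPath = new Path("/kmeans/output"+index);
			
			Job job = Job.getInstance(conf, "kmeans" + index);
			job.setJarByClass(K_means.class);
			
			job.setMapperClass(K_meansMapper.class);
			job.setReducerClass(K_meansReducer.class);
			job.setCombinerClass(K_meansCombiner.class);
			
			job.setMapOutputKeyClass(Text.class);
			job.setMapOutputValueClass(Text.class);
			job.setOutputKeyClass(Text.class);
			job.setOutputValueClass(NullWritable.class);
			
			FileInputFormat.setInputPaths(job, inputPath);
			FileOutputFormat.setOutputPath(job, centerPath);
			// submit and wait; stop if the job failed
			if(!job.waitForCompletion(true))
				System.exit(1);
			
			// read the custom counter ("myCounter", "kmenasCounter");
			// a value equal to k means every centroid has settled and this is the final result
			Counter counter = job.getCounters().getGroup("myCounter").findCounter("kmenasCounter");
			long countValue = counter.getValue();
			if(countValue == Center.k)
				break;
			
			// not converged yet: reload the centroids written by reduce instead of the initial file
			// (each job has its own counters, so nothing needs to be reset)
			centerStr = center.loadCenter(centerPath);
			index++;
		}
	}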
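The loop in main stops only when the counter equals k, so it never ends if the centroids keep oscillating, or if a cluster becomes empty (its centroid then never reaches the reducer and the counter cannot reach k). One possible safeguard is an upper bound on the number of rounds. The sketch below shows the control flow only: IterationControl, iterate, and the runJob callback (which stands in for submitting one job and reading its counter) are hypothetical and not part of the original code.

```java
import java.util.function.IntToLongFunction;

// Hypothetical driver loop with an iteration cap; no Hadoop calls are made here.
class IterationControl {

    // runJob.applyAsLong(i) stands in for running job i and returning the counter value,
    // i.e. the number of centroids that did not move in that round.
    // Returns the number of jobs that were run.
    static int iterate(IntToLongFunction runJob, int k, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            long unchanged = runJob.applyAsLong(i);
            if (unchanged == k) {
                return i + 1; // all k centroids settled in round i
            }
        }
        return maxIterations; // gave up after the cap
    }

    public static void main(String[] args) {
        long[] simulatedCounters = {0, 1, 2}; // made-up counter values for three rounds
        int rounds = iterate(i -> simulatedCounters[i], 2, 10);
        System.out.println("jobs run: " + rounds);
    }
}
```

In a real driver the cap would sit alongside the counter check, so that the job chain ends either on convergence or after a fixed number of rounds.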