Testing KMeans on Hadoop (Part 2): Analyzing the Output

zstarstone

License: CC 4.0 BY-SA

Categories: Big Data, JAVA (original article)

Article link: https://blog.youkuaiyun.com/ShiZhixin/article/details/8984048

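This article analyzes in detail the output and execution flow of a KMeans clustering implementation on Hadoop MapReduce. The analysis shows that the Map phase picks the nearest center for each point and emits it, while the Reduce phase updates the centers. The final output reports 5 iterations and illustrates how KMeans runs.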

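The previous post, Testing KMeans on Hadoop (Part 1): Running the Source Example, showed how to run the code. This post analyzes the output of the whole MapReduce run. The test data file is still the 15 data points from Part 1: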

(20,30) (50,61) (20,32) (50,64) (59,67) (24,34) (19,39) (20,32) (50,65) (50,77) (20,30) (20,31) (20,32) (50,64) (50,67)

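First, here is a flowchart of the program as I understand it. Pay particular attention to the <key, value> input and output at each stage.

Now for the output itself. Lines wrapped in --***-- are println statements that I added to the program.

--main::start-- // Entering the main function in KMeans

--CenterInitial::run-- // Entering CenterInitial.java to initialize the cluster centers: CenterInitial centerInitial = new CenterInitial();

CenterInitial::The initial center is:(50,61) (50,64) (20,30) // At initialization, K distinct center points are chosen at random and stored in the center file in HDFS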
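The initialization step described above (choosing K distinct centers at random) can be sketched in plain Java. This is only an illustration under stated assumptions, not the code in CenterInitial.java: the class name `InitialCenters` and the method `pick` are hypothetical, points are kept as strings, and writing the result to HDFS is left out. Because the test data contains repeated points, duplicates are removed before sampling.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the original program.
public class InitialCenters {
    // Pick k distinct points at random. Duplicates in the input are dropped
    // first, so the same coordinates can never be chosen twice.
    static List<String> pick(List<String> points, int k, Random rng) {
        List<String> distinct = new ArrayList<>(new LinkedHashSet<>(points));
        if (k > distinct.size()) {
            throw new IllegalArgumentException("fewer than k distinct points");
        }
        Collections.shuffle(distinct, rng);
        return new ArrayList<>(distinct.subList(0, k));
    }

    public static void main(String[] args) {
        List<String> points = Arrays.asList(
            "(20,30)", "(50,61)", "(20,32)", "(50,64)", "(59,67)",
            "(24,34)", "(19,39)", "(20,32)", "(50,65)", "(50,77)",
            "(20,30)", "(20,31)", "(20,32)", "(50,64)", "(50,67)");
        // The result changes from run to run because the choice is random.
        System.out.println(pick(points, 3, new Random()));
    }
}
```

Passing a seeded `Random` makes the choice repeatable, which can help when comparing runs.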

// Once initialization is complete, the job starts and enters the Map --> Reduce process

13/05/28 11:31:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

13/05/28 11:31:33 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

13/05/28 11:31:33 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

13/05/28 11:31:33 INFO input.FileInputFormat: Total input paths to process : 1

13/05/28 11:31:33 WARN snappy.LoadSnappy: Snappy native library not loaded

13/05/28 11:31:33 INFO mapred.JobClient: Running job: job_local_0001

13/05/28 11:31:33 INFO util.ProcessTree: setsid exited with exit code 0

13/05/28 11:31:33 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@6754d6

13/05/28 11:31:33 INFO mapred.MapTask: io.sort.mb = 100

13/05/28 11:31:33 INFO mapred.MapTask: data buffer = 79691776/99614720

13/05/28 11:31:33 INFO mapred.MapTask: record buffer = 262144/327680

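// Execution now enters KMapper.java. The setup function is called first: it reads in the initial cluster centers and stores them in center, a class-level variable of KMapper. As for why the framework calls setup automatically, the Hadoop API documentation explains: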

/* The framework first calls setup(org.apache.hadoop.mapreduce.Mapper.Context),
followed by map(Object, Object, Context) for each key/value pair in the InputSplit.
Finally cleanup(Context) is called.*/

--Mapper::setup--start--

--Mapper::setup--end--

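--Mapper::map--start-- // After setup finishes, the map function is called. Debugging shows that the input to map is <key, value> = <0, the 15 data points in the file cluster>, which is the <key, value> pair the framework reads by default. The map function processes it and emits the following <key, value> pairs: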

center[pos]:(20,30)outvalue:(20,30)

center[pos]:(50,61)outvalue:(50,61)

center[pos]:(20,30)outvalue:(20,32)

center[pos]:(50,64)outvalue:(50,64)

center[pos]:(50,64)outvalue:(59,67)

center[pos]:(20,30)outvalue:(24,34)

center[pos]:(20,30)outvalue:(19,39)

center[pos]:(20,30)outvalue:(20,32)

center[pos]:(50,64)outvalue:(50,65)

center[pos]:(50,64)outvalue:(50,77)

center[pos]:(20,30)outvalue:(20,30)

center[pos]:(20,30)outvalue:(20,31)

center[pos]:(20,30)outvalue:(20,32)

center[pos]:(50,64)outvalue:(50,64)

center[pos]:(50,64)outvalue:(50,67)

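// As the output shows, the value of each emitted <key, value> pair is one of the 15 data points, and its key is the center that lies closest to that point among all the centers

--Mapper::map--end-- // map finishes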
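The assignment rule seen in the map output above (each point is keyed by its closest center) can be sketched without any Hadoop classes. This is a minimal illustration, not the article's KMapper code: the class name `NearestCenter` and its methods are hypothetical, and Euclidean distance is assumed as the distance measure.

```java
// Hypothetical helper, not part of the original program.
public class NearestCenter {
    // Squared Euclidean distance; the square root is not needed
    // when distances are only compared with each other.
    static double dist2(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    // Index of the center closest to p (the first one wins on ties).
    static int nearest(double[] p, double[][] centers) {
        int pos = 0;
        for (int i = 1; i < centers.length; i++) {
            if (dist2(p, centers[i]) < dist2(p, centers[pos])) {
                pos = i;
            }
        }
        return pos;
    }

    public static void main(String[] args) {
        // The three initial centers from the log above.
        double[][] centers = {{50, 61}, {50, 64}, {20, 30}};
        double[][] points = {{59, 67}, {24, 34}, {50, 77}};
        for (double[] p : points) {
            double[] c = centers[nearest(p, centers)];
            System.out.println("center:(" + (int) c[0] + "," + (int) c[1] + ")"
                + " point:(" + (int) p[0] + "," + (int) p[1] + ")");
        }
    }
}
```

For these three sample points the sketch chooses (50,64), (20,30) and (50,64), which agrees with the corresponding center[pos] lines in the log.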

13/05/28 11:31:33 INFO mapred.MapTask: Starting flush of map output

13/05/28 11:31:33 INFO mapred.MapTask: Finished spill 0

13/05/28 11:31:33 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting

13/05/28 11:31:34 INFO mapred.JobClient: map 0% reduce 0%

13/05/28 11:31:36 INFO mapred.LocalJobRunner:

13/05/28 11:31:36 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.

13/05/28 11:31:36 INFO mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@78bc3b

13/05/28 11:31:36 INFO mapred.LocalJobRunner:

13/05/28 11:31:36 INFO mapred.Merger: Merging 1 sorted segments

13/05/28 11:31:36 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 272 bytes

13/05/28 11:31:36 INFO mapred.LocalJobRunner:

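--KReducer::reduce--start--(20,30)org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@12c3327 // The reduce step begins. The key passed to this reduce call is the cluster center (20,30), and the value is the 8 data points that map found to be closest to this center. There are three cluster centers, so reduce is called three times

(20.375,32.5) // The new center for this group of points, printed before KReducer finishes

key:(20,30)outval+center:(20,30) (20,32) (24,34) (19,39) (20,32) (20,30) (20,31) (20,32) (20.375,32.5) // This is the <key, value> emitted by reduce

--KReducer::reduce--end-- // As I understand it, reduce merges the data points that belong to its input key. The merging happens in three separate reduce calls, and stepping through in a debugger also shows the three centers being processed one at a time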
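The new center printed by the first reduce call is consistent with taking the coordinate-wise mean of the 8 points in the group, which is the usual KMeans update. The sketch below reproduces that arithmetic in plain Java. It is an assumption about what KReducer computes, not its actual source, and the class name `NewCenter` and method `mean` are hypothetical.

```java
// Hypothetical helper, not part of the original program.
public class NewCenter {
    // Coordinate-wise mean of a non-empty group of 2-D points.
    static double[] mean(double[][] points) {
        if (points.length == 0) {
            throw new IllegalArgumentException("empty group");
        }
        double sumX = 0;
        double sumY = 0;
        for (double[] p : points) {
            sumX += p[0];
            sumY += p[1];
        }
        return new double[] {sumX / points.length, sumY / points.length};
    }

    public static void main(String[] args) {
        // The 8 points grouped under the center (20,30) in the log above.
        double[][] group = {
            {20, 30}, {20, 32}, {24, 34}, {19, 39},
            {20, 32}, {20, 30}, {20, 31}, {20, 32}};
        double[] c = mean(group);
        System.out.println("(" + c[0] + "," + c[1] + ")"); // prints (20.375,32.5)
    }
}
```

The x coordinates sum to 163 and the y coordinates to 260, so dividing by 8 gives (20.375,32.5), the value shown in the log.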

--KReducer::reduce--start--(50,61)org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@12c3327 // As above, this is the second reduce call, which merges the data points assigned to the center (50,61)

(50.0,61.0)

key:(50,61)outval+center:(50,61) (50.0,61.0)

--KReducer::reduce--end--

--KReducer::reduce--start--(50,64)org.apache.hadoop.mapreduce.ReduceContext$ValueIterable@12c3327 // As above, this is the third reduce call, which merges the data points assigned to the center (50,64)

(51.5