Mahout Clustering Example

asin929

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/u012948976/article/details/50263343

Data preparation

Typical raw input looks like the following; each line holds one set of feature values.

Data preprocessing

mahout org.apache.mahout.clustering.conversion.InputDriver -i /user/hdfs/cluster_all/test4/text -o /user/hdfs/cluster_all/test4/text-vector

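Appending the parameter "-v <vector class>" to the command above selects the output vector format. For example, "-v org.apache.mahout.math.RandomAccessSparseVector" converts the data to RandomAccessSparseVector.

Use seqdumper to inspect the converted data: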

mahout seqdumper -i /user/hdfs/cluster_all/test4/text-vector/part-m-00000 -o ./text-vector

The result is:

Key: 3: Value: {0:28.0,1:38.0,2:88.0}
Key: 3: Value: {0:88.0,1:88.0,2:88.0}
Key: 3: Value: {0:8.0,1:88.0,2:89.0}
Key: 3: Value: {0:8.0,1:78.0,2:80.0}
Count: 4

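Note that the Key above is not the id of each line; it is the number of features on that line. The Mahout source file InputMapper.java shows why: while processing a line, Mahout has no line id available, so it writes the number of features on the line as the Key. That is what produces the output above.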
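The behaviour described above can be reproduced without Hadoop. The following is a minimal sketch, not Mahout's actual source: the class name KeyDemo and the method keyFor are hypothetical, and the code only mirrors the idea that the mapper counts the values it parsed and writes that count as the key.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyDemo {
    // Hypothetical helper: returns the key that a feature-count-based
    // mapper would emit for one space-separated input line.
    public static String keyFor(String line) {
        List<Double> values = new ArrayList<>();
        for (String token : line.split(" ")) {
            if (!token.isEmpty()) {
                values.add(Double.valueOf(token));
            }
        }
        // No line id is available at this point, so the number of
        // parsed features is used as the key.
        return String.valueOf(values.size());
    }

    public static void main(String[] args) {
        System.out.println(keyFor("28 38 88")); // prints 3
        System.out.println(keyFor("88 88 88")); // prints 3
    }
}
```

Every three-feature line yields the same key, which is why all four records in the dump share the key 3.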

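Data preprocessing (modified version)

Preparing the data

Prepare the data again so that the Key after vectorization is the id of the line.
The data looks like the following: the first column is the id and the remaining columns are the features. The first column is separated from the rest by a Tab, and the feature values are separated by spaces.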

ip	28 88 38
ip2	88 88 88
ip3	8 88 89
ip4	8 78 80

Changing the vectorization

Modify two places in InputMapper.java, as shown below.

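// Change 1: split each line at the tab into the id and the feature string
String[] line = values.toString().split("\t");
String id = line[0];
String characters = line[1];
String[] numbers = SPACE.split(characters);

// Change 2: write the line id as the key instead of the feature count
context.write(new Text(id), vectorWritable);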

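Compile the file and replace the corresponding class in mahout-examples-0.9-job.jar (the post on modifying Mahout's classification metrics describes the procedure). The result looks like this: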

Key: ip: Value: {0:28.0,1:88.0,2:38.0}
Key: ip2: Value: {0:88.0,1:88.0,2:88.0}
Key: ip3: Value: {0:8.0,1:88.0,2:89.0}
Key: ip4: Value: {0:8.0,1:78.0,2:80.0}
Count: 4
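The parsing step of the modified mapper can be tried on its own. This is a minimal sketch under stated assumptions: the class IdLineParser and its parse method are hypothetical names, it uses plain Java types instead of Hadoop's Text and VectorWritable, and it assumes every line contains exactly one Tab.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;

public class IdLineParser {
    // Hypothetical helper: splits "id<TAB>f1 f2 f3" into the id and
    // an array of feature values.
    public static Map.Entry<String, double[]> parse(String line) {
        String[] parts = line.split("\t");
        String id = parts[0];
        double[] features = Arrays.stream(parts[1].trim().split(" +"))
                .mapToDouble(Double::parseDouble)
                .toArray();
        return new SimpleEntry<>(id, features);
    }

    public static void main(String[] args) {
        Map.Entry<String, double[]> entry = parse("ip\t28 88 38");
        System.out.println(entry.getKey() + " -> " + Arrays.toString(entry.getValue()));
        // prints: ip -> [28.0, 88.0, 38.0]
    }
}
```

A line without a Tab makes parts[1] fail with an ArrayIndexOutOfBoundsException, so input that may contain malformed lines needs a length check first.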

Clustering

mahout kmeans -i /user/hdfs/cluster_all/test4/text-vector -o /user/hdfs/cluster_all/test4/kmeans -c /user/hdfs/cluster_all/test4/initialcenter --maxIter 5 -k 2 -cl

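-i: path to the input vectors
-o: path for the clustering output
-c /user/hdfs/cluster_all/test4/initialcenter: path to the initial centers; when -k is also given, Mahout first writes k randomly sampled input vectors to this path
-k 2: number of clusters
--maxIter 5: maximum number of iterations
-cl: after the iterations finish, assign the input points to clusters (this produces the clusteredPoints directory)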
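To make the roles of k and the iteration limit concrete, here is a minimal in-memory k-means sketch. It is not Mahout's implementation: KMeansSketch is a hypothetical class, it takes the first k points as initial centers (an assumption made to keep the run deterministic, whereas Mahout samples them randomly when -k is given), and it uses squared Euclidean distance.

```java
import java.util.Arrays;

public class KMeansSketch {
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the cluster index of each point after at most maxIter iterations.
    public static int[] cluster(double[][] points, int k, int maxIter) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone(); // assumption: first k points as initial centers
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest center.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[p], centers[c])
                            < squaredDistance(points[p], centers[best])) {
                        best = c;
                    }
                }
                if (assignment[p] != best) {
                    assignment[p] = best;
                    changed = true;
                }
            }
            // Update step: move each center to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                for (int d = 0; d < dim && counts[c] > 0; d++) {
                    centers[c][d] = sums[c][d] / counts[c];
                }
            }
            if (!changed && iter > 0) {
                break; // assignments are stable, stop before maxIter
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] points = {{28, 88, 38}, {88, 88, 88}, {8, 88, 89}, {8, 78, 80}};
        System.out.println(Arrays.toString(cluster(points, 2, 5)));
        // prints [0, 1, 0, 0]
    }
}
```

With the four sample points in the order ip, ip2, ip3, ip4, the sketch puts ip2 in one cluster and the other three points in the other.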

Viewing the results

mahout clusterdump -i /user/hdfs/cluster_all/test4/kmeans/clusters-3-final -o cluster.csv -p /user/hdfs/cluster_all/test4/kmeans/clusteredPoints

The output looks like this:

VL-2{n=1 c=[88.000, 88.000, 88.000] r=[0:37.712, 2:0.471]}
        Weight : [props - optional]:  Point:
        1.0 : [distance=0.0]: ip2 = [88.000, 88.000, 88.000]
VL-3{n=3 c=[14.667, 84.667, 69.000] r=[9.428, 4.714, 22.226]}
        Weight : [props - optional]:  Point:
        1.0 : [distance=1149.8888888888869]: ip = [28.000, 88.000, 38.000]
        1.0 : [distance=455.55555555555475]: ip3 = [8.000, 88.000, 89.000]
        1.0 : [distance=209.8888888888905]: ip4 = [8.000, 78.000, 80.000]

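VL-2 is the cluster id, n=1 is the number of points in the cluster, c=[88.000, 88.000, 88.000] is the center, and r=[0:37.712, 2:0.471] is the cluster radius. The lines that follow list the points in each cluster in detail; they appear because the -p parameter points clusterdump at the location of the per-point assignment files.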
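The numbers printed for VL-3 are consistent with c being the per-dimension mean of the member points, r the per-dimension standard deviation (population form), and distance the squared Euclidean distance from a point to the center. This is an observation checked against the sample output above, not a statement taken from the Mahout source. The sketch below recomputes the three quantities; ClusterStats and its methods are hypothetical names.

```java
import java.util.Locale;

public class ClusterStats {
    // Per-dimension mean of the points.
    public static double[] mean(double[][] points) {
        double[] m = new double[points[0].length];
        for (double[] p : points) {
            for (int d = 0; d < m.length; d++) {
                m[d] += p[d] / points.length;
            }
        }
        return m;
    }

    // Per-dimension population standard deviation around the center.
    public static double[] radius(double[][] points, double[] center) {
        double[] r = new double[center.length];
        for (double[] p : points) {
            for (int d = 0; d < r.length; d++) {
                double diff = p[d] - center[d];
                r[d] += diff * diff / points.length;
            }
        }
        for (int d = 0; d < r.length; d++) {
            r[d] = Math.sqrt(r[d]);
        }
        return r;
    }

    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return sum;
    }

    static String fmt(double[] v) {
        return String.format(Locale.ROOT, "[%.3f, %.3f, %.3f]", v[0], v[1], v[2]);
    }

    public static void main(String[] args) {
        // Members of VL-3: ip, ip3, ip4
        double[][] points = {{28, 88, 38}, {8, 88, 89}, {8, 78, 80}};
        double[] c = mean(points);
        System.out.println("c=" + fmt(c));                 // c=[14.667, 84.667, 69.000]
        System.out.println("r=" + fmt(radius(points, c))); // r=[9.428, 4.714, 22.226]
        System.out.println(String.format(Locale.ROOT, "distance(ip)=%.3f",
                squaredDistance(points[0], c)));           // distance(ip)=1149.889
    }
}
```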

Appending the parameter "-of CSV" to the command above saves the result as a CSV file:

mahout clusterdump -i /user/hdfs/cluster_all/test4/kmeans/clusters-3-final -o cluster.csv -of CSV -p /user/hdfs/cluster_all/test4/kmeans/clusteredPoints

The result is shown below. Each line contains a cluster id followed by the ids of the points in that cluster.

2,ip2
3,ip,ip3,ip4
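A file in this layout is easy to load for further processing. The following is a minimal sketch, assuming each line has the form "clusterId,pointId,pointId,..." exactly as in the sample above; ClusterCsv and parse are hypothetical names, and reading the lines from cluster.csv is left out so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ClusterCsv {
    // Maps each cluster id to the list of point ids on its line.
    public static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.trim().split(",");
            List<String> members =
                    new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
            clusters.put(fields[0], members);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("2,ip2", "3,ip,ip3,ip4");
        System.out.println(parse(lines));
        // prints {2=[ip2], 3=[ip, ip3, ip4]}
    }
}
```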

The command also accepts many other optional parameters. One of the more important is "-e", which evaluates the clustering result.

mahout clusterdump -i /user/hdfs/cluster_all/test4/kmeans/clusters-3-final  -p /user/hdfs/cluster_all/test4/kmeans/clusteredPoints  -e

The result is:

Inter-Cluster Density: NaN
Intra-Cluster Density: 0.4968738815456574
CDbw Inter-Cluster Density: 0.0
CDbw Intra-Cluster Density: 0.09522621793444969
CDbw Separation: 11499.777777777774