Mahout: K-means clustering

本文详细介绍了K-means算法的工作原理及其参数设置方法,并探讨了如何使用Canopy预聚类来估算K-means中初始聚类数量k的最佳值。文中还提供了使用Mahout实现这两种算法的具体步骤。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

K-means Algorithm

The k-means algorithm will start with an initial set of k centroid points. The algorithm does multiple rounds of processing and refines the centroid locations until the iteration max-limit criterion is reached or until the centroids converge to a fixed point from which they don’t move very much.


 

K-generation

  • centroids are randomly generated using RandomSeedGenerator

For good quality clustering using k-means, you’ll need to estimate a value for k. An approximate way of estimating k is to figure it out based on the data you have and the size of clusters you need. In the preceding example, where we have about a million news articles, if there are an average of 500 news articles published about every unique story, you should start your clustering with a k value of about 2,000 (1,000,000/500).

 

  • Finding the perfect k using canopy clustering

Canopy clustering’s strength lies in its ability to create clusters extremely quickly—it can do this with a single pass over the data. But its strength is also its weakness. This algorithm may not give accurate and precise clusters. But it can give the optimal number of clusters without even specifying the number of clusters, k, as required by k-means.

The algorithm uses a fast distance measure and two distance thresholds, T1 and T2, with T1 > T2. It begins with a data set of points and an empty list of canopies, and then iterates over the data set, creating canopies in the process. During each iteration, it removes a point from the data set and adds a canopy to the list with that point as the center. It loops through the rest of the points one by one. For each one, it calculates the distances to all the canopy centers in the list. If the distance between the point and any canopy center is within T1, it’s added into that canopy. If the distance is within T2, it’s removed from the list and thereby prevented from forming a new canopy in the subsequent loops. It repeats this process until the list is empty.
This approach prevents all points close to an already existing canopy (distance < T2) from being the center of a new canopy. It’s detrimental to form another redundant canopy in close proximity to an existing one.



 

 

mahout canopy 
-i reuters-vectors/tfidf-vectors \
-o reuters-canopy-centroids \
-dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
-t1 1500 -t2 2000

 

 

mahout kmeans 
-i mahout/reuters-vectors/tfidf-vectors 
-o mahout/reuters-kmeans-clusters 
-dm org.apache.mahout.common.distance.TanimotoDistanceMeasure 
-c mahout/reuters-canopy-centroids/clusters-0-final 
-cd 0.1 
-ow 
-x 20 
-cl

 

 

 

 mahout clusterdump 
-i mahout/reuters-kmeans-clusters/clusters-1-final 
-o mahout/reuters-kmeans-clusters-dump 
-dt sequencefile 
-d mahout/reuters-vectors/dictionary.file-* 
-p mahout/reuter-kmeans-clusters/clusteredPoints
-b 10 
-n 10

 mahout clusterdump usage



 

 Note: you must add -cl in kmean command otherwise there has no clusteredPoints dir. See

http://lucene.472066.n3.nabble.com/where-are-the-points-in-each-cluster-kmeans-clusterdump-tc838683.html#none

 

 

 

 

 

 

 

 

 

 

 

 

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值