Mahout: Dirichlet clustering

本文介绍了一种基于已知数据分布模型的质量聚类方法——Dirichlet聚类。该方法通过迭代调整模型参数来拟合数据,并为每个点分配到不同模型的概率。通过检查模型间的点分配情况,可以评估数据支持的模型数量,并获得软成员资格信息。

Dirichlet clustering starts with a data set of points and a ModelDistribution. Think of ModelDistribution as a class that generates different models. You create an empty model and try to assign points to it. When this happens, the model crudely grows or shrinks its parameters to try and fit the data. Once it does this for all points, it re-estimates the parameters of the model precisely using all the points and a partial probability of the point belonging to the model.

 

At the end of each pass, you get a number of samples that contain the probabilities, models, and assignment of points to models. These samples could be regarded as clusters, and they provide information about the models and their parameters, such as their shape and size. Moreover, by examining the number of models in each sample that have some points assigned to them, you can get information about how many models (clusters) the data supports. Also, by examining how often two points are assigned to the same model, you can get an approximate measure of how likely these points are to be explained by the same model. Such soft-membership information is a side product of using  model-based clustering. Dirichlet clustering is able to capture the partial probabilities of points belonging to various models.

 

Dirichlet clustering is a powerful way of getting quality clusters using known data distribution models. In Mahout, the algorithm is a pluggable framework, so different models can be created and tested. As the models become more complex there’s a chance of things slowing down on huge data sets, and at this point you’ll have to fall back on other clustering algorithms. But after seeing the output of Dirichlet cluster-
ing, you can clearly decide whether the algorithm we choose should be fuzzy or rigid, overlapping or hierarchical, whether the distance measure should be Manhattan or cosine, and what the threshold for convergence should be. Dirichlet clustering is both a data-understanding tool and a great data clustering tool.

 

bin/mahout dirichlet 
-i mahout/reuters-vectors/tfidf-vectors 
-o mahout/reuters-dirichlet-clusters 
-k 60 -x 10 -a0 1.0 
-md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution 
-mp org.apache.mahout.math.SequentialAccessSparseVector

 

 

 

 

 

评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值