Mahout: Dirichlet clustering

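This article introduces Dirichlet clustering, a method for producing quality clusters from known data distribution models. The method iteratively adjusts model parameters to fit the data and gives each point a probability of belonging to each model. By examining how points are assigned across models, you can estimate how many models the data supports and obtain soft-membership information.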

Dirichlet clustering starts with a data set of points and a ModelDistribution. Think of ModelDistribution as a class that generates different models. You create an empty model and try to assign points to it. When this happens, the model crudely grows or shrinks its parameters to try to fit the data. Once it has done this for all points, it re-estimates the parameters of the model precisely, using all the points and the partial probability of each point belonging to the model.
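The pass described above can be sketched in a few lines of Python. This is a simplified illustration, not Mahout's implementation: it assumes 1-D Gaussian models, samples a hard assignment for each point in proportion to weight times likelihood, and updates the mixture weights with a smoothed count rather than a true Dirichlet draw. The function names, the starting models, and the 0.1 floor on the standard deviation are all hypothetical choices made for the example.

```python
import math
import random

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def dirichlet_pass(points, models, weights, alpha0, rng):
    """One pass: sample a model for each point, then re-estimate the models.

    models  : list of (mean, std) tuples, one per candidate model
    weights : current mixture weights, one per model
    alpha0  : concentration parameter (assumed here to play the role of -a0)
    """
    k = len(models)
    assigned = [[] for _ in range(k)]
    assignments = []
    for x in points:
        # Probability of each model for this point: weight * likelihood, normalised.
        scores = [w * gaussian_pdf(x, m, s) for w, (m, s) in zip(weights, models)]
        total = sum(scores)
        probs = [sc / total for sc in scores] if total > 0 else [1.0 / k] * k
        j = rng.choices(range(k), weights=probs)[0]
        assigned[j].append(x)
        assignments.append(j)

    # Re-estimate each model from the points it received in this pass.
    new_models = []
    for (mean, std), pts in zip(models, assigned):
        if len(pts) >= 2:
            mu = sum(pts) / len(pts)
            var = sum((p - mu) ** 2 for p in pts) / len(pts)
            new_models.append((mu, max(math.sqrt(var), 0.1)))  # floor avoids std == 0
        elif len(pts) == 1:
            new_models.append((pts[0], std))
        else:
            new_models.append((mean, std))  # an empty model keeps its parameters

    # Smoothed counts stand in for sampling new weights from a Dirichlet.
    n = len(points)
    new_weights = [(len(pts) + alpha0 / k) / (n + alpha0) for pts in assigned]
    return new_models, new_weights, assignments

# Toy data: two well-separated groups, three candidate models.
rng = random.Random(0)
points = [rng.gauss(0.0, 0.5) for _ in range(50)] + [rng.gauss(10.0, 0.5) for _ in range(50)]
models = [(-1.0, 3.0), (4.0, 3.0), (11.0, 3.0)]
weights = [1.0 / 3] * 3
for _ in range(10):
    models, weights, assignments = dirichlet_pass(points, models, weights, 1.0, rng)
```

Each call returns the updated models, the updated weights, and the assignment of every point, which together correspond to one "sample" in the sense used in the next paragraph.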

At the end of each pass, you get a number of samples that contain the probabilities, models, and assignment of points to models. These samples could be regarded as clusters, and they provide information about the models and their parameters, such as their shape and size. Moreover, by examining the number of models in each sample that have some points assigned to them, you can get information about how many models (clusters) the data supports. Also, by examining how often two points are assigned to the same model, you can get an approximate measure of how likely these points are to be explained by the same model. Such soft-membership information is a side product of using model-based clustering. Dirichlet clustering is able to capture the partial probabilities of points belonging to various models.
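The two diagnostics described above, the number of occupied models per sample and how often two points share a model, can be computed directly from the per-sample assignments. The sketch below assumes each sample is simply a list of model indices, one per point; the helper names and the toy samples are hypothetical and are not taken from Mahout's output format.

```python
from itertools import combinations

def occupied_model_counts(samples):
    """For each sample, count how many models have at least one point."""
    return [len(set(sample)) for sample in samples]

def co_assignment_frequency(samples):
    """Fraction of samples in which each pair of points shares a model."""
    n_points = len(samples[0])
    freq = {}
    for i, j in combinations(range(n_points), 2):
        same = sum(1 for s in samples if s[i] == s[j])
        freq[(i, j)] = same / len(samples)
    return freq

# Hypothetical samples: each list gives the model index assigned to points 0..3.
samples = [
    [0, 0, 1, 1],
    [0, 0, 1, 2],
    [2, 2, 0, 0],
    [1, 0, 2, 2],
]
counts = occupied_model_counts(samples)   # [2, 3, 2, 3]
freq = co_assignment_frequency(samples)   # freq[(0, 1)] == 0.75, freq[(0, 2)] == 0.0
```

Because only equality of model indices matters, the co-assignment frequency is unaffected when the same grouping appears under different model numbers in different samples, as in the third sample here.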

Dirichlet clustering is a powerful way of getting quality clusters using known data distribution models. In Mahout, the algorithm is a pluggable framework, so different models can be created and tested. As the models become more complex, the algorithm may slow down on huge data sets, and at that point you'll have to fall back on other clustering algorithms. But after seeing the output of Dirichlet clustering, you can clearly decide whether the algorithm you choose should be fuzzy or rigid, overlapping or hierarchical, whether the distance measure should be Manhattan or cosine, and what the threshold for convergence should be. Dirichlet clustering is both a data-understanding tool and a great data clustering tool.

bin/mahout dirichlet \
  -i mahout/reuters-vectors/tfidf-vectors \
  -o mahout/reuters-dirichlet-clusters \
  -k 60 -x 10 -a0 1.0 \
  -md org.apache.mahout.clustering.dirichlet.models.GaussianClusterDistribution \
  -mp org.apache.mahout.math.SequentialAccessSparseVector