Clustering Uncertain Data Based on Probability Distribution Similarity
IEEE Transactions on Knowledge and Data Engineering
Bin Jiang et al.
Key Points
- Common clustering algorithms cluster objects by their coordinates, but when an object is itself a collection of data, the object's inner distribution should be considered when clustering. For example, if two items are both rated 4.5 stars but the votes have very different distributions, we may want to cluster them into different categories.
- The overall idea remains similar to K-means, but here the authors use the Kullback-Leibler divergence (relative entropy) to measure the distance between distributions.
- Kernel density estimation is used to estimate the distributions and the KL divergence, and the fast Gauss transform is used to speed up the computation (see the sketch after this list).
- K-medoids is used as the clustering method (a sketch follows the summary below).
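
Below is a minimal sketch, not the paper's implementation, of measuring the distance between two uncertain objects via the KL divergence of their estimated densities. It uses scipy's `gaussian_kde` directly instead of the fast Gauss transform speedup described in the paper, and the Monte Carlo estimator, the function name `kl_divergence`, and the toy rating data are illustrative assumptions.

```python
# Sketch: KL divergence between two uncertain objects, each given as a 1-D
# sample of observations, using kernel density estimates of their distributions.
import numpy as np
from scipy.stats import gaussian_kde

def kl_divergence(samples_p, samples_q, eps=1e-12):
    """Monte Carlo estimate of D_KL(P || Q) from samples.

    D_KL(P || Q) = E_{x ~ P}[log p(x) - log q(x)], approximated by averaging
    over the samples drawn from P, with p and q replaced by Gaussian KDEs.
    """
    p = gaussian_kde(samples_p)
    q = gaussian_kde(samples_q)
    px = p(samples_p) + eps   # density of P evaluated at P's own samples
    qx = q(samples_p) + eps   # density of Q evaluated at P's samples
    return float(np.mean(np.log(px) - np.log(qx)))

# Two items with the same ~4.5-star average but very different vote distributions.
rng = np.random.default_rng(0)
item_a = rng.normal(4.5, 0.2, size=500)              # votes tightly around 4.5
item_b = np.concatenate([rng.normal(5.0, 0.1, 400),  # mostly 5-star votes ...
                         rng.normal(2.5, 0.3, 100)]) # ... plus a cluster of low votes

print(kl_divergence(item_a, item_b))  # clearly positive: distributions differ
print(kl_divergence(item_a, item_a))  # ~0: identical distributions
```

Note that KL divergence is asymmetric; when a distance-like measure is needed, a common choice is to symmetrize it, e.g. by averaging D(P||Q) and D(Q||P).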
This paper explores a method for clustering uncertain data based on probability distribution similarity. It builds on the traditional K-means idea, uses the Kullback-Leibler divergence to measure the distance between distributions, employs kernel density estimation and the fast Gauss transform to accelerate the computation, and finally performs the clustering with K-medoids.
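
Below is a minimal PAM-style K-medoids sketch over a precomputed distance matrix, assuming `dist[i, j]` holds a (symmetrized) KL-based distance between uncertain objects i and j. The function name `k_medoids`, the iteration scheme, and the toy usage data are illustrative assumptions, not the authors' code.

```python
# Sketch: K-medoids clustering over an arbitrary precomputed distance matrix.
import numpy as np

def k_medoids(dist, k, n_iter=100, seed=0):
    n = dist.shape[0]
    rng = np.random.default_rng(seed)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign each object to its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        # For each cluster, pick the member minimizing total distance to the rest.
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if members.size == 0:
                continue
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return labels, medoids

# Toy usage: a Euclidean distance matrix standing in for pairwise KL distances.
rng = np.random.default_rng(1)
pts = rng.normal(size=(20, 2))
dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
labels, medoids = k_medoids(dist, k=3)
print(labels, medoids)
```

K-medoids only needs pairwise distances, which is why it pairs naturally with a divergence that has no meaningful "mean" representative the way K-means requires.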