Unsupervised Classification - Sprawl Classification Algorithm

This article introduces a clustering algorithm called Sprawl Classification. Given a minimum distance d and a minimum number of connected points, the algorithm finds and separates the clusters in a dataset. It first loads all data into a total cache, then repeatedly searches for neighbouring points of each growing cluster until all data are classified. The article also discusses directions for optimisation, such as per-feature distance checks and spatial partitioning of the data, to reduce the computational cost.

Idea

Points (data) in the same cluster are near each other, or are connected through each other.
So:

  • For a distance d, every point in a cluster can always find some other points of the same cluster within d.
  • Distances between points in different clusters are larger than the distance d.
    The conditions above are not always correct, e.g. when clusters share common points, the second condition fails.
    So some improvement is needed.
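The idea can be checked on a toy dataset (an illustrative example of ours, not from the original article): within a cluster, every point has a neighbour closer than d, while the gap between clusters is larger than d.

```python
import math

# Two well-separated 2-D clusters and a threshold d = 2.
cluster_a = [(0, 0), (1, 0), (2, 0)]
cluster_b = [(10, 0), (11, 0)]
d = 2

# Largest nearest-neighbour spacing inside cluster_a (the "chain" step).
chain = max(min(math.dist(p, q) for q in cluster_a if q != p)
            for p in cluster_a)
# Smallest distance between the two clusters (the "gap").
gap = min(math.dist(p, q) for p in cluster_a for q in cluster_b)

# Every point can reach a same-cluster neighbour within d,
# while no point of the other cluster is within d.
print(chain < d < gap)
```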

Sprawl Classification Algorithm

  • Input:
    • data: Training data
    • d: The minimum distance between clusters
    • minConnectedPoints: The minimum number of connected points
  • Output:
    • Result: an array of classified data
  • Logic:
Load data into TotalCache.
i = 0
while (TotalCache.size > 0)
{
    Take any point A from TotalCache, put A into Cache2.
    Remove A from TotalCache.
    while (Cache2.size > 0)
    {
        In TotalCache, find the points 'nearPoints' whose distance to
        any point in Cache2 is less than d.
        Move the points of Cache2 into Cache1.
        Put nearPoints into Cache2.
        Remove nearPoints from TotalCache.
    }
    Add the points of Cache1 into Result[i].
    Clear Cache1.
    i++
}
Return Result

Note: As the algorithm needs to calculate distances between points, the data may need to be normalized first so that each feature has the same weight.
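The pseudocode above can be sketched in Python as follows. This is a minimal sketch, not the author's implementation: it assumes the data are equal-length numeric tuples (already normalized), and it omits the `minConnectedPoints` filter for brevity.

```python
import math

def sprawl_classify(data, d):
    """Group points so that any two points in a cluster are linked by a
    chain of neighbours, each step less than d apart (a flood-fill sweep
    following the pseudocode's TotalCache / Cache1 / Cache2 roles)."""
    total = list(data)             # TotalCache: still-unclassified points
    result = []                    # Result: one list per cluster
    while total:
        cache1 = []                # Cache1: settled points of this cluster
        cache2 = [total.pop()]     # Cache2: frontier, seeded with point A
        while cache2:
            # Points of TotalCache within d of any frontier point.
            near = [p for p in total
                    if any(math.dist(p, q) < d for q in cache2)]
            cache1.extend(cache2)  # frontier points are now settled
            for p in near:
                total.remove(p)
            cache2 = near          # the new frontier
        result.append(cache1)
    return result
```

For example, `sprawl_classify([(0, 0), (1, 0), (10, 10), (11, 10)], d=2)` yields two clusters of two points each.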

Improvement

A big problem is that the method needs too many distance calculations between points. The maximum number of calculations is \(\frac{n(n - 1)}{2}\).

Improvement ideas:

  • Checking the distance on one feature first may be quicker.
    We do not need to compute the real distance for every pair, because we only need to know whether the distance is less than \(d\):
    for points x1 and x2, the distance is greater than or equal to \(d\) whenever \(\vert x1[i] - x2[i] \vert \geqslant d\) for some feature i.
  • Split the data into multiple areas.
    For an n-dimensional (n-feature) dataset, we can split it into multiple smaller datasets, each occupying an n-dimensional space of size \(d^{n}\).
    We can imagine each small space as an n-dimensional cube adjoining its neighbours,
    so we only need to compare points in the current space and its neighbouring spaces.

Cons

  • Needs a large amount of calculation.
  • Needs improvement to handle clusters which share common points.

Reposted from: https://www.cnblogs.com/steven-yang/p/5764718.html
