Homebrewing DBSCAN in Python

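This article walks through a homemade implementation of DBSCAN, a density-based spatial clustering algorithm. It covers key concept definitions, pseudocode design, and an evaluation of time and space complexity.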

Density-based spatial clustering of applications with noise, or DBSCAN, is one mouthful of a clustering algorithm. Created in 1996, it has withstood the test of time and is still one of the most useful approaches to clustering data points today. For fun, and to broaden my horizons, I took a stab at brewing up my own DBSCAN class in Python. If you are the type to skip to the end of the book, you can see my project in its entirety here.

How does it work?

Clustering in DBSCAN is determined by categorizing data points into three types based on the relationship of each point to its neighboring points: core points, border points, and noise points. A cluster is made by identifying a core point and then determining whether each of its neighboring points is a core point or a border point. A cluster only expands beyond a single point's neighborhood when the neighboring points are themselves core points.

Figure: example 2-dimensional points labeled A, B, C, and N. Image source: https://commons.wikimedia.org/w/index.php?curid=17045963

Consider the above 2-dimensional points. Point A is an example of a core point. In this instance, a point becomes a core point when its neighborhood holds 4 points in total. Points B and C are examples of border points. Border points sit in the neighborhood of a core point, but have fewer than 4 total points in their own neighborhood. Point N is an example of a noise point. It neither belongs to a cluster nor has enough points in its own neighborhood.

Deeper Understanding

To fully understand a problem, I try to define a list of key concepts, write some loose pseudocode of the process, and spend a little time identifying time and space complexity.

Definitions

Minimum Sample: a user-defined minimum number of points that a neighborhood must contain for its central point to be a core point.

Epsilon: a user-defined distance between two points, used as the radius of a neighborhood.

Neighborhood: the space surrounding a central point, within epsilon distance of it.

Core Point: a point that has at least the minimum sample of points within its neighborhood.

Border Point: a point that belongs to at least one core point's neighborhood, but does not have the minimum sample of points within its own neighborhood.

Noise Point: a point that neither belongs to a core point's neighborhood nor has the minimum sample of points within its own neighborhood.
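
To make these definitions concrete, here is a minimal brute-force sketch that labels each point as core, border, or noise. It is only an illustration, not code from the project: the function name `classify_points`, the sample data, and the convention that a point counts toward its own neighborhood are all assumptions.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def classify_points(points, eps, min_samples):
    """Label every point 'core', 'border', or 'noise' by brute force."""
    # Neighborhood of point i: indices of all points within eps of it.
    # Assumption: the point itself is included in its own neighborhood.
    neighborhoods = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nbrs) >= min_samples for nbrs in neighborhoods]

    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nbrs):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Hypothetical sample data: a dense square, one point on its edge, one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.4, 1.0), (8.0, 8.0)]
labels = classify_points(points, eps=1.5, min_samples=4)
print(labels)  # ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point at (2.4, 1.0) only reaches one neighbor besides itself, so it cannot be a core point, but that neighbor is a core point, which makes it a border point.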

Pseudocode
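
One way to make the process concrete is the Python sketch below. It illustrates the standard DBSCAN expansion loop rather than the project's own code. The names `dbscan`, `neighbors_of`, and `NOISE`, the use of -1 for noise, and the convention that a point counts toward its own neighborhood are all assumptions.

```python
from collections import deque
from math import dist  # Euclidean distance (Python 3.8+)

NOISE = -1  # assumed label for points that end up in no cluster


def dbscan(points, eps, min_samples):
    """Return one cluster label per point: 0, 1, ... or NOISE."""

    def neighbors_of(i):
        # Brute-force neighborhood: every point within eps, point i included.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors_of(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # point i is a core point, so it starts a new cluster
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but never expands
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors_of(j)
            if len(nbrs) >= min_samples:
                queue.extend(nbrs)  # only core points expand the cluster
    return labels


# Hypothetical sample data: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.4, 1.0),
    (5.0, 5.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0),
]
labels = dbscan(points, eps=1.5, min_samples=4)
print(labels)  # [0, 0, 0, 0, 0, -1, 1, 1, 1, 1]
```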

Evaluation

Time Complexity

Each point is compared to every other point in the data set, so I can expect O(n²) time complexity. Since I am only adding two new features per point, cluster and point type, the additional space complexity is O(2n), which simplifies to O(n).

Key Observations

Clusters are made up of the neighborhoods of core points, so I know I need some modification of a nearest neighbors search function. Below is a snippet of the modified nearest neighbors search I settled on in my own class implementation.
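
As a rough illustration only, and not the project's own snippet, a neighborhood search differs from a k-nearest-neighbors search in one respect: it returns every point within epsilon instead of a fixed number of closest points. The function name `region_query` and the sample data below are assumptions.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def region_query(points, index, eps):
    """Indices of every point within eps of points[index], itself included."""
    center = points[index]
    return [i for i, p in enumerate(points) if dist(center, p) <= eps]


# Hypothetical sample data.
sample = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0), (0.5, 0.5)]
print(region_query(sample, 0, eps=1.0))  # [0, 1, 3]
```

The size of the returned list is what gets compared against the minimum sample to decide whether the query point is a core point.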

Distance is critical for figuring out the neighborhood of a point. Most datasets have more than 2 features that you may want to consider while clustering. Since I want to work with n-dimensional data points, I created a distance helper method that gives me the flexibility to work with multiple distance formulas. Working with the L2 norm opened up my neighborhood method from being strictly 2-dimensional to being fully n-dimensional.
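
A distance helper of this kind might look like the sketch below. The function name `distance`, the `metric` parameter, and the choice of L1, L2, and L-infinity norms are assumptions made for illustration, not the project's actual interface.

```python
def distance(a, b, metric="l2"):
    """Distance between two equal-length points under the chosen norm."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if metric == "l1":    # Manhattan distance: sum of absolute differences
        return sum(diffs)
    if metric == "l2":    # Euclidean distance: works for any dimension
        return sum(d * d for d in diffs) ** 0.5
    if metric == "linf":  # Chebyshev distance: largest single difference
        return max(diffs)
    raise ValueError(f"unknown metric: {metric!r}")


# The same call works for 3-dimensional points without any changes.
print(distance((0, 0, 0), (1, 2, 2)))                 # 3.0
print(distance((0, 0, 0), (1, 2, 2), metric="l1"))    # 5
print(distance((0, 0, 0), (1, 2, 2), metric="linf"))  # 2
```

Because the helper sums over however many coordinates the points have, nothing in the neighborhood search needs to know the dimensionality of the data.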

Defining Behavior with Unit Tests

When starting with any new code, I like to set up a battery of unit tests to help define my user contracts. These can be modified as I identify any bad assumptions I made, but generally speaking, I define a test and leave it. This helps keep me consistent and ensures I'm fulfilling the promises I make to the user.

Though I can't predict every way a user might use my class, I can at least identify a few things I don't want a user to do. The first three tests are all about shaping usage. The first ensures that users don't instantiate my DBSCAN class with impossible or breaking values. The second ensures that homogeneous numerical data is the only data passed in. The third ensures that users don't attempt to call a method before data has been passed in. Lastly, the fourth test is a very basic functional test, making sure that any changes I make don't break basic functionality.
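
The four kinds of test described above could be sketched with `unittest` as follows. `MiniDBSCAN` is a hypothetical stand-in written only so the tests have something to run against. Its method names, exception types, and validation rules are assumptions and do not come from the project.

```python
import unittest
from math import dist
from numbers import Real


class MiniDBSCAN:
    """Hypothetical stand-in for the class under test (not the project's code)."""

    def __init__(self, eps, min_samples):
        if eps <= 0 or min_samples < 1:
            raise ValueError("eps must be > 0 and min_samples must be >= 1")
        self.eps, self.min_samples, self.points = eps, min_samples, None

    def fit(self, points):
        for p in points:
            if not all(isinstance(v, Real) for v in p):
                raise TypeError("points must contain only numbers")
        self.points = [tuple(p) for p in points]
        return self

    def core_points(self):
        if self.points is None:
            raise RuntimeError("call fit() before asking for results")
        return [
            i for i, p in enumerate(self.points)
            if sum(dist(p, q) <= self.eps for q in self.points) >= self.min_samples
        ]


class TestMiniDBSCAN(unittest.TestCase):
    def test_rejects_impossible_parameters(self):
        with self.assertRaises(ValueError):
            MiniDBSCAN(eps=-1.0, min_samples=4)
        with self.assertRaises(ValueError):
            MiniDBSCAN(eps=1.0, min_samples=0)

    def test_rejects_non_numeric_data(self):
        with self.assertRaises(TypeError):
            MiniDBSCAN(eps=1.0, min_samples=4).fit([(0.0, "a")])

    def test_requires_data_before_use(self):
        with self.assertRaises(RuntimeError):
            MiniDBSCAN(eps=1.0, min_samples=4).core_points()

    def test_basic_functionality(self):
        data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
        model = MiniDBSCAN(eps=1.5, min_samples=4).fit(data)
        self.assertEqual(model.core_points(), [0, 1, 2, 3])


# Run the suite without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestMiniDBSCAN)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

The first three tests only pin down how the class refuses bad input, so they rarely need to change. The fourth is the one that catches regressions while the clustering logic is being reworked.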

As I implement, I tend to grow my unit tests at the same time. You can see the full battery of unit tests I settled on here.

Implementation

With well-defined unit tests and pseudocode, this is arguably the fastest and easiest part. If there is a logical problem, as there was in my very first attempt, it becomes obvious through failed tests and large gaps in the pseudocode. In my initial attempt, I was consistently overwriting clusters, but thanks to my initial groundwork I was able to figure it out quickly and change direction before investing too much time. You can see a more in-depth writeup of my initial attempt and its shortcomings here.

Evaluation

At this point, all tests are passing, and my homebrewed DBSCAN is working as expected. For delivering on an MVP, that's good enough. But that is never quite enough for me, so I decided to compare it to a popular and commonly available library.

Ouch, that's over 800 times slower than the commonly available library. It's tough to see that level of difference, much less admit it publicly. But after some research, it turns out my implementation is on par with other implementations based on array indexing. There are a few additional techniques I can explore to improve the performance of my DBSCAN: spatial indexing and bulk neighborhood calculations. Sounds like a good topic for a follow-up article.
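
To give a flavor of what spatial indexing means here, the sketch below uses a uniform grid, which is one of the simplest spatial indexes. It is an assumption-laden illustration and not necessarily the technique a follow-up would use. The names `build_grid` and `grid_region_query` are hypothetical. Points are bucketed into cells of side epsilon, so a neighborhood query only has to check the query point's cell and the cells adjacent to it instead of the whole data set.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def build_grid(points, eps):
    """Bucket point indices by the eps-sized grid cell that contains them."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(floor(v / eps) for v in p)].append(i)
    return grid


def grid_region_query(points, grid, index, eps):
    """Indices within eps of points[index], checking only adjacent cells."""
    center = points[index]
    cell = tuple(floor(v / eps) for v in center)
    found = []
    # Any point within eps differs by at most one cell along each axis.
    for offset in product((-1, 0, 1), repeat=len(cell)):
        neighbor_cell = tuple(c + o for c, o in zip(cell, offset))
        for j in grid.get(neighbor_cell, ()):
            if dist(center, points[j]) <= eps:
                found.append(j)
    return sorted(found)


# Hypothetical sample data.
points = [(0.2, 0.3), (0.9, 0.4), (1.1, 0.5), (5.0, 5.0), (5.4, 5.3)]
grid = build_grid(points, eps=1.0)
print(grid_region_query(points, grid, 0, eps=1.0))  # [0, 1, 2]
print(grid_region_query(points, grid, 3, eps=1.0))  # [3, 4]
```

The number of cells checked grows as 3 to the power of the number of dimensions, so a grid like this mainly pays off for low-dimensional data. Tree-based indexes are the usual choice beyond that.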

My full implementation can be found on my GitHub. Want to chat about it? Connect with me on LinkedIn.

Translated from: https://towardsdatascience.com/homebrewing-dbscan-in-python-dcae23f6010