A Tutorial on Clustering Algorithms

本文介绍聚类这一重要的无监督学习问题,探讨了不同类型的聚类方法及其应用领域,并详细解释了四种常用聚类算法:K-means、Fuzzy C-means、层次聚类和高斯混合模型。

A Tutorial on Clustering Algorithms


Clustering: An Introduction

What is Clustering?
Clustering can be considered the most important unsupervised learning problem; so, as every other problem of this kind, it deals with finding a structure in a collection of unlabeled data.
A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”.
A cluster is therefore a collection of objects which are “similar” between them and are “dissimilar” to the objects belonging to other clusters.
We can show this with a simple graphical example:

In this case we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if this one defines a concept common to all that objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

 

The Goals of Clustering
So, the goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how to decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user which must supply this criterion, in such a way that the result of the clustering will suit their needs.
For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describe their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes) or in finding unusual data objects (outlier detection).

 

Possible Applications
Clustering algorithms can be applied in many fields, for instance:

  • Marketing: finding groups of customers with similar behavior given a large database of customer data containing their properties and past buying records;
  • Biology: classification of plants and animals given their features;
  • Libraries: book ordering;
  • Insurance: identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds;
  • City-planning: identifying groups of houses according to their house type, value and geographical location;
  • Earthquake studies: clustering observed earthquake epicenters to identify dangerous zones;
  • WWW: document classification; clustering weblog data to discover groups of similar access patterns.

 

Requirements
The main requirements that a clustering algorithm should satisfy are:

  • scalability;
  • dealing with different types of attributes;
  • discovering clusters with arbitrary shape;
  • minimal requirements for domain knowledge to determine input parameters;
  • ability to deal with noise and outliers;
  • insensitivity to order of input records;
  • high dimensionality;
  • interpretability and usability.

 

Problems
There are a number of problems with clustering. Among them:

  • current clustering techniques do not address all the requirements adequately (and concurrently);
  • dealing with large number of dimensions and large number of data items can be problematic because of time complexity;
  • the effectiveness of the method depends on the definition of “distance” (for distance-based clustering);
  • if an obvious distance measure doesn’t exist we must “define” it, which is not always easy, especially in multi-dimensional spaces;
  • the result of the clustering algorithm (that in many cases can be arbitrary itself) can be interpreted in different ways.

Clustering Algorithms

Classification
Clustering algorithms may be classified as listed below:

  • Exclusive Clustering
  • Overlapping Clustering
  • Hierarchical Clustering
  • Probabilistic Clustering

In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it could not be included in another cluster. A simple example of that is shown in the figure below, where the separation of points is achieved by a straight line on a bi-dimensional plane.
On the contrary the second type, the overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. In this case, data will be associated to an appropriate membership value.

Instead, a hierarchical clustering algorithm is based on the union between the two nearest clusters. The beginning condition is realized by setting every datum as a cluster. After a few iterations it reaches the final clusters wanted.
Finally, the last kind of clustering use a completely probabilistic approach.

In this tutorial we propose four of the most used clustering algorithms:

  • K-means
  • Fuzzy C-means
  • Hierarchical clustering
  • Mixture of Gaussians

Each of these algorithms belongs to one of the clustering types listed above. So that, K-means is an exclusive clustering algorithm, Fuzzy C-means is an overlapping clustering algorithm, Hierarchical clustering is obvious and lastly Mixture of Gaussian is a probabilistic clustering algorithm. We will discuss about each clustering method in the following paragraphs.

 

Distance Measure
An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading. Figure shown below illustrates this with an example of the width and height measurements of an object. Despite both measurements being taken in the same physical units, an informed decision has to be made as to the relative scaling. As the figure shows, different scalings can lead to different clusterings.

Notice however that this is not only a graphic issue: the problem arises from the mathematical formula used to combine the distances between the single components of the data feature vectors into a unique distance measure that can be used for clustering purposes: different formulas leads to different clusterings.
Again, domain knowledge must be used to guide the formulation of a suitable distance measure for each particular application.

Minkowski Metric
For higher dimensional data, a popular measure is the Minkowski metric,

where d is the dimensionality of the data. The Euclidean distance is a special case where p=2, while Manhattan metric has p=1. However, there are no general theoretical guidelines for selecting a measure for any given application.

It is often the case that the components of the data feature vectors are not immediately comparable. It can be that the components are not continuous variables, like length, but nominal categories, such as the days of the week. In these cases again, domain knowledge must be used to formulate an appropriate measure.

 

Bibliography


来自于:http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/index.html

Next Page

源码来自:https://pan.quark.cn/s/7a757c0c80ca 《在Neovim中运用Lua的详尽教程》在当代文本编辑器领域,Neovim凭借其卓越的性能、可扩展性以及高度可定制的特点,赢得了程序开发者的广泛青睐。 其中,Lua语言的融入更是为Neovim注入了强大的活力。 本指南将深入剖析如何在Neovim中高效地运用Lua进行配置和插件开发,助你充分发挥这一先进功能的潜力。 一、Lua为何成为Neovim的优选方案经典的Vim脚本语言(Vimscript)虽然功能完备,但其语法结构与现代化编程语言相比显得较为复杂。 与此形成对比的是,Lua是一种精简、轻量且性能卓越的脚本语言,具备易于掌握、易于集成的特点。 因此,Neovim选择Lua作为其核心扩展语言,使得配置和插件开发过程变得更加直观和便捷。 二、安装与设置在Neovim中启用Lua支持通常十分简便,因为Lua是Neovim的固有组件。 然而,为了获得最佳体验,我们建议升级至Neovim的最新版本。 可以通过`vim-plug`或`dein.vim`等包管理工具来安装和管理Lua插件。 三、Lua基础在着手编写Neovim的Lua配置之前,需要对Lua语言的基础语法有所掌握。 Lua支持变量、函数、控制流、表(类似于数组和键值对映射)等核心概念。 它的语法设计简洁明了,便于理解和应用。 例如,定义一个变量并赋值:```lualocal myVariable = "Hello, Neovim!"```四、Lua在Neovim中的实际应用1. 配置文件:Neovim的初始化文件`.vimrc`能够完全采用Lua语言编写,只需在文件首部声明`set runtimepath^=~/.config/nvim ini...
基于STM32 F4的永磁同步电机无位置传感器控制策略研究内容概要:本文围绕基于STM32 F4的永磁同步电机(PMSM)无位置传感器控制策略展开研究,重点探讨在不使用机械式位置传感器的情况下,如何通过算法实现对电机转子位置和速度的精确估算与控制。文中结合STM32 F4高性能微控制器平台,采用如滑模观测器(SMO)、扩展卡尔曼滤波(EKF)或高频注入法等先进观测技术,实现对电机反电动势或磁链的实时估算,进而完成磁场定向控制(FOC)。研究涵盖了控制算法设计、系统建模、仿真验证(可能使用Simulink)以及在嵌入式平台上的代码实现与实验测试,旨在提高电机驱动系统的可靠性、降低成本并增强环境适应性。; 适合人群:具备一定电机控制理论基础和嵌入式开发经验的电气工程、自动化及相关专业的研究生、科研人员及从事电机驱动开发的工程师;熟悉C语言和MATLAB/Simulink工具者更佳。; 使用场景及目标:①为永磁同步电机驱动系统在高端制造、新能源汽车、家用电器等领域提供无位置传感器解决方案的设计参考;②指导开发者在STM32平台上实现高性能FOC控制算法,掌握位置观测器的设计与调试方法;③推动电机控制技术向低成本、高可靠方向发展。; 其他说明:该研究强调理论与实践结合,不仅包含算法仿真,还涉及实际硬件平台的部署与测试,建议读者在学习过程中配合使用STM32开发板和PMSM电机进行实操验证,以深入理解控制策略的动态响应与鲁棒性问题。
先看效果: https://pan.quark.cn/s/21391ce66e01 企业级办公自动化系统,一般被称为OA(Office Automation)系统,是企业数字化进程中的关键构成部分,旨在增强组织内部的工作效能与协同水平。 本资源提供的企业级办公自动化系统包含了详尽的C#源代码,涉及多个技术领域,对于软件开发者而言是一份极具价值的参考资料。 接下来将具体介绍OA系统的核心特性、关键技术以及在实践操作中可能涉及的技术要点。 1. **系统构造** - **三层构造**:大型OA系统普遍采用典型的三层构造,包含表现层、业务逻辑层和数据访问层。 这种构造能够有效分离用户交互界面、业务处理过程和数据存储功能,从而提升系统的可维护性与可扩展性。 2. **C#编程语言** - **C#核心**:作为开发语言,C#具备丰富的类库和语法功能,支持面向对象编程,适用于开发复杂的企业级应用。 - **.NET Framework**:C#在.NET Framework环境中运行,该框架提供了大量的类库与服务,例如ASP.NET用于Web开发,Windows Forms用于桌面应用。 3. **控件应用** - **WinForms**或**WPF**:在客户端,可能会使用WinForms或WPF来设计用户界面,这两者提供了丰富的控件和可视化设计工具。 - **ASP.NET Web Forms/MVC**:对于Web应用,可能会使用ASP.NET的Web Forms或MVC模式来构建交互式页面。 4. **数据库操作** - **SQL Server**:大型OA系统通常采用关系型数据库管理系统,如SQL Server,用于存储和处理大量数据。 - **ORM框架**:如Ent...
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值