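Hierarchical Clustering
1 Description
Hierarchical clustering builds a hierarchy of groups by repeatedly merging the two most similar groups, with every group starting out as a single element. In each iteration the algorithm computes the distance between every pair of groups and merges the closest pair into a new group, until only one group remains.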
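2 Python prerequisites
(1) Reading file data
file(filename[, mode[, bufsize]]) is the constructor of the file type (Python 2 only) and takes the same arguments as the built-in open() function. Calling it directly to open a file is generally not recommended; use open() instead.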
Constructor function for the file type, described further in section File Objects. The constructor’s arguments are the same as those of the open() built-in function described below.
When opening a file, it’s preferable to use open() instead of invoking this constructor directly. file is more suited to type testing (for example, writing isinstance(f, file)).
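Note: when filename is a path, you may see "IOError: [Errno 22] invalid mode ('r') or filename:". This happens because backslashes ('\') in the path are interpreted as escape sequences. Prefixing the path string literal with r (making it a raw string) avoids the problem.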
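As an illustration of the note above, here is a minimal sketch of a raw-string path and of reading a file with open(). The Windows path, the helper name read_lines and the demo file contents are all made up for this example; the demo file is written to a temporary directory so the snippet runs anywhere.

```python
import os
import tempfile

# Raw string: the backslashes stay literal instead of starting escapes
# such as \n (newline) or \t (tab). Hypothetical path, never opened here.
windows_path = r'C:\new\table.txt'


def read_lines(filename):
    """Return the lines of a text file without their trailing newlines."""
    # open() is the recommended way to open a file; the with-block
    # closes the file even if an error occurs while reading.
    with open(filename, 'r') as f:
        return [line.rstrip('\n') for line in f]


# Demo: write a small tab-separated file to a temp directory, then read it.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo.txt')
with open(demo_path, 'w') as f:
    f.write('name\tx\ty\n')
    f.write('a\t1\t2\n')

lines = read_lines(demo_path)
print(lines)
```

Without the r prefix, the same literal would contain a newline and a tab character, which is the kind of path the error message complains about.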
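(2) Tidying text data
pList = string.strip().split('\t') strips leading and trailing whitespace (including '\t' and spaces) from the string, then splits it on tab characters into a list.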
Return a copy of the string with leading and trailing characters removed. If chars is omitted or None, whitespace characters are removed. If given and not None, chars must be a string; the characters in the string will be stripped from the both ends of the string this method is called on.
string.split(s[, sep[, maxsplit]]) splits s into a list of words.
Return a list of the words of the string s. If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed). If the second argument sep is present and not None, it specifies a string to be used as the word separator. The returned list will then have one more item than the number of non-overlapping occurrences of the separator in the string. The optional third argument maxsplit defaults to 0. If it is nonzero, at most maxsplit number of splits occur, and the remainder of the string is returned as the final element of the list (thus, the list will have at most maxsplit+1 elements).
The behavior of split on an empty string depends on the value of sep. If sep is not specified, or specified as None, the result will be an empty list. If sep is specified as any string, the result will be a list containing one element which is an empty string.
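A short sketch of the strip-then-split idiom on one line of tab-separated data. The sample line and the variable names (line, fields, name, values) are invented for illustration.

```python
# One line as it might come out of a tab-separated file (made-up data).
line = '  blog A\t3\t0\t5 \n'

# strip() removes leading/trailing whitespace, split('\t') cuts on tabs.
fields = line.strip().split('\t')

# First column is a label, the remaining columns are numeric values.
name = fields[0]
values = [float(v) for v in fields[1:]]

# Splitting an empty string depends on whether sep is given.
empty_default = ''.split()
empty_with_sep = ''.split('\t')

print(fields)
print(name, values)
```

Note that strip() also removes tabs at the very start or end of the line, so a line whose first or last field is empty loses that field with this idiom.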
(3) Class definition
class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left          # left child node
        self.right = right        # right child node
        self.distance = distance
        self.id = id
        self.vec = vec
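To show how such nodes fit together, the sketch below builds a tiny tree by hand. The class is repeated so the snippet runs on its own. The vectors, the distance value, the use of a negative id for a merged node and the helper is_leaf are assumptions made for this example, not something the notes above prescribe.

```python
class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left          # left child node
        self.right = right        # right child node
        self.distance = distance
        self.id = id
        self.vec = vec


def is_leaf(node):
    """A node with no children represents a single original element."""
    return node.left is None and node.right is None


# Two leaves, each holding one data vector (made-up values).
leaf_a = bicluster([1.0, 2.0], id=0)
leaf_b = bicluster([3.0, 4.0], id=1)

# A parent node that groups the two leaves. The averaged vector, the
# distance of 2.0 and the negative id are arbitrary illustration values.
parent = bicluster([2.0, 3.0], left=leaf_a, right=leaf_b,
                   distance=2.0, id=-1)

print(is_leaf(leaf_a), is_leaf(parent))
```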
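3 The hierarchical clustering algorithm
In effect, the algorithm builds a binary tree.
Construction:
while more than one cluster remains:
    examine every pair of clusters and find the pair with the smallest distance
    merge that pair into a new cluster
    remove the two merged clusters and add the new cluster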
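The construction loop above can be sketched as follows. This is one possible reading, not a reference implementation: the notes do not name a distance measure or say how the merged cluster's vector is formed, so Euclidean distance, element-wise averaging and negative ids for merged nodes are all assumptions here, as are the function names euclidean, hcluster and leaf_ids. The sketch also assumes at least one input row.

```python
import math


class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id
        self.vec = vec


def euclidean(v1, v2):
    """Assumed distance measure: plain Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def hcluster(rows, distance=euclidean):
    """Merge the closest pair repeatedly; return the root of the tree."""
    # Every row starts as its own cluster; leaf ids are row indices.
    clusters = [bicluster(list(row), id=i) for i, row in enumerate(rows)]
    next_id = -1  # assumed convention: merged nodes get negative ids

    while len(clusters) > 1:
        # Examine every pair and remember the closest one.
        best = (0, 1)
        closest = distance(clusters[0].vec, clusters[1].vec)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i].vec, clusters[j].vec)
                if d < closest:
                    closest = d
                    best = (i, j)

        i, j = best
        a, b = clusters[i], clusters[j]
        # Assumed merge rule: average the two vectors element-wise.
        merged_vec = [(x + y) / 2.0 for x, y in zip(a.vec, b.vec)]
        merged = bicluster(merged_vec, left=a, right=b,
                           distance=closest, id=next_id)
        next_id -= 1

        # Remove the merged pair (larger index first), add the new cluster.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)

    return clusters[0]


def leaf_ids(node):
    """Collect the ids of all leaves below a node."""
    if node.left is None and node.right is None:
        return [node.id]
    return leaf_ids(node.left) + leaf_ids(node.right)


rows = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0]]  # made-up data
root = hcluster(rows)
print(sorted(leaf_ids(root)))
```

Each pass recomputes all pairwise distances, so this version is simple rather than fast; caching distances between passes is a common refinement.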
Printing:
Traverse the tree recursively, as in an ordinary binary-tree traversal.
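A minimal sketch of such a recursive traversal, printing one node per line with indentation that reflects depth. The function name format_tree, the '-' marker for merged nodes, the labels list and the hand-built tree are assumptions made for this example.

```python
class bicluster:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left
        self.right = right
        self.distance = distance
        self.id = id
        self.vec = vec


def format_tree(node, labels, depth=0):
    """Return the tree as a list of text lines, one node per line."""
    indent = ' ' * depth
    if node.left is None and node.right is None:
        # Leaf: show the label of the original element.
        return [indent + labels[node.id]]
    # Merged node: show a marker, then recurse into both children.
    lines = [indent + '-']
    lines += format_tree(node.left, labels, depth + 1)
    lines += format_tree(node.right, labels, depth + 1)
    return lines


# Hand-built tree (made-up data): 'a' and 'b' were merged first,
# then that pair was merged with 'c'.
labels = ['a', 'b', 'c']
pair = bicluster([0.0, 0.5],
                 left=bicluster([0.0, 0.0], id=0),
                 right=bicluster([0.0, 1.0], id=1),
                 distance=1.0, id=-1)
root = bicluster([5.0, 5.25],
                 left=bicluster([10.0, 10.0], id=2),
                 right=pair, distance=13.8, id=-2)

tree_lines = format_tree(root, labels)
print('\n'.join(tree_lines))
```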
4 Drawing the dendrogram with PIL