JDPlus has implemented DPCA (density peaks clustering) in Python; the link is:
http://blog.youkuaiyun.com/jdplus/article/details/40351541
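For readers new to DPCA, the two per-point quantities the algorithm relies on can be sketched roughly as follows. This is only a minimal illustration of the definitions in the density-peaks paper (a cutoff-kernel density rho, and the distance delta to the nearest point of higher density); it is not JDPlus's code or the modified version. The names `euclidean` and `density_peaks`, and the toy `points` and `dc` values, are made up for the example.

```python
from math import sqrt


def euclidean(p, q):
    # Plain Euclidean distance between two equal-length tuples
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def density_peaks(points, dc):
    # Hypothetical helper: returns (rho, delta) for every point
    n = len(points)
    dist = [[euclidean(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    # rho_i: number of other points closer than the cutoff distance dc
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < dc)
           for i in range(n)]
    delta = []
    for i in range(n):
        # Distances to all points that are strictly denser than point i
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        # By convention the densest point gets its largest distance instead
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Toy data (assumed): two small groups, each with one central point
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (-0.1, 0.0),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
rho, delta = density_peaks(points, dc=0.12)

# Candidate centers: points where both rho and delta are large
centers = sorted(range(len(points)),
                 key=lambda i: rho[i] * delta[i], reverse=True)[:2]
```

On this toy data the two central points (indices 0 and 4) come out as the candidates, because they combine a high local density with a large distance to any denser point. Real data will usually contain ties in rho under a cutoff kernel, which is one reason implementations often use a Gaussian kernel instead.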
Since I also happened to need clustering in Python, I modified their implementation a little and pasted it below:
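# Helper functions used by the script further down. readFile is called there
# as pF.readFile, so it lives in psmfile; norm and standardize presumably
# belong to psmMath.

def norm(iterable, n):
    # L_n norm of a vector: (sum(|x_i| ** n)) ** (1 / n)
    return sum(abs(elem) ** n for elem in iterable) ** (1.0 / n)


def standardize(li):
    # Min-max scaling to [0, 1]; assumes max(li) != min(li)
    minV, maxV = min(li), max(li)
    return [float(item - minV) / (maxV - minV) for item in li]


def readFile(path):
    # Return the raw file content; the caller decodes it with .decode('utf-8').
    # I/O errors simply propagate, so no try/except that only re-raises.
    with open(path, 'rb') as f:
        return f.read()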
# -*- coding: utf-8 -*-
"""
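@paper: Clustering by fast search and find of density peaks
@summary: A density-based clustering algorithm.
K-means starts from specified cluster centers and updates them iteratively, assigning every point to its nearest center; as a result it cannot detect clusters whose data distribution is non-spherical.
DBSCAN can cluster distributions of arbitrary shape, but it requires a density threshold so that noise points below that threshold can be discarded.
This paper assumes that a cluster center is surrounded by points of lower density, and that those points are closer to this center than to any other cluster center.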
@ImplementAuthor: Shaobo
Created on Mon Oct 20 13:38:18 2014
@ModifyAuthor: Psmlbj
changed on 2016/05/26
"""
from math import exp
import random
import matplotlib.pyplot as plt
import numpy as np
import psmfile as pF
import psmMath as pM
MAX = 1000000000
def readData(filePath, rowSep, colSep, specCols={}):
    # keyCol is the primary key's column; labelCol is the class label's column
    # pointsFilter is used to filter rows to get what we want
    lines = pF.readFile(filePath).decode('utf-8')
    lines = lines.split(rowSep)
    linesMat = [row.split(colSep) for row in lines]
#