Distance And Dissimilarity

This article collects formulas for describing distances in a space. Note that some of them cannot properly be called a Distance, because they do not satisfy the metric axioms of Euclidean space (such as the triangle inequality); but since these formulas still describe, to some degree, how vectors in a space differ, they are called Dissimilarities. For ease of exposition, this article does not distinguish between Distance and Dissimilarity and treats everything as a Distance; readers should keep the distinction in mind.

For vectors U and V in an n-dimensional space, with U = [u1, u2, ..., ui, ..., un]^T and V = [v1, v2, ..., vi, ..., vn]^T, where T denotes transposition into a column vector, the Distance between U and V can be expressed in the following ways. Each measure emphasizes something different; choosing the right Distance is where your own judgment comes in.

braycurtis: the Bray-Curtis distance, $\frac{\sum_i |u_i - v_i|}{\sum_i |u_i + v_i|}$.

canberra: the Canberra distance, $\sum_i \frac{|u_i - v_i|}{|u_i| + |v_i|}$.

chebyshev: the Chebyshev distance, $\max_i |u_i - v_i|$.

cityblock: the Manhattan distance, $\sum_i |u_i - v_i|$.

correlation: the Correlation distance, $1 - \frac{(u - \bar{u}) \cdot (v - \bar{v})}{\|u - \bar{u}\|_2 \, \|v - \bar{v}\|_2}$.

cosine: the Cosine distance, $1 - \frac{u \cdot v}{\|u\|_2 \, \|v\|_2}$.

dice: the Dice dissimilarity, $\frac{C_{TF} + C_{FT}}{2 C_{TT} + C_{FT} + C_{TF}}$.

euclidean: the Euclidean distance, $\|u - v\|_2$.

hamming: the Hamming distance, the fraction of the n coordinates on which u and v differ.

jaccard: the Jaccard distance, $\frac{C_{TF} + C_{FT}}{C_{TT} + C_{FT} + C_{TF}}$.

kulsinski: the Kulsinski distance, $\frac{C_{TF} + C_{FT} - C_{TT} + n}{C_{FT} + C_{TF} + n}$.

mahalanobis: the Mahalanobis distance, $\sqrt{(u - v)^T V^{-1} (u - v)}$, where $V$ is the covariance matrix.

matching: the matching dissimilarity, $\frac{C_{TF} + C_{FT}}{n}$.

minkowski: the Minkowski distance, $\left(\sum_i |u_i - v_i|^p\right)^{1/p}$.

rogerstanimoto: the Rogers-Tanimoto dissimilarity, $\frac{2R}{C_{TT} + C_{FF} + 2R}$, where $R = C_{TF} + C_{FT}$.

russellrao: the Russell-Rao dissimilarity, $\frac{n - C_{TT}}{n}$.

sokalmichener: the Sokal-Michener dissimilarity, $\frac{2R}{S + 2R}$, where $R = C_{TF} + C_{FT}$ and $S = C_{FF} + C_{TT}$.

sokalsneath: the Sokal-Sneath dissimilarity, $\frac{2R}{C_{TT} + 2R}$, where $R = C_{TF} + C_{FT}$.

sqeuclidean: the squared Euclidean distance, $\|u - v\|_2^2$.

yule: the Yule dissimilarity, $\frac{2 \, C_{TF} \, C_{FT}}{C_{TT} \, C_{FF} + C_{TF} \, C_{FT}}$.
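As a quick sanity check on the definitions above, here is a minimal sketch (u and v are made-up example vectors) comparing a few of scipy.spatial.distance's built-in functions against direct NumPy implementations of the formulas:

```python
import numpy as np
from scipy.spatial import distance

# Two made-up example vectors.
u = np.array([1.0, 0.0, 2.0, 3.0])
v = np.array([2.0, 1.0, 0.0, 3.0])

# scipy's built-in pairwise functions...
print(distance.euclidean(u, v))       # ||u - v||_2
print(distance.cityblock(u, v))       # sum_i |u_i - v_i|
print(distance.chebyshev(u, v))       # max_i |u_i - v_i|
print(distance.cosine(u, v))          # 1 - u.v / (||u||_2 ||v||_2)
print(distance.minkowski(u, v, p=3))  # (sum_i |u_i - v_i|^3)^(1/3)

# ...and the same values straight from the formulas.
print(np.sqrt(np.sum((u - v) ** 2)))
print(np.sum(np.abs(u - v)))
print(np.max(np.abs(u - v)))
print(1 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
print(np.sum(np.abs(u - v) ** 3) ** (1 / 3))
```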

Explanation:

$C_{FF}$: the number of dimensions in which U and V are both False.

$C_{FT}$: the number of dimensions in which U is False and V is True.

$C_{TT}$: the number of dimensions in which U and V are both True.

$C_{TF}$: the number of dimensions in which U is True and V is False.

The input vectors for these four counts must consist of True/False values; if a vector is not boolean, it has to be converted first before these measures can be applied.
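For example, here is a minimal sketch (the vectors and the "> 0" threshold are made-up for illustration) of converting real-valued vectors to boolean ones, computing the four counts, and checking them against scipy's boolean dissimilarities:

```python
import numpy as np
from scipy.spatial import distance

# Made-up real-valued vectors; threshold at 0 to obtain boolean vectors.
u = np.array([0.9, 0.0, 0.4, 0.0, 1.2]) > 0
v = np.array([0.7, 0.3, 0.0, 0.0, 2.1]) > 0

# The four counts used by the boolean dissimilarities above.
ctt = np.sum(u & v)      # both True
cff = np.sum(~u & ~v)    # both False
ctf = np.sum(u & ~v)     # True in u, False in v
cft = np.sum(~u & v)     # False in u, True in v

# jaccard = (Ctf + Cft) / (Ctt + Cft + Ctf)
print((ctf + cft) / (ctt + cft + ctf))
print(distance.jaccard(u, v))   # should match

# dice = (Ctf + Cft) / (2*Ctt + Cft + Ctf)
print((ctf + cft) / (2 * ctt + cft + ctf))
print(distance.dice(u, v))      # should match
```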

 

Some of the Distance measures above are very common, such as chebyshev, cityblock, euclidean, and cosine, while others are considerably more specialized. All of them, however, quantify the similarity of vectors in some sense; if you need one and are not familiar with its formula, it is easy to look up. Which Distance measure suits which scenario in detail depends on the specific business problem.

All of the Distance measures mentioned above are implemented in professional mathematical tools such as MATLAB and R. Here I point to the scipy implementation, http://docs.scipy.org/doc/scipy-dev/reference/spatial.distance.html#module-scipy.spatial.distance, which is worth a look. In fact, given the formulas above, implementing them yourself should not be a problem either.
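For computing distances between many vectors at once, scipy also provides pdist and squareform; a minimal sketch (the observation matrix X is made-up):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Made-up data: 4 observations, 3 dimensions each.
X = np.array([[0.0, 1.0, 2.0],
              [1.0, 1.0, 0.0],
              [2.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])

# pdist returns the condensed vector of pairwise distances; the metric
# is selected by the string names listed above.
d = pdist(X, metric='cosine')

# squareform expands it into the full symmetric 4x4 distance matrix.
D = squareform(d)
print(D.round(3))
```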

Metric learning. The choice of a suitable metric is crucial for the design of a successful pattern recognition system, and the literature on the subject is therefore substantial. Some existing similarity measures are hand-crafted (e.g., [7,8]). Alternatively, there is growing interest in methods which apply learning techniques to fit similarity measures and metrics to available training data (see [9] for a comprehensive study). Most common to these techniques is the learning of a projection matrix from the data so that the Euclidean distance can perform better in the new subspace. Learning such a matrix is equivalent to learning a Mahalanobis distance in the original space. The Relevant Component Analysis (RCA) method of Bar-Hillel et al. [10] is one such example. They learn a full-rank Mahalanobis metric by using equivalence constraints on the training elements. Goldberger et al. [11] described the Neighborhood Component Analysis (NCA) approach for k-NN classification. NCA works by learning a Mahalanobis distance minimizing the leave-one-out cross-validation error of the k-NN classifier on a training set. Another method, designed for clustering by [12], also learns a Mahalanobis distance metric, here using semi-definite programming. Their method attempts to minimize the sum of squared distances between examples of the same label, while preventing the distances between differently labeled examples from falling below a lower bound. In [13] a Large Margin Nearest Neighbor (LMNN) method was proposed, which employed semi-definite learning to obtain a Mahalanobis distance metric for which any collection of k-nearest neighbors always has the same class label. Additionally, elements with different labels were separated by large margins. The Information Theoretic Metric Learning (ITML) approach of Davis et al. [14] solves a Bregman optimization problem [15] to learn a Mahalanobis distance function. The result is a fast algorithm, capable of regularization by a known prior matrix, and applicable under different types of constraints, including similarity, dissimilarity and pair-wise constraints. The Online Algorithm for Scalable Image Similarity (OASIS) [16] was proposed for online metric learning for sparse, high-dimensional elements. Unlike the previously mentioned approaches, the recent method of Nguyen and Bai [17] attempts to learn a cosine similarity, rather than learning a metric for the Euclidean distance. This was shown to be particularly effective for pair matching of face images on the Labeled Faces in the Wild (LFW) benchmark [1,18].

Similarities employing background information. The first similarity measure in a recent line of work, designed to utilize background information, is the One-Shot-Similarity (OSS) of [3,2]. Given two vectors I and J, their OSS score is computed by considering a training set of background sample vectors N. This set of vectors contains examples of items different from both I and J, but is otherwise unlabeled. We review the OSS score in detail in Sec. 2.1. The OSS has been shown to be key in amplifying performance on the LFW data set. Here, we extend the OSS approach by deriving a metric learning scheme for emphasizing the separation between same and not-same vectors when compared using the OSS.
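A side note on the equivalence mentioned above: computing the Euclidean distance after a learned projection L is the same as computing a Mahalanobis-style distance with matrix M = L^T L in the original space. A minimal sketch (L, u and v are made-up random data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up projection matrix and vectors.
L = rng.normal(size=(2, 4))   # projects from 4-d down to 2-d
u = rng.normal(size=4)
v = rng.normal(size=4)

# Euclidean distance in the projected space...
d_projected = np.linalg.norm(L @ u - L @ v)

# ...equals the Mahalanobis-style distance with M = L^T L
# in the original space, since ||L(u-v)||^2 = (u-v)^T L^T L (u-v).
M = L.T @ L
diff = u - v
d_mahalanobis = np.sqrt(diff @ M @ diff)

print(d_projected, d_mahalanobis)  # identical up to floating-point error
```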