When you want to measure the similarity between two objects, such as documents or images, you inevitably run into important questions: “What is the right way to define similarity?”, “How can we measure it?”, and “How should we analyze the similarity metric?”. I have been fascinated by the concept of similarity for a long time, which is why I want to share some insights I have gained over the years. I cannot cover every detail in one article, but I will try to shed as much light on the topic as I can.
Anything less than identical is subject to interpretation
As humans, we have a complex system for interpreting similarity. The concept of similarity, however it is measured, varies across contexts and problems. If two objects are identical, there is no room for interpretation. Anything less than identical, however, is subject to interpretation. An object can be anything: a time series, a document, or an image.
A human can apply processing steps such as rotation or translation whenever the problem requires it. An AI solution cannot do this without explicit programming and training. For example, how similar do you think the images below are, if at all? If similarity means containing the same elements, they are identical; otherwise, they are completely different.

The rabbit-duck illusion is another interesting example. The same shape viewed at a different angle can be interpreted differently. This shows how much angle, or rotation, matters in the way humans interpret an image. You must therefore always be aware of how sensitive an algorithm is to the orientation of its input data. Sometimes you need a rotation-invariant algorithm, and sometimes a rotation-variant one.
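To make the rotation-variant versus rotation-invariant distinction concrete, here is a minimal sketch on 2-D point sets. The helper names (`rotate`, `pointwise_distance`, `radial_signature`, `signature_distance`) and the choice of "sorted distances from the centroid" as an invariant descriptor are my own illustrative assumptions, not a standard method from any library.

```python
import math

def rotate(points, angle):
    """Rotate 2-D points counter-clockwise about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def pointwise_distance(a, b):
    """Rotation-variant: mean distance between corresponding points."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def radial_signature(points):
    """Sorted distances from the centroid; a rotation leaves these unchanged."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sorted(math.dist((x, y), (cx, cy)) for x, y in points)

def signature_distance(a, b):
    """Rotation-invariant: largest gap between the two radial signatures."""
    pairs = zip(radial_signature(a), radial_signature(b))
    return max(abs(p - q) for p, q in pairs)

shape = [(0, 0), (4, 0), (4, 1), (0, 1)]   # a 4-by-1 rectangle
turned = rotate(shape, math.pi / 2)        # the same rectangle, turned 90 degrees

print(pointwise_distance(shape, turned))   # clearly above zero
print(signature_distance(shape, turned))   # zero up to floating-point error
```

The first measure says the two rectangles differ; the second says they are the same. Neither is wrong. Note that the radial signature discards a lot of information, so very different shapes can share it; it only illustrates the idea of invariance.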
We compare objects based on different characteristics
As noted above, anything less than identical is subject to interpretation. The question now is “What are the main characteristics that define similarity?”. If we can spell out the aspects of similarity we want to measure, the problem becomes much easier to formulate. Since the first section used image data, this section uses text data to show that the high-level concepts apply to any type of data.
In text processing, there are three main types of similarity measures: lexical (form-related), syntactic (structure-related), and semantic (meaning-related). Using one of these measures does not diminish the significance of the others. Two sentences can be similar in form but very different in meaning.
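A small sketch shows how form and meaning can disagree. The function `word_jaccard` is a hypothetical helper I wrote for illustration; it compares sentences purely as sets of words, which is one simple lexical measure among many.

```python
def word_jaccard(a, b):
    """Lexical similarity: shared words divided by all distinct words."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty sentences are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

# Same words, opposite meaning.
print(word_jaccard("the dog bit the man", "the man bit the dog"))          # 1.0

# Close in meaning, but few shared words.
print(word_jaccard("the film was great", "the movie was excellent"))
```

The first pair scores a perfect 1.0 even though the two sentences describe opposite events, while the second pair scores low despite saying almost the same thing. A lexical measure alone cannot see either fact.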
The simplest textual data is a string: a series of letters that may or may not have meaning. The similarity between two strings can be measured with different approaches, such as edit-distance-based or token-based methods. I do not want to explain how to implement these methods. What I want to emphasize is that even the simplest form of similarity can be measured in different ways. The longer the data, the more complex the concept of similarity becomes. Be prepared, then, for a more complex notion of similarity when you work with longer data, as in document similarity.
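For readers who want something concrete, the sketch below puts an edit-distance-based measure next to a token-based one. It is only an illustration under my own assumptions: the edit distance is the classic Levenshtein distance, the tokens are character bigrams scored with the Dice coefficient, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def bigrams(s):
    """The set of adjacent character pairs in s."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_dice(a, b):
    """Token-based similarity: Dice coefficient over character bigrams."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

a, b = "john smith", "smith john"
print(edit_similarity(a, b))   # low: almost every position differs
print(bigram_dice(a, b))       # high: most bigrams are shared
```

The two measures disagree on the same pair of strings because one cares about position and the other only about shared fragments. Which one is "right" depends on whether word order matters in your problem.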
The behavior of the similarity metric must be analyzed properly
The complexity arises when you want to measure similarity with a mathematical equation. There are various methods for calculating the similarity or distance between two objects, which I touched on briefly in the previous section. In this section, I want to describe why it is important to analyze the behavior of a similarity metric, regardless of how it is calculated.
In general, we can use similarity and distance interchangeably, with some care. Similarity and distance are opposites: if two objects are very similar, the distance between them is small. That is why in many cases similarity(x, y) = 1 - distance(x, y) or similarity(x, y) = 1 / distance(x, y). Each of these conversions gives the similarity metric a different sensitivity, which becomes crucial when you use it in a machine learning algorithm.
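The difference in sensitivity is easy to see numerically. The sketch below applies both conversions to the same distances; the function names are hypothetical, and I assume the distances are already scaled to the interval (0, 1] so that both formulas are defined.

```python
def linear_similarity(d):
    """similarity = 1 - distance; assumes the distance is scaled to [0, 1]."""
    return 1.0 - d

def inverse_similarity(d):
    """similarity = 1 / distance; undefined at d = 0 and unbounded near it."""
    if d <= 0:
        raise ValueError("distance must be positive for 1 / distance")
    return 1.0 / d

for d in (0.1, 0.2, 0.8, 0.9):
    print(f"d={d:.1f}  1-d={linear_similarity(d):.2f}  1/d={inverse_similarity(d):.2f}")
```

Moving from a distance of 0.1 to 0.2 changes 1 - d by 0.1 but changes 1 / d by 5. Moving from 0.8 to 0.9 again changes 1 - d by 0.1, while 1 / d changes by only about 0.14. The linear form treats every step equally; the inverse form exaggerates differences between near neighbors and flattens differences between far ones.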
A small change in each dimension of a data point represented as a vector may significantly skew the vector in a high-dimensional space, whereas the same change skews a vector in a two-dimensional space relatively little.
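One way to see this effect is to fix the length of the vector and apply the same small change to every coordinate. The sketch below does that under assumptions of my own: the vector has unit length with equal components, the dimension `n` is even, and the signs of the change alternate so that the perturbation is orthogonal to the original vector. With those choices the cosine between the original and the perturbed vector works out to 1 / sqrt(1 + n * delta^2).

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def perturbed_cosine(n, delta):
    """Cosine between a unit vector in n dimensions and a copy whose every
    coordinate is shifted by +delta or -delta (alternating; n should be even)."""
    x = [1.0 / math.sqrt(n)] * n
    y = [xi + (delta if i % 2 == 0 else -delta) for i, xi in enumerate(x)]
    return cosine(x, y)

for n in (2, 10, 100, 1000):
    print(n, round(perturbed_cosine(n, 0.05), 4))
```

With a per-coordinate change of 0.05, the two-dimensional vector barely moves, while the 1000-dimensional one ends up at a cosine of roughly 0.53 to its original direction. The individual changes are equally small in both cases; there are simply many more of them.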
A similarity metric behaves differently depending on the dimension of the space in which objects are represented. For example, if the cosine similarity between two objects in a two-dimensional space is 0.9, you may call those points similar. However, you may not be able to draw the same conclusion in a high-dimensional space with that threshold. In addition, if cos(x, y) = cos(x, z) = 0.9, you can conclude very little about the relationship between y and z in a high-dimensional space. In a two-dimensional space, by contrast, you can draw useful conclusions from the geometry. These observations show that cosine similarity must be used differently depending on the dimension of the space.
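The cos(x, y) = cos(x, z) = 0.9 case can be checked with a small construction. In the sketch below, which is my own illustration with hypothetical names, x is the first axis and y and z are unit vectors that both sit at cosine `c` from it. The angle `phi` controls how the parts of y and z that are orthogonal to x relate to each other. In a plane, only phi = 0 and phi = pi are possible; one extra dimension already allows every value in between.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def neighbours_at(c, phi):
    """Two 3-D unit vectors y and z, each with cosine c to x = (1, 0, 0).
    phi is the angle between their components orthogonal to x."""
    s = math.sqrt(1.0 - c * c)
    y = (c, s, 0.0)
    z = (c, s * math.cos(phi), s * math.sin(phi))
    return y, z

x = (1.0, 0.0, 0.0)
for phi in (0.0, math.pi / 2, math.pi):
    y, z = neighbours_at(0.9, phi)
    print(round(cosine(x, y), 3), round(cosine(x, z), 3), round(cosine(y, z), 3))
```

In two dimensions, y and z are either the same direction (cosine 1) or mirror images around x (cosine 2 * 0.9^2 - 1 = 0.62), so knowing their similarity to x tells you a lot. With more dimensions, cos(y, z) can land anywhere between 0.62 and 1, and the pair is pinned down far less tightly.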
The Last Word
In machine learning, especially in clustering, you constantly work with similarity or distance metrics. For example, one of the main steps in clustering is to determine which cluster a new point belongs to. In that scenario, you calculate the similarity between the new data point and, for example, the cluster centroid, and compare it with a threshold.
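The assignment step described above can be sketched in a few lines. This is a minimal illustration, not a full clustering algorithm: I assume cosine similarity as the metric, centroids that are already known, and a hypothetical function name `assign_cluster` that returns `None` when no centroid is similar enough.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def assign_cluster(point, centroids, threshold):
    """Index of the most similar centroid, or None if none reaches the threshold."""
    best_index, best_similarity = None, threshold
    for index, centroid in enumerate(centroids):
        similarity = cosine_similarity(point, centroid)
        if similarity >= best_similarity:
            best_index, best_similarity = index, similarity
    return best_index

centroids = [(1.0, 0.0), (0.0, 1.0)]
print(assign_cluster((0.9, 0.1), centroids, threshold=0.8))   # 0
print(assign_cluster((1.0, 1.0), centroids, threshold=0.8))   # None
```

The second point sits exactly between the two centroids, with a cosine of about 0.71 to each, so it falls below the threshold and stays unassigned. Whether 0.8 is a sensible threshold depends on the metric and on the dimension of the data, which is exactly why the metric's behavior needs to be analyzed first.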
In many applications you can simply use Euclidean distance or cosine distance. However, I strongly suggest studying the concept of similarity in depth if you want to become an expert in data science. I cannot say it will solve all of your problems on its own, but it is an important tool in your toolbox.
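Even these two common choices can disagree about which point is the nearest neighbor. The short sketch below, with function names of my own choosing and cosine distance taken as 1 minus cosine similarity, compares them on three hand-picked points.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.dist(u, v)

def cosine_distance(u, v):
    """1 - cosine similarity; depends only on direction, not on length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

a = (1.0, 1.0)
b = (10.0, 10.0)   # same direction as a, much longer
c = (1.0, 0.0)     # close to a, different direction

print(euclidean_distance(a, b), euclidean_distance(a, c))
print(cosine_distance(a, b), cosine_distance(a, c))
```

By Euclidean distance, c is much closer to a than b is. By cosine distance, b is essentially identical to a and c is the farther one. Picking a metric is therefore also a decision about whether magnitude should count.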
We sometimes rush into implementing complex methods and spend too little time learning their high-level descriptions. It is great to be able to write code quickly to, for example, calculate document similarity or extract document topics. However, if we do not understand the underlying details, we may be misled later. I tried to shed light on deep learning in the following article. I hope you enjoy it. Kudos 😊
Translated from: https://towardsdatascience.com/are-these-similar-enough-a7466d4a745c