import Levenshtein as lvst
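# Similarity based on edit distance (Levenshtein distance)
def leven_distance(s1, s2):
    # Two empty strings are identical; this also avoids dividing by zero.
    if not s1 and not s2:
        return 1.0
    dis = lvst.distance(s1, s2)
    # 1 - (edit distance / length of the longer string)
    sim = 1 - dis / max(len(s1), len(s2))
    return sim
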
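# Similarity based on the Dice coefficient, computed over sets of characters
# (repeated characters count once)
def dice_distance(s1, s2):
    s1 = set(s1)
    s2 = set(s2)
    # Two empty strings are identical; this also avoids dividing by zero.
    if not s1 and not s2:
        return 1.0
    overlap = len(s1 & s2)
    sim = overlap * 2.0 / (len(s1) + len(s2))
    return sim
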
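# Similarity based on the Jaccard coefficient, computed over sets of characters
def jaccard_distance(s1, s2):
    s1 = set(s1)
    s2 = set(s2)
    # Two empty strings are identical; this also avoids dividing by zero.
    if not s1 and not s2:
        return 1.0
    # |intersection| / |union|
    sim = len(s1 & s2) / len(s1 | s2)
    return sim
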
if __name__ == "__main__":
    s1 = 'kitten'
    s2 = 'sitting'
    s3 = '年纪'
    s4 = '年龄'
    # Compare the two Chinese words, which share one of their two characters.
    res1 = leven_distance(s3, s4)
    res = jaccard_distance(s3, s4)
    print(res1)
    print(res)

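This article introduces several common methods for computing text similarity: edit distance (Levenshtein distance), the Dice coefficient, and the Jaccard coefficient. Code examples show how to compute the similarity between two strings with each method.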
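
For readers who do not have the `Levenshtein` package installed, the edit distance can also be computed with a short dynamic-programming routine. The sketch below is an illustration, not the package's implementation: the names `levenshtein` and `similarity_scores` are hypothetical helpers introduced here, and the Dice and Jaccard parts assume the same character-set definition used in the functions above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance using the classic two-row dynamic-programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity_scores(s1: str, s2: str) -> dict:
    """Return all three similarity scores for a pair of strings."""
    if not s1 and not s2:
        # Two empty strings are treated as identical (an assumption).
        return {"levenshtein": 1.0, "dice": 1.0, "jaccard": 1.0}
    c1, c2 = set(s1), set(s2)
    overlap = len(c1 & c2)
    return {
        "levenshtein": 1 - levenshtein(s1, s2) / max(len(s1), len(s2)),
        "dice": 2.0 * overlap / (len(c1) + len(c2)),
        "jaccard": overlap / len(c1 | c2),
    }


if __name__ == "__main__":
    for pair in [("kitten", "sitting"), ("年纪", "年龄")]:
        print(pair, similarity_scores(*pair))
```

Because the Dice and Jaccard scores here work on sets of characters, they ignore character order and repetition, while the edit-distance score is sensitive to both. Comparing the three scores for the same pair makes that difference visible.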
