原文http://d3s.mff.cuni.cz/~holub/sw/shash/#a1
传统的hash函数能够将一样的文本生成一样的hash函数,但是,通过simhash方法,能够差不多相同的文档得到的hash函数也比较相近。
Charikar's hash
通过Charikar‘s hash,能够将比较相似度的文档得到比较相近的fingerprint。
该算法的流程如下:
* Document is split into tokens (words for example)
or super-tokens (word tuples)
* Each token is represented by its hash value; a traditional
hash function is used
* Weights are associated with tokens
* A vector V of integers is initialized to 0, length of the vector
corresponds to the desired hash size in bits
* In a cycle for all token's hash values (h), vector V is updated:
o ith element is decreased by token's weight if the ith bit of
the hash h is 0, otherwise
o ith element is increased by token's weight if the ith bit of
the hash h is 1
* Finally, signs of elements of V corresponds to the bits of the
final fingerprint
该hash不是将文档总体计算hash值,而是将文档中的每个token计算哈希值,对文档中每个token的hash值,按照位 对hash值进行求和,如果当前token的hash值在该位上是0,则减去1,如果在该位上是1,则加上1.将所有的token按照这种方式累加,求的最终的值作为fingerprint。
python对应的代码如下:
#!/usr/bin/python # Implementation of Charikar simhashes in Python # See: http://dsrg.mff.cuni.cz/~holub/sw/shash/#a1 class simhash(): def __init__(self, tokens='', hashbits=128): self.hashbits = hashbits self.hash = self.simhash(tokens) def __str__(self): return str(self.hash) def __long__(self): return long(self.hash) def __float__(self): return float(self.hash) def simhash(self, tokens): # Returns a Charikar simhash with appropriate bitlength v = [0]*self.hashbits for t in [self._string_hash(x) for x in tokens]: bitmask = 0 print (t) for i in range(self.hashbits): bitmask = 1 << i print(t,bitmask, t & bitmask) if t & bitmask: v[i] += 1 #查看当前bit位是否为1,是的话则将该位+1 else: v[i] –= 1 #否则得话,该位减1 fingerprint = 0 for i in range(self.hashbits): if v[i] >= 0: fingerprint += 1 << i
#整个文档的fingerprint为最终各个位大于等于0的位的和
return fingerprint
def _string_hash(self, v):
# A variable-length version of Python's builtin hash
if v == "":
return 0
else:
x = ord(v[0])<<7
m = 1000003
mask = 2**self.hashbits-1
for c in v:
x = ((x*m)^ord(c)) & mask
x ^= len(v)
if x == -1:
x = -2
return x
def hamming_distance(self, other_hash):
x = (self.hash ^ other_hash.hash) & ((1 << self.hashbits) - 1)
tot = 0
while x:
tot += 1
x &= x-1
return tot
def similarity(self, other_hash):
a = float(self.hash)
b = float(other_hash)
if a>b: return b/a
return a/b
if __name__ == '__main__':
s = 'This is a test string for testing'
hash1 =simhash(s.split())
#print("0x%x" % hash1)
#print ("%s/t0x%x" % (s, hash1))
s = 'This is a test string for testing also!'
hash2 = simhash(s.split())
#print ("%s/t[simhash = 0x%x]" % (s, hash2))
print (hash1.similarity(hash2), "percent similar")
print (hash1.hamming_distance(hash2), "bits differ out of", hash1.hashbits)
上述代码运行的结果如下:
0.97584098811 percent similar
14 bits differ out of 128
本文介绍了一种名为 Charikar's Hash 的相似度检测算法,并提供了 Python 实现示例。该算法通过将文档分割成 token 并计算每个 token 的 hash 值来生成文档指纹,用于评估文档间的相似度。
550

被折叠的 条评论
为什么被折叠?



