Using simhash for text deduplication

License: CC 4.0 BY-SA


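This article introduces Charikar's hash, a similarity-detection algorithm, and provides a sample Python implementation. The algorithm splits a document into tokens and hashes each token to build a document fingerprint, which is then used to assess how similar two documents are.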


Original article: http://d3s.mff.cuni.cz/~holub/sw/shash/#a1

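A traditional hash function maps identical texts to identical hash values, but it says nothing about texts that are merely similar: a small change in the input usually produces a completely different hash value. With simhash, nearly identical documents get hash values that are also close to each other.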
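To make the contrast concrete, here is a small sketch that uses MD5 from Python's standard `hashlib` as the "traditional" hash. The choice of MD5 and the helper names `md5_int` and `bit_difference` are assumptions made for illustration and do not come from the original article. The two strings differ only at the end, yet one would expect around half of the 128 digest bits to differ.

```python
import hashlib

def md5_int(text):
    """Return the MD5 digest of text as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")

def bit_difference(a, b):
    """Count the bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")

s1 = "This is a test string for testing"
s2 = "This is a test string for testing also!"

# Identical input always yields an identical digest.
print(md5_int(s1) == md5_int(s1))
# Near-identical input yields an unrelated digest, so many bits differ.
print(bit_difference(md5_int(s1), md5_int(s2)), "of 128 bits differ")
```

A fingerprint that is useful for near-duplicate detection should behave the opposite way: a small change in the text should flip only a few bits.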

Charikar's hash

With Charikar's hash, similar documents get similar fingerprints.

The algorithm proceeds as follows:

    * The document is split into tokens (words, for example) or super-tokens (word tuples)
    * Each token is represented by its hash value; a traditional hash function is used
    * Weights are associated with tokens
    * A vector V of integers is initialized to 0; the length of the vector corresponds to the desired hash size in bits
    * In a cycle over all tokens' hash values (h), vector V is updated:
          o the ith element is decreased by the token's weight if the ith bit of the hash h is 0, otherwise
          o the ith element is increased by the token's weight if the ith bit of the hash h is 1
    * Finally, the signs of the elements of V correspond to the bits of the final fingerprint
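The steps above mention super-tokens and per-token weights. A minimal sketch of both follows, under several assumptions that are not part of the original article: MD5 (truncated to `hashbits`, so at most 128 bits) stands in for the traditional hash, token frequency is used as the weight, a zero sum is treated as a positive sign, and the names `token_hash`, `shingles` and `weighted_simhash` are hypothetical.

```python
import hashlib
from collections import Counter

def token_hash(token, hashbits=64):
    """Hash a token to a hashbits-wide integer (truncated MD5; an assumption)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << hashbits) - 1)

def shingles(words, n=2):
    """Build super-tokens: every run of n consecutive words, joined by a space."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def weighted_simhash(weights, hashbits=64):
    """Compute a fingerprint from a {token: weight} mapping."""
    v = [0] * hashbits
    for token, weight in weights.items():
        h = token_hash(token, hashbits)
        for i in range(hashbits):
            if (h >> i) & 1:
                v[i] += weight   # bit i is 1: add the token's weight
            else:
                v[i] -= weight   # bit i is 0: subtract the token's weight
    fingerprint = 0
    for i in range(hashbits):
        if v[i] >= 0:            # a zero sum is treated as positive here
            fingerprint |= 1 << i
    return fingerprint

words = "the quick brown fox jumps over the lazy dog".split()
# Weight = how often each word occurs ("the" gets weight 2).
print(hex(weighted_simhash(Counter(words))))
# The same, but over two-word super-tokens.
print(hex(weighted_simhash(Counter(shingles(words)))))
```

Term frequency is only one possible weighting; any scheme that gives more influence to more informative tokens fits the same loop.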

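Rather than hashing the document as a whole, this method hashes every token in the document and then sums those hashes bit by bit: for each bit position, subtract 1 if the current token's hash has a 0 in that position and add 1 if it has a 1 (that is, every token gets weight 1). After all tokens have been accumulated this way, each position whose sum is non-negative yields a 1 bit in the fingerprint, and each position whose sum is negative yields a 0 bit.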
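A tiny worked example may help. The three 4-bit token hashes below are made up for illustration, and the function names `bitwise_vote` and `fingerprint_from` are hypothetical; real token hashes would be 64 or 128 bits wide.

```python
def bitwise_vote(hashes, hashbits):
    """Sum +1 / -1 per bit position over all token hashes."""
    v = [0] * hashbits
    for h in hashes:
        for i in range(hashbits):
            v[i] += 1 if (h >> i) & 1 else -1
    return v

def fingerprint_from(v):
    """Set bit i of the fingerprint when the sum at position i is >= 0."""
    fp = 0
    for i, total in enumerate(v):
        if total >= 0:
            fp |= 1 << i
    return fp

hashes = [0b1010, 0b1100, 0b0110]   # made-up 4-bit token hashes
v = bitwise_vote(hashes, 4)
print(v)                                    # [-3, 1, 1, 1] (index 0 is the lowest bit)
print(format(fingerprint_from(v), "04b"))   # 1110
```

Bit 0 is 0 in all three hashes, so its sum is -3 and the fingerprint bit is 0; each of the other positions has two 1s and one 0, giving a sum of +1 and a fingerprint bit of 1.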

The corresponding Python code is as follows:

#!/usr/bin/python

# Implementation of Charikar simhashes in Python
# See: http://dsrg.mff.cuni.cz/~holub/sw/shash/#a1

class simhash():
    def __init__(self, tokens='', hashbits=128):
        self.hashbits = hashbits
        self.hash = self.simhash(tokens)

    def __str__(self):
        return str(self.hash)

    def __int__(self):
        # Python 3 has no long type; int is arbitrary precision
        return int(self.hash)

    def __float__(self):
        return float(self.hash)

    def simhash(self, tokens):
        # Returns a Charikar simhash with appropriate bitlength
        v = [0]*self.hashbits

        for t in [self._string_hash(x) for x in tokens]:
            for i in range(self.hashbits):
                bitmask = 1 << i
                if t & bitmask:
                    v[i] += 1  # bit i of the token hash is 1: add 1 at position i
                else:
                    v[i] -= 1  # otherwise subtract 1 at position i

        fingerprint = 0
        for i in range(self.hashbits):
            if v[i] >= 0:
                fingerprint += 1 << i

        # The document's fingerprint is the sum of 1 << i over every
        # bit position i whose total is greater than or equal to 0
        return fingerprint

    def _string_hash(self, v):
        # A variable-length version of Python's builtin hash
        if v == "":
            return 0
        else:
            x = ord(v[0])<<7
            m = 1000003
            mask = 2**self.hashbits-1
            for c in v:
                x = ((x*m)^ord(c)) & mask
            x ^= len(v)
            if x == -1: 
                x = -2
            return x

    def hamming_distance(self, other_hash):
        x = (self.hash ^ other_hash.hash) & ((1 << self.hashbits) - 1)
        tot = 0
        while x:
            tot += 1
            x &= x-1
        return tot

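    def similarity(self, other_hash):
        # Ratio of the smaller fingerprint value to the larger one, in [0, 1].
        # This compares the fingerprints as numbers, not bit by bit;
        # hamming_distance gives the bit-level comparison.
        a = float(self.hash)
        b = float(other_hash)
        if a > b:
            a, b = b, a
        # Two zero fingerprints are identical; this also avoids dividing by zero
        return a / b if b else 1.0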

if __name__ == '__main__':
    s = 'This is a test string for testing'
    hash1 = simhash(s.split())
    #print("0x%x" % hash1.hash)
    #print("%s\t0x%x" % (s, hash1.hash))

    s = 'This is a test string for testing also!'
    hash2 = simhash(s.split())
    #print("%s\t[simhash = 0x%x]" % (s, hash2.hash))

    # similarity() returns a ratio in [0, 1], so scale it to a percentage
    print("%.2f percent similar" % (hash1.similarity(hash2) * 100))
    print(hash1.hamming_distance(hash2), "bits differ out of", hash1.hashbits)
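The code above stops at comparing two fingerprints. One possible way to apply the idea to deduplication is sketched below. Everything in it is an assumption made for illustration rather than part of the original article: MD5 as the token hash, 64-bit fingerprints, lowercasing plus whitespace splitting as tokenization, a threshold of 3 differing bits, and the names `simhash_fingerprint` and `deduplicate`.

```python
import hashlib

HASHBITS = 64  # fingerprint width (an assumption; MD5 allows up to 128)

def simhash_fingerprint(tokens, hashbits=HASHBITS):
    """Unweighted Charikar-style fingerprint of a token list."""
    v = [0] * hashbits
    for token in tokens:
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest(), "big")
        for i in range(hashbits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(hashbits) if v[i] >= 0)

def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def deduplicate(docs, max_distance=3):
    """Keep a document only if it is farther than max_distance from all kept ones."""
    kept = []  # (fingerprint, document) pairs, in input order
    for doc in docs:
        fp = simhash_fingerprint(doc.lower().split())
        if all(hamming_distance(fp, other) > max_distance for other, _ in kept):
            kept.append((fp, doc))
    return [doc for _, doc in kept]

docs = [
    "This is a test string for testing",
    "this is a test string for testing",   # identical after lowercasing
    "A completely unrelated sentence about databases",
]
# The second document is dropped: its fingerprint equals that of the first.
print(deduplicate(docs))
```

The threshold has to be tuned on real data, and fingerprints of very short texts are noisy because only a handful of tokens vote on each bit. This sketch also compares every new document with every kept one, which is quadratic in the worst case and so only suitable for small collections.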