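What is the Shingling algorithm
The shingling algorithm is used to measure how similar two documents are; a typical application is detecting duplicate web pages. Wikipedia defines w-shingling as follows: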
In natural language processing a w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document —that can be used to gauge the similarity of two documents. The
w denotes the number of tokens in each shingle in the set.
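
To make the definition concrete, here is a minimal Python sketch. It rests on assumptions that are not part of the quoted definition: tokens are produced by splitting on whitespace, and similarity is taken as the Jaccard coefficient of the two shingle sets (size of the intersection divided by size of the union). The function names `shingles` and `resemblance` are made up for this illustration.

```python
def shingles(text, w):
    """Return the set of unique w-token shingles in text.

    Assumption: tokens are whitespace-separated words.
    """
    tokens = text.split()
    # Slide a window of w tokens over the document; the set drops duplicates.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(doc_a, doc_b, w=2):
    """Jaccard similarity of the two documents' shingle sets (assumed measure)."""
    set_a, set_b = shingles(doc_a, w), shingles(doc_b, w)
    if not set_a and not set_b:
        return 1.0  # two documents with no shingles are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    print(shingles("a rose is a rose is a rose", 4))
    print(resemblance("a b c d", "a b c e", w=2))
```

For the text "a rose is a rose is a rose" and w = 4 there are five windows but only three distinct shingles, which is why the definition speaks of a set of unique shingles. A larger w makes the comparison stricter, because longer runs of tokens have to match exactly.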