Text Similarity Comparison Methods: Sentence-BERT, Clustering, and More

Sentence-BERT

Introduction

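Sentence-BERT (SBERT) is a deep learning model for sentence similarity and text classification tasks. Its key property is that it maps each sentence to a fixed-length vector, so that semantically similar sentences lie close together in the vector space and dissimilar ones lie far apart.
The core idea is to encode a sentence with BERT (a pretrained language model) and then apply a pooling layer that turns the per-token outputs into a single fixed-length vector. The similarity of two sentences is then typically measured as the cosine similarity between their vectors. A classification layer on top of the paired vectors appears only in some training objectives, not when sentences are compared.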
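
As a rough illustration of the pooling and comparison steps, the sketch below averages made-up token vectors into one sentence vector and compares two sentences with cosine similarity. It is pure Python and does not call BERT. The token vectors, the masks, and the helper names `mean_pool` and `cosine_similarity` are assumptions for illustration only. Mean pooling is one common choice, not necessarily the one every model uses.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask is 1, giving one fixed-length vector."""
    dim = len(token_vectors[0])
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional token vectors; the last row of each is padding (mask 0).
sentence_a_tokens = [[1.0, 0.0, 1.0], [3.0, 2.0, 1.0], [0.0, 0.0, 0.0]]
sentence_b_tokens = [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0], [9.0, 9.0, 9.0]]
mask_a = [1, 1, 0]
mask_b = [1, 1, 0]

embedding_a = mean_pool(sentence_a_tokens, mask_a)
embedding_b = mean_pool(sentence_b_tokens, mask_b)
print(embedding_a, embedding_b)
print(cosine_similarity(embedding_a, embedding_b))
```

Both sentences end up as vectors of the same length regardless of how many tokens they contain, which is what makes the pairwise comparison possible.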

Installation

pip install -U sentence-transformers

After installing, import the library:

from sentence_transformers import SentenceTransformer, util

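Function usage

util.paraphrase_mining

This function compares every sentence against every other sentence and returns a list of the pairs with the highest cosine similarity scores, sorted from most to least similar.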

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')
# Single list of sentences - Possible tens of thousands of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']
paraphrases = util.paraphrase_mining(model, sentences)
for paraphrase in paraphrases[0:10]:
    score, i, j = paraphrase
    print("{} \t\t {} \t\t Score: {:.4f}".format(sentences[i], sentences[j], score))

Output

Do you like pizza? 		 I love pasta 		 Score: 0.6845
The cat sits outside 		 The new movie is awesome 		 Score: 0.6035
The new movie is awesome 		 The new movie is so great 		 Score: 0.5867
The new movie is awesome 		 Do you like pizza? 		 Score: 0.5748
The new movie is so great 		 I love pasta 		 Score: 0.5489
I love pasta 		 The new movie is awesome 		 Score: 0.5480
A man is playing guitar 		 The cat plays in the garden 		 Score: 0.5179
The new movie is awesome 		 The cat plays in the garden 		 Score: 0.5111
The cat sits outside 		 Do you like pizza? 		 Score: 0.4982
The new movie is so great 		 Do you like pizza? 		 Score: 0.4945
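
The all-pairs comparison described above can be sketched naively in pure Python. This is not the library's implementation, which is built to handle much larger lists. It only mirrors the shape of the result: a list of `[score, i, j]` entries sorted by descending score. The function name `naive_paraphrase_mining` and the 2-dimensional toy embeddings are assumptions for illustration.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def naive_paraphrase_mining(embeddings, top_k=10):
    """Score every unordered pair once and return [score, i, j] lists, best first."""
    pairs = [
        [cosine_similarity(embeddings[i], embeddings[j]), i, j]
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs[:top_k]

# Toy embeddings: items 0 and 1 point in nearly the same direction, as do 2 and 3.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
top_pairs = naive_paraphrase_mining(toy_embeddings, top_k=3)
for score, i, j in top_pairs:
    print(f"{i} \t {j} \t Score: {score:.4f}")
```

With n sentences this computes n(n-1)/2 scores, so it is only practical for small lists.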

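Clustering

Introduction

Clustering is a data analysis technique that groups similar data points or objects into clusters. The goal is to uncover hidden structure and patterns in the data by gathering related points together, so that the dataset can be understood as a whole.

Using common clustering methods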

k-Means

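The number of clusters must be specified in advance. k-means assigns each sentence to its nearest cluster centroid. The resulting groups tend to be of comparable size, though equal sizes are not guaranteed.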
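
To make the mechanics concrete, here is a minimal pure-Python sketch of the standard k-means loop (Lloyd's algorithm) on made-up 2-D points. It is not scikit-learn's implementation. The function name `kmeans`, the toy points, and the fixed starting centroids are assumptions chosen to keep the run deterministic.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, max_iterations=100):
    """Alternate between assigning points and updating centroids until stable."""
    centroids = [list(c) for c in initial_centroids]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Four points near the origin and two points far away.
points = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [5.0, 5.0], [5.2, 4.9]]
assignments, centroids = kmeans(points, initial_centroids=[points[0], points[4]])
print(assignments)
```

The number of clusters is fixed by how many starting centroids are passed in, and in this toy run the two clusters contain four and two points respectively, so the groups are not equal in size.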

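from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def test_kmeans():
    """
    This is a simple application for sentence embeddings: clustering
    Sentences are mapped to sentence embeddings and then k-means clustering is applied.
    """
    # model_path3 is the author's own variable (a model name or local path); it is not defined in this excerpt.
    embedder = SentenceTransformer(model_path3)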

    # Corpus with example sentences
    corpus = ['A man is eating food.',
              'A man is eating a piece of bread.',
              'A man is eating pasta.',
              'The girl is carrying a baby.',
              'The baby is carried by the woman',
              'A man is riding a horse.',
              'A man is riding a white horse on an enclosed ground.',
              'A monkey is playing drums.',
              'Someone in a gorilla costume is playing a set of drums.',
              'A cheetah is running behind its prey.',
              'A cheetah chases prey on across a field.'
              ]
    corpus_embeddings = embedder.encode(corpus)

    # Perform k-means clustering
    num_clusters = 5
    clustering_model = KMeans(n_clusters=num_clusters)
    clustering_model.fit(corpus_embeddings