Finding SMS templates in a large volume of SMS messages

破坏地板

Last modified: 2025-04-02 14:36:24

Tags: python, development languages

First published: 2023-07-14 17:41:10

Article link: https://blog.youkuaiyun.com/m0_57141312/article/details/131728507

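This article describes how, when processing a large amount of data, the author moved from SequenceMatcher, which froze the computer, to MinHashLSH, which ran significantly faster. The author ran into a problem where the number of templates exceeded the number of records, and resolved it. It also stresses the importance of tuning the similarity threshold and notes that the final templates must be edited by hand to remove the variable parts. Preprocessing steps include removing special characters and unifying case to improve matching.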

I first used SequenceMatcher for the search, then switched to MinHashLSH, which was significantly faster.
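For comparison, here is a minimal sketch of what a SequenceMatcher-based search might look like. It is not the original code, which the post does not show; the function name `group_with_sequence_matcher`, the greedy grouping strategy, and the sample messages are assumptions made for illustration. It counts the `ratio()` calls to show why this approach slows down: every message is compared against every template found so far.

```python
from difflib import SequenceMatcher


def group_with_sequence_matcher(messages, threshold=0.65):
    """Greedy grouping: compare each message with every template found so far."""
    templates = {}    # template text -> number of messages assigned to it
    comparisons = 0   # number of ratio() calls, to show how the work grows
    for message in messages:
        best_template, best_ratio = None, 0.0
        for template in templates:
            comparisons += 1
            ratio = SequenceMatcher(None, template, message).ratio()
            if ratio > best_ratio:
                best_template, best_ratio = template, ratio
        if best_template is not None and best_ratio >= threshold:
            templates[best_template] += 1
        else:
            # No template is similar enough: this message starts a new one
            templates[message] = 1
    return templates, comparisons


# Hypothetical sample data, upper-cased as in the rest of the post
sample = [
    "YOUR CODE IS 1234, VALID FOR 5 MINUTES",
    "YOUR CODE IS 9876, VALID FOR 5 MINUTES",
    "PARCEL READY",
]
templates, comparisons = group_with_sequence_matcher(sample)
print(templates)
print(comparisons)
```

With many distinct templates the inner loop runs once per template for every message, and each `ratio()` call is itself costly on long strings. An LSH index avoids most of these comparisons by only returning likely candidates.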

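Notes:
1. Choose an appropriate similarity threshold.
2. The templates in the final output still need manual editing: replace the variable parts with a wildcard that can be matched. I used .*, which can then be matched with contains().
3. It is best to convert the text to a single case.
4. If the messages vary widely in length, consider computing the character length and setting a different similarity threshold for each length range.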
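As an illustration of note 2, the sketch below turns a hand-edited template containing `.*` into a compiled regular expression using only the standard `re` module. The helper name `template_to_regex` and the sample template are hypothetical. The literal parts are escaped on the assumption that SMS text may contain characters such as `.` or `(` that would otherwise be read as regex syntax.

```python
import re


def template_to_regex(template, wildcard=".*"):
    """Compile a hand-edited template into a regex.

    Literal parts are escaped so that characters such as '.' or '(' in the
    SMS text match literally; only the wildcard keeps its regex meaning.
    """
    literal_parts = template.split(wildcard)
    pattern = wildcard.join(re.escape(part) for part in literal_parts)
    # DOTALL lets the wildcard span line breaks inside a message
    return re.compile(pattern, re.DOTALL)


# Hypothetical template: the variable code was replaced with ".*" by hand
template = "YOUR CODE IS .*, VALID FOR 5 MINUTES."
matcher = template_to_regex(template)

messages = [
    "YOUR CODE IS 1234, VALID FOR 5 MINUTES.",
    "YOUR CODE IS 987654, VALID FOR 5 MINUTES.",
    "YOUR PARCEL HAS ARRIVED.",
]
matched = [m for m in messages if matcher.fullmatch(m)]
print(matched)
```

If the messages live in a pandas column, the same pattern string (`matcher.pattern`) could be passed to `Series.str.contains`, which treats its argument as a regular expression by default.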

import csv

# Import MinHash and MinHashLSH from datasketch to estimate similarity and build the LSH index
from datasketch import MinHash, MinHashLSH
import pandas as pd


def preprocess_text(text):
    return text.split()


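def find_templates(data, similarity_threshold):
    # Create the LSH index
    lsh = MinHashLSH(threshold=similarity_threshold, num_perm=128)
    templates = {}
    minhashes = {}
    for content in data:
        preprocessed_content = preprocess_text(content)
        # Build a MinHash signature from the tokens of this message
        m = MinHash(num_perm=128)
        for word in preprocessed_content:
            m.update(word.encode('utf8'))
        # Check whether a similar template is already indexed
        result = lsh.query(m)
        if result:
            # query() returns candidates in no particular order, so pick the
            # one with the highest estimated Jaccard similarity and count it
            best = max(result, key=lambda key: m.jaccard(minhashes[key]))
            templates[best] += 1
        else:  # Otherwise, register this message as a new template
            lsh.insert(content, m)
            templates[content] = 1
            minhashes[content] = m
    return templates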


# Open the file and read the data, upper-casing each message
with open('D:/file_p/data_ms_daxie.csv', 'r', encoding='utf-8') as file:
    reader = csv.DictReader(file)
    content_list = [row['content'].upper() for row in reader]


# Call `find_templates` with `content_list` and `similarity_threshold` to get the `templates` dictionary.
similarity_threshold = 0.65
templates = find_templates(content_list, similarity_threshold)


# Sort the templates by count, most frequent first, and load them into a DataFrame
sorted_templates = sorted(templates.items(), key=lambda x: x[1], reverse=True)
df = pd.DataFrame(sorted_templates, columns=['template', 'count'])
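The script above only upper-cases the text and uses a single threshold. The sketch below shows one way the special-character cleanup and the per-length thresholds from note 4 could be added. The helper names, the bucket boundaries, and the threshold values are all hypothetical placeholders, not tuned recommendations.

```python
import re


def normalize(text):
    """Upper-case the text and replace punctuation and special characters with spaces."""
    text = text.upper()
    # \w is Unicode-aware in Python 3, so Chinese characters and digits are kept
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


# Hypothetical buckets: (maximum length in characters, similarity threshold)
LENGTH_THRESHOLDS = [(30, 0.80), (80, 0.70), (float("inf"), 0.65)]


def threshold_for(text):
    """Return the threshold of the first length bucket the text fits into."""
    for max_length, threshold in LENGTH_THRESHOLDS:
        if len(text) <= max_length:
            return threshold


def bucket_by_threshold(messages):
    """Group normalized messages by the threshold that applies to their length."""
    buckets = {}
    for message in messages:
        cleaned = normalize(message)
        buckets.setdefault(threshold_for(cleaned), []).append(cleaned)
    return buckets


buckets = bucket_by_threshold(["Your code: 1234!!", "x" * 100])
print(buckets)
```

Each bucket could then be passed to `find_templates` together with its own threshold, so that short and long messages are indexed separately.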