Requirement:
There is a very large log file, say a 10 GB text file, possibly containing non-UTF-8 bytes. The search patterns are organized by category, e.g. category 1: pattern, category 2: pattern.
The search results must be grouped by category, and each result must include the matched line together with its line number.
import mmap
import re

patterns = {
    "Pattern1": r"regex_pattern1",
    "Pattern2": r"regex_pattern2",
    # Add more patterns as needed
}

def search_patterns(file_path, patterns):
    # One result list per category
    results = {pattern_name: [] for pattern_name in patterns}
    # Open the file as a memory-mapped file
    with open(file_path, "rb") as file:
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mapped_file:
            # Decode with errors="ignore"; otherwise non-UTF-8 bytes raise an exception
            text = mapped_file.read().decode("utf-8", errors="ignore")
            # Search for each pattern using regular expressions
            for pattern_name, pattern_regex in patterns.items():
                for match in re.finditer(pattern_regex, text):
                    # Line number of the match: count '\n' before the match start
                    line_number = text[:match.start()].count("\n") + 1
                    # Record the line number and the matched text
                    results[pattern_name].append((line_number, match.group(0)))
    return results
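One caveat with the snippet above: `mapped_file.read().decode()` materializes the entire file as a single Python string, so a 10 GB log needs well over 10 GB of RAM. A line-by-line sketch (the function name `search_patterns_streaming` is my own) keeps memory bounded, at the cost of not matching patterns that span lines:

```python
import re
import tempfile

# Streaming variant: read the file line by line instead of decoding the whole
# file into one Python string. enumerate() tracks line numbers directly, which
# also avoids the O(n) count("\n") scan per match.
def search_patterns_streaming(file_path, patterns):
    results = {pattern_name: [] for pattern_name in patterns}
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    with open(file_path, "rb") as file:
        for line_number, raw_line in enumerate(file, start=1):
            # Decode each line on its own; non-UTF-8 bytes are dropped
            line = raw_line.decode("utf-8", errors="ignore")
            for pattern_name, regex in compiled.items():
                if regex.search(line):
                    results[pattern_name].append((line_number, line.rstrip("\n")))
    return results

# Quick demonstration on a small temporary file
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as f:
    f.write(b"ok line\nERROR disk full\nok again\nWARN low memory\n")
    path = f.name

demo = search_patterns_streaming(path, {"errors": r"ERROR", "warnings": r"WARN"})
print(demo["errors"])    # [(2, 'ERROR disk full')]
print(demo["warnings"])  # [(4, 'WARN low memory')]
```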
Going one step further: if the requirement is to merge similar matched texts, first strip the volatile parts of each line, such as trace line numbers and log timestamps, before applying the processing below. That step depends on the actual log format.
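The stripping step can be sketched as follows. The timestamp format `2023-01-05 12:34:56,789` and the `file.py:123` trace form are assumptions here; adjust both regexes to the actual log format:

```python
import re

# Mask volatile parts of a log line (timestamps, trace line numbers) so that
# similarity comparison only sees the stable message text.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?")
TRACE_LINENO_RE = re.compile(r"(\w+\.(?:py|c|cpp|java)):\d+")

def normalize_line(line):
    line = TIMESTAMP_RE.sub("<TS>", line)        # mask timestamps
    line = TRACE_LINENO_RE.sub(r"\1:<N>", line)  # mask trace line numbers
    return line

print(normalize_line("2023-01-05 12:34:56,789 ERROR worker.py:88 retry failed"))
# <TS> ERROR worker.py:<N> retry failed
```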
# Compare similarity within the results; merge entries that are similar
# pip3 install python-Levenshtein
import mmap
import re
from Levenshtein import distance as lev_distance

def search_patterns(file_path, patterns):
    # One result list per category
    results = {pattern_name: [] for pattern_name in patterns}
    # Open the file as a memory-mapped file
    with open(file_path, "rb") as file:
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mapped_file:
            text = mapped_file.read().decode("utf-8", errors="ignore")
            for pattern_name, pattern_regex in patterns.items():
                for match in re.finditer(pattern_regex, text):
                    # Line number of the match
                    line_number = text[:match.start()].count("\n") + 1
                    # Check whether the match is similar to an existing result
                    is_similar = False
                    for i, result in enumerate(results[pattern_name]):
                        if lev_distance(match.group(0), result[1]) <= 5:  # threshold depends on the data
                            # Similar: bump the occurrence count instead of adding a new entry
                            results[pattern_name][i] = (result[0], result[1], result[2] + 1)
                            is_similar = True
                            break
                    # Not similar: add it to the results with an occurrence count of 1
                    if not is_similar:
                        results[pattern_name].append((line_number, match.group(0), 1))
    return results
The patterns definition is the same as before, so it is not repeated here.
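If installing python-Levenshtein is not an option, the merge step can also be sketched with the standard library's `difflib.SequenceMatcher`, whose `ratio()` is already a normalized 0-1 similarity. This is a swapped-in technique, not what the code above uses, and the 0.9 threshold is arbitrary:

```python
import difflib

# Merge similar matches using only the standard library.
# Each input item is (line_number, text); the output adds an occurrence count.
def merge_similar(matches, threshold=0.9):
    merged = []  # list of [line_number, text, count]
    for line_number, text in matches:
        for entry in merged:
            if difflib.SequenceMatcher(None, text, entry[1]).ratio() >= threshold:
                entry[2] += 1  # similar to an existing entry: bump its count
                break
        else:
            merged.append([line_number, text, 1])
    return [tuple(e) for e in merged]

matches = [
    (10, "retry failed for job 12"),
    (42, "retry failed for job 99"),
    (77, "connection reset by peer"),
]
print(merge_similar(matches))
# [(10, 'retry failed for job 12', 2), (77, 'connection reset by peer', 1)]
```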
About text similarity:
Use a string-distance algorithm to compare two text strings. These algorithms compute a numeric measure of the similarity or difference between two strings. Many algorithms are available, such as Levenshtein distance, the Jaccard index, and cosine similarity. You can use any of them, depending on the nature of the text being compared and your specific use case.
Here is an example of comparing the similarity of two text strings with the Levenshtein distance:
import Levenshtein

def compare_similarity(text1, text2):
    distance = Levenshtein.distance(text1, text2)
    similarity = 1 - (distance / max(len(text1), len(text2)))
    return similarity
In this example, we use the Levenshtein library to compute the edit distance between the two text strings text1 and text2. The similarity is then computed as 1 minus the normalized edit distance, where the normalized distance is the edit distance divided by the length of the longer string. This yields a value between 0 and 1, where 1 means the two strings are identical and 0 means they are completely different. You can tune the similarity threshold for your specific use case to decide how similar two strings must be to count as a match.
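The Jaccard index mentioned above can be sketched in a few lines: treat each string as a set of tokens and divide the size of the overlap by the size of the union. Word-level tokenization is one choice among several:

```python
# Jaccard similarity on word tokens: |A ∩ B| / |A ∪ B|
def jaccard_similarity(text1, text2):
    a, b = set(text1.split()), set(text2.split())
    if not a and not b:
        return 1.0  # treat two empty strings as identical
    return len(a & b) / len(a | b)

print(jaccard_similarity("retry failed for job", "retry failed for task"))  # 0.6
```

Unlike edit distance, this ignores word order entirely, which can be an advantage or a drawback depending on the log format.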
If you prefer not to depend on the Levenshtein package, here is a hand-written implementation. Its performance is untested, e.g. on very long strings with large differences.
# Manual implementation of Levenshtein distance
def levenshtein_distance(s, t):
    # Lengths of the two strings
    m, n = len(s), len(t)
    # 2D matrix holding the intermediate results
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialize the first row and first column
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    # Fill in the edit distances
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n]

def compare_similarity(text1, text2):
    distance = levenshtein_distance(text1, text2)
    similarity = 1 - (distance / max(len(text1), len(text2)))
    return similarity
s1 = "hello kitty"
s2 = "hello mykit"
similarity = compare_similarity(s1, s2)
print(similarity)  # 0.6363636363636364
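On the long-string concern: the full matrix above needs O(m*n) memory, which hurts for very long lines. Since each row only reads the row before it, a two-row variant (a common optimization, not part of the original snippet) returns the same result in O(n) memory:

```python
# Two-row variant of the DP above: only the previous row is kept,
# so memory drops from O(m*n) to O(n) while the result is identical.
def levenshtein_distance_2row(s, t):
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # dp row for i-1; dp[0][j] = j
    for i in range(1, m + 1):
        curr = [i] + [0] * n   # dp row for i; dp[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n]

print(levenshtein_distance_2row("hello kitty", "hello mykit"))  # 4
```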