Requirement:
There is a very large log file, say a 10 GB text file, possibly containing non-UTF-8 bytes. The search patterns are organized by category, e.g. category 1: pattern, category 2: pattern.
The search results must be grouped by category, and each result must include the matched line together with its line number.
import mmap
import re

patterns = {
    "Pattern1": r"regex_pattern1",
    "Pattern2": r"regex_pattern2",
    # Add more patterns as needed
}

def search_patterns(file_path, patterns):
    # One result list per category
    results = {pattern_name: [] for pattern_name in patterns}
    # Open the file as a memory-mapped file
    with open(file_path, "rb") as file:
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mapped_file:
            # Decode with errors="ignore"; otherwise non-UTF-8 bytes raise an exception
            text = mapped_file.read().decode("utf-8", errors="ignore")
            # Search for each pattern using regular expressions
            for pattern_name, pattern_regex in patterns.items():
                for match in re.finditer(pattern_regex, text):
                    # Line number of the match: count '\n' before the match start
                    line_number = text[:match.start()].count("\n") + 1
                    # Record the line number and the matched text
                    results[pattern_name].append((line_number, match.group(0)))
    return results
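One caveat with the snippet above: `mapped_file.read().decode()` materializes the entire file as a single Python string, so a 10 GB log needs well over 10 GB of RAM. A line-by-line sketch (the function name `search_patterns_streaming` is my own) keeps memory bounded, at the cost of not matching patterns that span lines:

```python
import re
import tempfile

# Streaming variant: read the file line by line instead of decoding the whole
# file into one Python string. enumerate() tracks line numbers directly, which
# also avoids the O(n) count("\n") scan per match.
def search_patterns_streaming(file_path, patterns):
    results = {pattern_name: [] for pattern_name in patterns}
    compiled = {name: re.compile(rx) for name, rx in patterns.items()}
    with open(file_path, "rb") as file:
        for line_number, raw_line in enumerate(file, start=1):
            # Decode each line on its own; non-UTF-8 bytes are dropped
            line = raw_line.decode("utf-8", errors="ignore")
            for pattern_name, regex in compiled.items():
                if regex.search(line):
                    results[pattern_name].append((line_number, line.rstrip("\n")))
    return results

# Quick demonstration on a small temporary file
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as f:
    f.write(b"ok line\nERROR disk full\nok again\nWARN low memory\n")
    path = f.name

demo = search_patterns_streaming(path, {"errors": r"ERROR", "warnings": r"WARN"})
print(demo["errors"])    # [(2, 'ERROR disk full')]
print(demo["warnings"])  # [(4, 'WARN low memory')]
```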
Going one step further: if the requirement is to merge similar matched texts, first strip the volatile parts of each line, such as trace line numbers and log timestamps, before applying the processing below. That step depends on the actual log format.
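The stripping step can be sketched as follows. The timestamp format `2023-01-05 12:34:56,789` and the `file.py:123` trace form are assumptions here; adjust both regexes to the actual log format:

```python
import re

# Mask volatile parts of a log line (timestamps, trace line numbers) so that
# similarity comparison only sees the stable message text.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?")
TRACE_LINENO_RE = re.compile(r"(\w+\.(?:py|c|cpp|java)):\d+")

def normalize_line(line):
    line = TIMESTAMP_RE.sub("<TS>", line)        # mask timestamps
    line = TRACE_LINENO_RE.sub(r"\1:<N>", line)  # mask trace line numbers
    return line

print(normalize_line("2023-01-05 12:34:56,789 ERROR worker.py:88 retry failed"))
# <TS> ERROR worker.py:<N> retry failed
```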
# Compare similarity within the results; merge entries that are similar
# pip3 install python-Levenshtein
import mmap
import re
from Levenshtein import distance as lev_distance

def search_patterns(file_path, patterns):
    # One result list per category
    results = {pattern_name: [] for pattern_name in patterns}
    # Open the file as a memory-mapped file
    with open(file_path, "rb") as file:
        with mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as mapped_file:
            text = mapped_file.read().decode("utf-8", errors="ignore")
            for pattern_name, pattern_regex in patterns.items():
                for match in re.finditer(pattern_regex, text):
                    # Line number of the match
                    line_number = text[:match.start()].count("\n") + 1
                    # Check whether the match is similar to an existing result
                    is_similar = False
                    for i, result in enumerate(results[pattern_name]):
                        if lev_distance(match.group(0), result[1]) <= 5:  # threshold depends on the data
                            # Similar: bump the occurrence count instead of adding a new entry
                            results[pattern_name][i] = (result[0], result[1], result[2] + 1)
                            is_similar = True
                            break
                    # Not similar: add it to the results with an occurrence count of 1
                    if not is_similar:
                        results[pattern_name].append((line_number, match.group(0), 1))
    return results
The patterns definition is the same as before, so it is not repeated here.
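If installing python-Levenshtein is not an option, the merge step can also be sketched with the standard library's `difflib.SequenceMatcher`, whose `ratio()` is already a normalized 0-1 similarity. This is a swapped-in technique, not what the code above uses, and the 0.9 threshold is arbitrary:

```python
import difflib

# Merge similar matches using only the standard library.
# Each input item is (line_number, text); the output adds an occurrence count.
def merge_similar(matches, threshold=0.9):
    merged = []  # list of [line_number, text, count]
    for line_number, text in matches:
        for entry in merged:
            if difflib.SequenceMatcher(None, text, entry[1]).ratio() >= threshold:
                entry[2] += 1  # similar to an existing entry: bump its count
                break
        else:
            merged.append([line_number, text, 1])
    return [tuple(e) for e in merged]

matches = [
    (10, "retry failed for job 12"),
    (42, "retry failed for job 99"),
    (77, "connection reset by peer"),
]
print(merge_similar(matches))
# [(10, 'retry failed for job 12', 2), (77, 'connection reset by peer', 1)]
```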
About text similarity:
Use a string-distance algorithm to compare two text strings. These algorithms compute a numeric measure of the similarity or difference between two strings. Many algorithms are available, such as Levenshtein distance, the Jaccard index, and cosine similarity. You can use any of them, depending on the nature of the text being compared and your specific use case.
Here is an example of comparing the similarity of two text strings with the Levenshtein distance:
import Levenshtein

def compare_similarity(text1, text2):
    distance = Levenshtein.distance(text1, text2)
    similarity = 1 - (distance / max(len(text1), len(text2)))
    return similarity
In this example, we use the Levenshtein library to compute the edit distance between the two text strings text1 and text2. The similarity is then computed as 1 minus the normalized edit distance, where the normalized distance is the edit distance divided by the length of the longer string. This yields a value between 0 and 1, where 1 means the two strings are identical and 0 means they are completely different. You can tune the similarity threshold for your specific use case to decide how similar two strings must be to count as a match.
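The Jaccard index mentioned above can be sketched in a few lines: treat each string as a set of tokens and divide the size of the overlap by the size of the union. Word-level tokenization is one choice among several:

```python
# Jaccard similarity on word tokens: |A ∩ B| / |A ∪ B|
def jaccard_similarity(text1, text2):
    a, b = set(text1.split()), set(text2.split())
    if not a and not b:
        return 1.0  # treat two empty strings as identical
    return len(a & b) / len(a | b)

print(jaccard_similarity("retry failed for job", "retry failed for task"))  # 0.6
```

Unlike edit distance, this ignores word order entirely, which can be an advantage or a drawback depending on the log format.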
If you prefer not to depend on the Levenshtein package, here is a hand-written implementation. Its performance is untested, e.g. on very long strings with large differences.
# Manual implementation of Levenshtein distance
def levenshtein_distance(s, t):
    # Lengths of the two strings
    m, n = len(s), len(t)
    # 2D matrix holding the intermediate results
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialize the first row and first column
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    # Fill in the edit distances
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n]

def compare_similarity(text1, text2):
    distance = levenshtein_distance(text1, text2)
    similarity = 1 - (distance / max(len(text1), len(text2)))
    return similarity
s1 = "hello kitty"
s2 = "hello mykit"
similarity = compare_similarity(s1, s2)
print(similarity)  # 0.6363636363636364
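On the long-string concern: the full matrix above needs O(m*n) memory, which hurts for very long lines. Since each row only reads the row before it, a two-row variant (a common optimization, not part of the original snippet) returns the same result in O(n) memory:

```python
# Two-row variant of the DP above: only the previous row is kept,
# so memory drops from O(m*n) to O(n) while the result is identical.
def levenshtein_distance_2row(s, t):
    m, n = len(s), len(t)
    prev = list(range(n + 1))  # dp row for i-1; dp[0][j] = j
    for i in range(1, m + 1):
        curr = [i] + [0] * n   # dp row for i; dp[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n]

print(levenshtein_distance_2row("hello kitty", "hello mykit"))  # 4
```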