日志高精度提取规则智能体框架的初尝试 LogParserX: Qwen-2.5-72B-instruct (any LLMs) + CrewAI, 如何超光速从无到万物生？

WHOAREIMMORTALS

已于 2025-03-12 18:05:05 修改

阅读量1k

点赞数 20

文章标签： python 安全数据分析

于 2025-03-12 18:01:45 首次发布

本文链接：https://blog.youkuaiyun.com/daxuanzi515/article/details/145742666

版权

LogParserX

这是我给这个辣鸡框架起的名字，不懂，也不想懂为什么叫这个很奇葩的名字！
~~能不能一枪爱死这个题啊？~~
我在写代码的心情belike: 👻👻👻
在这里插入图片描述

前置知识

以下是可能用到的知识，不过不知道也无所谓，边看边问AI也不是不行。

正则表达式
日志基本结构
AI智能体 (~~不是强化学习的智能体~~)
科技 AND 中转API平台
conda环境和CPU

任务背景

给定一系列日志放入json格式中，如以下结构：

{
	{
        "logId": 371,
        "logText": "<128>April 25 19:52:48 2013 apt APT~2~1~2013-04-25 17:28:02~192.168.58.200:36720~192.168.58.102:80~WEB攻击~异常文件扩展名字访问~NULL~中~1304251728020000001~ NULL~POST /apt/files?action=add",
        "logField": [
            {
                "key": "",
                "value": "April 25 19:52:48 2013"
            },
            {
                "key": "",
                "value": "APT"
            },
            {
                "key": "",
                "value": "2013-04-25 17:28:02"
            },
            {
                "key": "",
                "value": "192.168.58.200:36720"
            },
            {
                "key": "",
                "value": "WEB攻击"
            },
            {
                "key": "",
                "value": "异常文件扩展名字访问"
            }
        ]
    }
}

需要根据logText部分生成一些高精度规则用于提取logField部分。规则格式不限，但是必须要能够用于提取无logField的日志明文。某比赛说要求使用Qwen-2.5-72b-instruct API交互。
~~那我问你，那我问你，为什么不提供免费API给我做？（愤怒地）~~

要求分成两个阶段，一个智能体学习阶段A，一个规则提取阶段B：
A生成高精度提取规则，B使用这个规则对验证集进行提取。
A阶段输入含logField的日志集合，输出一系列高精度规则+使用API对话的所有轨迹记录；
B阶段要求输入不含logField的日志集合+A阶段的高精度规则，输出含logField的日志集合。

环境准备

我在两台设备上都装了环境, 这里只放Ubuntu20.04的，都需要安装docker用于代码沙箱测试。
其他的都默认安装，最后参考requirements.txt

python >= 3.12.9 (硬性要求 >=3.12.0)
crewai==0.102.0
crewai_tools==0.36.0
langchain_openai==0.3.6
litellm==1.61.9
pydantic==2.10.6
python-dotenv==1.0.1
VSCode 最新版 (fittenCode 补全很强但是干扰性更强 / CodeGeeX 适中补全)

任务分析

一看这要求，要求你使用智能体框架，并且更新高精度规则，那么很自然地就想到分成几个不同部分的智能体并分配工作。

实验对比

官方标准：

字段抽取正确: 解析结果中字段的 key 和 value 与人工打标的对应字段相等, 则认为字段抽取正确。
匹配日志: 对于每条日志, 人工打标的结果中至少有一个字段抽取正确, 认为该条日志能被规则匹配。
完全匹配日志: 对于每条日志,人工打标的结果中所有字段均抽取正确, 认为该条日志能被规则完全匹配。

基于字段抽取和日志匹配情况, 可以计算匹配率与完全正确率进行最终打分:

匹配率 = (匹配日志数 / 总日志数) * 100%
完全正确率 = (完全匹配日志数 / 总日志数) * 100%

metric = 匹配率 * 0.4 + 完全正确率 * 0.6

按照道理来说完全匹配率越高越好，但实际上的情况是不可能完全匹配，则是尽可能输出多的key-value对用于计算覆盖率，就是尽量覆盖参考值的内容，尽管有些不属于参考值。就是Soundness的概念，尽可能输出多的解决方案来覆盖检查点Truth*，尽管有错，但是包含Truth的值。

人工生成正则筛选器

通过人工分类得到默认的正则表达式列表，用于几种不同类型的日志进行提取分析。这里我分别选取几种不同范围的日志来检测我人工手动建立的知识库的覆盖率：
Coverage: 指我生成的部分能覆盖标准值的比例均值
Matched: 有key-value命中的比例
Perfect_Macthed: 完全一致的比例
实验表格如下：

Index	Coverage(Mine)	Matched(Official)	Perfect_Macthed(Official)	70%< Coverage Count(Mine)
0 - 5	91.0%	100.0%	0.0%	0
0 - 10	91.0%	100.0%	0.0%	0
0 - 100	84.6%	100.0%	1.0%	15
0 - 400	79.6%	98.2%	0.2%	106
100 - 200	81.1%	99.0%	0.0%	27
120 - 125	92.0%	100.0%	0.0%	0
200 - 300	71.8%	99.0%	0.0%	40
300 - 400	71.0%	99.0%	0.0%	39

我的预编译知识库：
A.pattern.py 手工正则用于提供学习资料

key_value_p = r"""
        (?:                        # 起始分隔符检测
        (?<=[;,:,=(\-])|       # 关键修正：添加冒号:和连字符-作为合法分隔符
        ^)
        \s*                        # 允许前置空格
        (?P<key>                   # 键名规则
            (?![\d\-])             # 不能以数字或连字符开头
            [\w\s.-]+              # 允许字母/数字/空格/点/连字符
        )
        \s*=\s*                    # 等号两侧允许空格
        (?P<value>                 # 值部分
            (?:                   
                (?!\s*[,;)=\-])    # 排除前置分隔符（新增-）
                [^,;)=\-]+         # 基础匹配（新增排除-）
            )+
        )
        (?=                        # 截断预查
            \s*[,;)=\-]|           # 分隔符（新增-）
            \s*$|                  # 字符串结束
            (?=\S+\s*=)            # 后面紧跟新键（含空格键名）
        )
    """
# 时间：不带年份+带年份
date_p = r"\b[A-Za-z]{3}\s{1,2}\d{1,2}\s\d{4}\s\d{2}:\d{2}:\d{2}\b"
date_p_ = r"""\b([A-Za-z]+ \d{1,2} \d{4} \d{2}:\d{2}:\d{2})\b"""
date_p_2 = r"([A-Za-z]{3})\s+ (\d{1,2})\s+(\d{4})\s+(\d{2}):(\d{2}):(\d{2})([+-]\d{2}):(\d{2})"
date_p_3 = r"(\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}:\d{2}(?:[+-]\d{2}:\d{2})?)"
# 主机名字
hostname_p = r"(?<=:\d{2}) ([a-zA-Z0-9._-]+)*(?=\s)"
# 进程ID
pid_p = r"([a-zA-Z0-9_-]+)\[(\d+)\]"
pid_p_2 = r"(\S+)\s+\[(.*?)\]"
# 端口号
# from {ip} port {port}
ip_port_p = r"(\d+\.\d+\.\d+\.\d+)\s+port\s+(\d+)"
# ip(port)
ip_port_p_2 = r"(\d+\.\d+\.\d+\.\d+)(?:\((\d+)\))?"
# ip:port
ip_port_p_3 = r"(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d{2}|2[0-4]\d|25[0-5]):([0-9]|[1-9]\d|[1-9]\d{2}|[1-9]\d{3}|[1-5]\d{4}|6[0-4]\d{3}|65[0-4]\d{2}|655[0-2]\d|6553[0-5])$"
# 会话ID
session_p = r"session (\d+)"
# 函数调用
function_p = r"(?!%%.*)([a-zA-Z0-9_-]+)\((.*?)\)"
# 90-09-10-20
WebPort_p = r"(\d{1,3}-\d{1,3}-\d{1,3}-\d{1,3})"
# XXX/YYYY 
slash_pattern = r"([^,/]+)\/([^,]+)"
# user-agent
user_agent_p = r"Mozilla/5\.0\s*\([^)]+\)\s*(?:AppleWebKit/[\d\.]+\s*\([^)]+\)\s*Chrome/[\d\.]+\s*Safari/[\d\.]+|[\w\s]+/[\d\.]+)"
# HTTP响应码
HTTPS_code_p = r"HTTP/S响应码/(\d+)"
# attack info
web_attack_p = r"WEB攻击~([^~]+)~([^~]*)~([中高低]+)"
sys_attack_p = r"系统告警~+([^~]*)~+([^~]*)~+([中高低]+)~+(\d+)"
# json_str
json_str_p = r'''
    "([^"]+)"            # 键
    \s*:\s*              # 分隔符
    (                    # 值
        "(?:\\"|[^"])*"  # 字符串（支持转义）
        |$$.*?$$         # 数组
        |-?\d+           # 整数
        |-?\d+\.\d+      # 浮点数
        |true|false|null # 布尔/空值
    )'''
target_keys = {'类型', 'Host'}
segment_p = r"""
    ^\s*                    # 开头可能存在的空格
    ({})                    # 捕获目标键（类型|Host|解析域名）
    \s*:\s*                 # 冒号及两侧空格
    (.+?)                   # 非贪婪捕获值
    \s*$                    # 结尾可能存在的空格
""".format('|'.join(target_keys))
fangkuohao_p = r"\[(\d+)\]"
# 关键词提取
key_words_p = r"\b(root|system\-logind|systemd|APT|run\-parts|URL地址|发生时间|服务器IP|服务器端口|主机名|攻击特征串|触发规则|访问唯一编号|国家|事件|局域网|LAN|请求方法|标签|动作|威胁|POST数据|省|HTTP/S响应码)\b"

B.fast_tool.py 代码模板用于AI嵌入

import re
from functools import lru_cache
@lru_cache(maxsize=100)
def _compile_regex(pattern: str, flags: int = 0) -> re.Pattern:
    return re.compile(pattern, flags)

def match_type_1(pattern: str, log_text: str) -> list:
    regex = _compile_regex(pattern)
    # Your can use findall() or finditer(), search()
    matches = regex.findall(log_text)
    results = []
    # Your codes or None
    for match in matches:
        results.append({"key": "", "value": match})
    return results
    
def match_type_2(pattern: str, log_text: str) -> list:
    regex = _compile_regex(pattern)
    # Your can use findall() or finditer(), search()
    matches = regex.findall(log_text)
    results = []
    # Your codes or None
    for key, value in matches:
        results.append({"key": key, "value": value})
    return results

def get_components(log_text):
    results = []
    # your codes here
    # example:
    possible_res = match_type_1(r'hostname=(?P<hostname>[^ ]+)', log_text)
    results.extend(possible_res)
    
    return results
    
# 函数调用例子
if __name__ == '__main__':
    log_text = "<128>May 16 14:54:09 2024 dbapp APT~30~1~2024-05-16 14:54:09~10.50.134.18:47013~1.1.1.1:53~远程控制~漏洞利用攻击事件~类型:    C&C~高~2405161454090000256~~请求DNS服务器 [1.1.1.1] 解析域名: oast.pro~~~0~4~2~60:db:15:73:46:01~00:00:5e:00:01:0a~0~Host: oast.pro~~~~成功~12~1~630~212002"
    res = get_components(log_text)
    print(res)

LogParserX框架

在这里插入图片描述
整体框架：

提取report里面的代码并验证它的提取率。
IO文件流：

其中B-D是agents交互llm生成, E-F用于验证。
生成中间量是trace：大模型交互日志，py codes：优化后的规则代码，report：提取结果。

Agents & Tasks

agents 均两两之间上下文开启，并允许代码编译。
输入的logText, logField为训练集的日志。

Pattern Checker & pattern check task

输入manual_patterns, logText, logField, 用于验证pattern的正确性并优化输出patterns.md

Code Generation & code generation task

输入patterns.md(默认)，py codes template, logText, logField, 输出优化后的output.py.

Code Validation & code validation task

输入output.py, logText, logField, 输出优化后的报告report.md，包含codes, 覆盖率分析等。

Testing Part

输入的logText, logField为测试集的日志。
这里主要是验证优化后的opt.py是否能正确提取测试集的logText, logField。

Code Extrator & code extrator task

输入report.md , 输出提取结果opt.py。

Testing Unit & testing unit task

输入opt.py, 测试集的logText, logField, 生成替换后的test.py, 输出测试结果result.txt。

输出 & 中间产物

使用 crewai 和 qwen-llm-api 生成正则表达式模式库，而不是手动模式。
主要的python文件包括：

运行框架：MergeRegexController.py
数据生成：DataGeneration.py，ClassDivision.py
测试单元：RegexChecker.py，Executor.py
知识库：faster_tool.py，pattern.py

最终结果：/output/opt/*.py

即时结果：/output/gen/patterns/*.md，/output/gen/reports/*.md，/output/gen/codes/*.py，/output/test/*.py

LogParserX输出结构

LogParserX
├── output
│   ├── gen
│   │   ├── codes
│   │   │   ├── output_0.py
│   │   │   ├── output_1.py
|   │   ├── patterns
|   │   │   ├── patterns_0.md
|   │   │   ├── patterns_1.md
|   │   ├── reports
|   │   │   ├── report_0.md
|   │   │   ├── report_1.md
|   ├── opt
│   │   ├── opt_0.py
│   │   ├── opt_1.py
|   ├── test
|   │   ├── test_0.py
│   │   ├── test_1.py

提示:

gen/codes/output_*.py: code_generator 生成的初始代码。它们是 markdown 格式的 python 代码。
gen/patterns/patterns_*.md: pattern_checker 生成的正则表达式模式。它们是 markdown 格式的正则表达式模式。
gen/reports/report_*.md: code_validator 生成的报告中的正则表达式和代码，为 markdown 格式的正则表达式报告。
opt/opt_*.py: code_validator 生成的优化代码，为 python 代码。
test/test_*.py: RegexChecker.py 生成的测试代码，为 python 代码。

输出例子

test.py 替换验证：

import re
import json
from functools import lru_cache

@lru_cache(maxsize=100)
def _compile_regex(pattern: str, flags: int = 0) -> re.Pattern:
    return re.compile(pattern, flags)

# Optimized patterns
patterns = {
    "key_value": r"""
        (?:                        # 起始分隔符检测
        (?<=[;:,=(\-])|       # 关键修正：添加冒号:和连字符-作为合法分隔符
        ^)
        \s*                        # 允许前置空格
        (?P<key>                   # 键名规则
            (?![\d\-])             # 不能以数字或连字符开头
            [\w\s.-]+              # 允许字母/数字/空格/点/连字符
        )
        \s*=\s*                    # 等号两侧允许空格
        (?P<value>                 # 值部分
            (?:                   
                (?!\s*[,;)=\-])    # 排除前置分隔符（新增-）
                [^,;)=\-]+         # 基础匹配（新增排除-）
            )+
        )
        (?=                        # 截断预查
            \s*[,;)=\-]|           # 分隔符（新增-）
            \s*$|                  # 字符串结束
            (?=\S+\s*=)            # 后面紧跟新键（含空格键名）
        )
    """,
    "date": r"\b[A-Za-z]{3}\s{1,2}\d{1,2}\s\d{4}\s\d{2}:\d{2}:\d{2}\b",
    "hostname": r"(?<=:\d{2}) ([a-zA-Z0-9._-]+)*(?=\s)",
    "pid": r"([a-zA-Z0-9_-]+)\[(\d+)\]",
    "ip_port": r"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}):(\d{1,5})",
    "session": r"session (\d+)",
    "function": r"(?!%%.*)([a-zA-Z0-9_-]+)\((.*?)\)",
    "HTTPS_code": r"HTTP/S响应码/(\d+)",
    "web_attack": r"WEB攻击~([^~]+)~([^~]*)~([中高低]+)",
    "sys_attack": r"系统告警~+([^~]*)~+([^~]*)~+([中高低]+)~+(\d+)",
    "json_str": r'''
        "([^"]+)"            # 键
        \s*:\s*              # 分隔符
        (                    # 值
            "(?:\\"|[^"])*"  # 字符串（支持转义）
            |\[.*?\]         # 数组
            |-?\d+           # 整数
            |-?\d+\.\d+      # 浮点数
            |true|false|null # 布尔/空值
        )
    ''',
    "target_keys": r"""
        ^\s*                    # 开头可能存在的空格
        ({})                    # 捕获目标键（类型|Host|解析域名）
        \s*:\s*                 # 冒号及两侧空格
        (.+?)                   # 非贪婪捕获值
        \s*$                    # 结尾可能存在的空格
    """.format('|'.join({'类型', 'Host'})),
    "fangkuohao": r"\[(\d+)\]",
    "key_words": r"\b(root|system\-logind|systemd|APT|run\-parts|URL地址|发生时间|服务器IP|服务器端口|主机名|攻击特征串|触发规则|访问唯一编号|国家|事件|局域网|LAN|请求方法|标签|动作|威胁|POST数据|省|HTTP/S响应码)\b"
}

# Define functions to match patterns
def match_key_value(log_text):
    compiled_re = _compile_regex(patterns['key_value'], re.VERBOSE)
    matches = compiled_re.finditer(log_text)
    results = []
    for match in matches:
        key = match.group('key').strip()
        value = match.group('value').strip()
        results.append({"key": key, "value": value})
    return results

def match_date(log_text):
    compiled_re = _compile_regex(patterns['date'])
    match = compiled_re.search(log_text)
    results = []
    if match:
        date = match.group(0)
        results.append({"key": "", "value": date})
    return results

def match_hostname(log_text):
    compiled_re = _compile_regex(patterns['hostname'])
    match = compiled_re.search(log_text)
    results = []
    if match:
        hostname = match.group(1).strip()
        results.append({"key": "", "value": hostname})
    return results

def match_pid(log_text):
    compiled_re = _compile_regex(patterns['pid'])
    match = compiled_re.search(log_text)
    results = []
    if match:
        pid = match.group(2).strip()
        results.append({"key": "", "value": pid})
    return results

def match_ip_port(log_text):
    compiled_re = _compile_regex(patterns['ip_port'])
    matches = compiled_re.finditer(log_text)
    results = []
    for match in matches:
        ip = match.group(1).strip()
        port = match.group(2).strip()
        results.append({"key": "服务器IP", "value": ip})
        results.append({"key": "服务器端口", "value": port})
    return results

def match_session(log_text):
    compiled_re = _compile_regex(patterns['session'])
    match = compiled_re.search(log_text)
    results = []
    if match:
        session = match.group(1).strip()
        results.append({"key": "session", "value": session})
    return results

def match_function(log_text):
    compiled_re = _compile_regex(patterns['function'])
    matches = compiled_re.finditer(log_text)
    results = []
    for match in matches:
        function = match.group(1).strip()
        args = match.group(2).strip()
        results.append({"key": function, "value": args})
    return results

def match_HTTPS_code(log_text):
    compiled_re = _compile_regex(patterns['HTTPS_code'])
    match = compiled_re.search(log_text)
    results = []
    if match:
        code = match.group(1).strip()
        results.append({"key": "HTTP/S响应码", "value": code})
    return results

def match_web_attack(log_text):
    compiled_re = _compile_regex(patterns['web_attack'])
    match = compiled_re.search(log_text)
    results = []
    if match:
        attack_type = match.group(1).strip()
        attack_info = match.group(2).strip()
        threat_level = match.group(3).strip()
        results.append({"key": "WEB攻击类型", "value": attack_type})
        results.append({"key": "WEB攻击信息", "value": attack_info})
        results.append({"key": "威胁等级", "value": threat_level})
    return results

def match_sys_attack(log_text):
    compiled_re = _compile_regex(patterns['sys_attack'])
    match = compiled_re.search(log_text)
    results = []
    if match:
        attack_type = match.group(1).strip()
        attack_info = match.group(2).strip()
        threat_level = match.group(3).strip()
        count = match.group(4).strip()
        results.append({"key": "系统告警类型", "value": attack_type})
        results.append({"key": "系统告警信息", "value": attack_info})
        results.append({"key": "威胁等级", "value": threat_level})
        results.append({"key": "次数", "value": count})
    return results

def match_json_str(log_text):
    compiled_re = _compile_regex(patterns['json_str'], re.VERBOSE)
    matches = compiled_re.finditer(log_text)
    results = []
    for match in matches:
        key = match.group(1).strip()
        value = match.group(2).strip()
        results.append({"key": key, "value": value})
    return results

def match_target_keys(log_text):
    compiled_re = _compile_regex(patterns['target_keys'], re.VERBOSE)
    matches = compiled_re.finditer(log_text)
    results = []
    for match in matches:
        key = match.group(1).strip()
        value = match.group(2).strip()
        results.append({"key": key, "value": value})
    return results

def match_fangkuohao(log_text):
    compiled_re = _compile_regex(patterns['fangkuohao'])
    matches = compiled_re.finditer(log_text)
    results = []
    for match in matches:
        number = match.group(1).strip()
        results.append({"key": "方括号内数字", "value": number})
    return results

def match_key_words(log_text):
    compiled_re = _compile_regex(patterns['key_words'])
    matches = compiled_re.finditer(log_text)
    results = []
    for match in matches:
        keyword = match.group(0).strip()
        results.append({"key": keyword, "value": keyword})
    return results

def get_components(log_text):
    results = []
    results.extend(match_date(log_text))
    results.extend(match_hostname(log_text))
    results.extend(match_pid(log_text))
    results.extend(match_ip_port(log_text))
    results.extend(match_session(log_text))
    results.extend(match_function(log_text))
    results.extend(match_HTTPS_code(log_text))
    results.extend(match_web_attack(log_text))
    results.extend(match_sys_attack(log_text))
    results.extend(match_json_str(log_text))
    results.extend(match_target_keys(log_text))
    results.extend(match_fangkuohao(log_text))
    results.extend(match_key_words(log_text))
    results.extend(match_key_value(log_text))
    return results

if __name__ == '__main__':
    log_text = f"""<178>Nov 15 14:22:33 10.50.81.60 DBAppWAF: 发生时间/2024-11-15 14:22:32,威胁/高,事件/SQL注入,请求方法/POST,URL地址/10.50.81.60:8000/login.php,POST数据/user=admin&password=123456,服务器IP/10.50.81.6,主机名/10.50.81.60:8000,服务器端口/8000,客户端IP/10.20.170.23,客户端端口/45678,客户端环境/Mozilla/5.0 [en] (X11, U; DBAPPSecurity 21.4.3),标签/SQL注入,动作/阻断,HTTP/S响应码/403,攻击特征串/login.php?user=admin&password=123456,触发规则/11010016,访问唯一编号/7425395334018236553,国家/LAN,省/,市/,XFF_IP/"""
    res = get_components(log_text)
    json_data = json.dumps(res, ensure_ascii=False)
    print(json_data)

运行代码结果：

cxx@cxx-Legion-Y9000P-IRX9:~/LogParserX$
[{"key": "", "value": "10.50.81.60"}, {"key": "服务器IP", "value": "10.50.81.60"}, {"key": "服务器端口", "value": "8000"}, {"key": "服务器IP", "value": "10.50.81.60"}, {"key": "服务器端口", "value": "8000"}, {"key": "HTTP/S响应码", "value": "403"}, {"key": "发生时间", "value": "发生时间"}, {"key": "威胁", "value": "威胁"}, {"key": "事件", "value": "事件"}, {"key": "请求方法", "value": "请求方法"}, {"key": "URL地址", "value": "URL地址"}, {"key": "POST数据", "value": "POST数据"}, {"key": "服务器IP", "value": "服务器IP"}, {"key": "主机名", "value": "主机名"}, {"key": "服务器端口", "value": "服务器端口"}, {"key": "标签", "value": "标签"}, {"key": "动作", "value": "动作"}, {"key": "HTTP/S响应码", "value": "HTTP/S响应码"}, {"key": "攻击特征串", "value": "攻击特征串"}, {"key": "触发规则", "value": "触发规则"}, {"key": "访问唯一编号", "value": "访问唯一编号"}, {"key": "国家", "value": "国家"}, {"key": "LAN", "value": "LAN"}, {"key": "省", "value": "省"}]

结果分析

目前使用了前两个数据集的前50条跑了结果（每次都可能不一样的结果）：
谁想跑剩下的部分也可以跑，我生成的测试集都有在仓库里面。~~谁是大怨种？~~

类别	覆盖率 (1.0)	匹配率 (1.0)	完全匹配率（1.0）	分数（1.0）	覆盖率<0.7 (条数)
class_1 (开发集)	0.77	0.96	0.26	0.54	16
class_1 (测试集)	0.76	0.96	0.22	0.516	18
class_2 (开发集)	0.55	0.96	0.08	0.432	26
class_2 (测试集)	0.55	0.96	0.06	0.42	26

主要是完全匹配率较低，尤其在复杂的组合下，很难得到完美的提炼规则。应该修改的是优化代码部分的结构，需要设计一个专用于分析为什么没有提取到对应内容的代理用于正则代码的分析工作，再把这个上下文传递给下一个代理用于代码优化。
如果直接衔接优化代码的代理就只能得道一次优化的代码结果和分析，主要是一般这里模型都会幻觉严重，就算没有完全匹配，它也会输出的是完全匹配的分数分析，这样就会影响下一步工作的判断和分析。
所以，这里应该注重的部分是内容分析而不是分数奖励，一味地对模型进行低分提示，并不能够很好地让它向正确的方向思考，我个人想法是针对一些案例提供md格式的纠错内容分析例子，旨在对于内容纠错的指向性提示。
硬编码硬筛应该可以弥补一点它的表现，但是我不想写硬编码…(~~哦漏！~~)
另外，关于智能体工作的部分，可以要求其在保证代码正确的前提下才使用代码编译器，不然速度会很慢。后面的工作就先不优化了，反正没进决赛，我Blue得要死。

关键代码

环境配置 .env

OPENAI_API_BASE="https://XXXX/api/v1"
OPENAI_API_KEY="sk-XXXXXXXXXXXXXXXXXXXXXX"
MODEL_NAME="openai/qwen2.5-72b-instruct"
Temperature=0.1
max_tokens=4096
# Path
root="src/LogParserX"
python_tool = "src/LogParserX/knowledge/faster_tool.py"
python_pattern = "src/LogParserX/knowledge/pattern.py"
output_file = "src/LogParserX/output/gen/codes/output_{}.py"
output_file_p = "src/LogParserX/output/gen/patterns/pattern_{}.md"
output_file_md = "src/LogParserX/output/gen/reports/report_{}.md"
opt_file = "src/LogParserX/output/opt/opt_{}.py"
data_set = "src/LogParserX/data/{}.json"

开发集 `dataset.json` 15条

[
    {
        "logId": 0,
        "logText": "<21>Aug 13 09:04:02 soc-32 systemd-logind: Removed session 3831379.",
        "logField": [
            {
                "key": "",
                "value": "Aug 13 09:04:02"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "systemd-logind"
            },
            {
                "key": "",
                "value": "3831379"
            }
        ]
    },
    {
        "logId": 1,
        "logText": "<21>Oct 28 18:00:09 soc-32 ntpdate[172578]: adjust time server 120.25.115.20 offset 0.000752 sec",
        "logField": [
            {
                "key": "",
                "value": "Oct 28 18:00:09"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "ntpdate"
            },
            {
                "key": "",
                "value": "172578"
            },
            {
                "key": "",
                "value": "120.25.115.20"
            }
        ]
    },
    {
        "logId": 2,
        "logText": "<21>Oct 28 17:58:09 soc-32 systemd: lgent.service: main process exited, code=exited, status=2/INVALIDARGUMENT",
        "logField": [
            {
                "key": "",
                "value": "Oct 28 17:58:09"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "systemd"
            },
            {
                "key": "code",
                "value": "exited"
            },
            {
                "key": "status",
                "value": "2/INVALIDARGUMENT"
            }
        ]
    },
    {
        "logId": 3,
        "logText": "<21>Aug 12 08:06:01 soc-32 sshd[16209]: Postponed publickey for root from 3.66.0.23 port 38316 ssh2 [preauth]",
        "logField": [
            {
                "key": "",
                "value": "Aug 12 08:06:01"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "sshd"
            },
            {
                "key": "",
                "value": "16209"
            },
            {
                "key": "",
                "value": "root"
            },
            {
                "key": "",
                "value": "3.66.0.23"
            },
            {
                "key": "",
                "value": "38316"
            },
            {
                "key": "",
                "value": "ssh2"
            },
            {
                "key": "",
                "value": "preauth"
            }
        ]
    },
    {
        "logId": 4,
        "logText": "<21>Aug 12 08:11:56 soc-32 sshd[33101]: pam_unix(sshd:session): session closed for user root",
        "logField": [
            {
                "key": "",
                "value": "Aug 12 08:11:56"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "sshd"
            },
            {
                "key": "",
                "value": "33101"
            }
        ]
    },
    {
        "logId": 5,
        "logText": "<21>Oct 28 17:57:09 soc-32 systemd-logind: New session 4996668 of user root.",
        "logField": [
            {
                "key": "",
                "value": "Oct 28 17:57:09"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "systemd-logind"
            },
            {
                "key": "",
                "value": "4996668"
            }
        ]
    },
    {
        "logId": 6,
        "logText": "<21>Aug 12 07:38:43 soc-32 sshd[138033]: Postponed publickey for root from 3.66.0.23 port 38140 ssh2 [preauth]",
        "logField": [
            {
                "key": "",
                "value": "Aug 12 07:38:43"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "sshd"
            },
            {
                "key": "",
                "value": "138033"
            },
            {
                "key": "",
                "value": "root"
            },
            {
                "key": "",
                "value": "3.66.0.23"
            },
            {
                "key": "",
                "value": "38140"
            },
            {
                "key": "",
                "value": "ssh2"
            },
            {
                "key": "",
                "value": "preauth"
            }
        ]
    },
    {
        "logId": 7,
        "logText": "<21>Jul 29 07:31:56 soc-32 sshd[60636]: Postponed publickey for root from 3.66.0.23 port 48454 ssh2 [preauth]",
        "logField": [
            {
                "key": "",
                "value": "Jul 29 07:31:56"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "sshd"
            },
            {
                "key": "",
                "value": "60636"
            },
            {
                "key": "",
                "value": "root"
            },
            {
                "key": "",
                "value": "3.66.0.23"
            },
            {
                "key": "",
                "value": "48454"
            },
            {
                "key": "",
                "value": "ssh2"
            },
            {
                "key": "",
                "value": "preauth"
            }
        ]
    },
    {
        "logId": 8,
        "logText": "<21>Jul 29 07:42:11 soc-32 sshd[89018]: Postponed publickey for root from 3.66.0.23 port 42736 ssh2 [preauth]",
        "logField": [
            {
                "key": "",
                "value": "Jul 29 07:42:11"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "sshd"
            },
            {
                "key": "",
                "value": "89018"
            },
            {
                "key": "",
                "value": "root"
            },
            {
                "key": "",
                "value": "3.66.0.23"
            },
            {
                "key": "",
                "value": "42736"
            },
            {
                "key": "",
                "value": "ssh2"
            },
            {
                "key": "",
                "value": "preauth"
            }
        ]
    },
    {
        "logId": 9,
        "logText": "<21>Aug 12 07:14:12 soc-32 sshd[71841]: Postponed publickey for root from 3.66.0.23 port 43604 ssh2 [preauth]",
        "logField": [
            {
                "key": "",
                "value": "Aug 12 07:14:12"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "sshd"
            },
            {
                "key": "",
                "value": "71841"
            },
            {
                "key": "",
                "value": "root"
            },
            {
                "key": "",
                "value": "3.66.0.23"
            },
            {
                "key": "",
                "value": "43604"
            },
            {
                "key": "",
                "value": "ssh2"
            },
            {
                "key": "",
                "value": "preauth"
            }
        ]
    },
    {
        "logId": 10,
        "logText": "Oct 29 00:00:01 soc-32 CROND[26434]: (root) CMD (/usr/lib64/sa/sa1 1 1)",
        "logField": [
            {
                "key": "",
                "value": "Oct 29 00:00:01"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "CROND"
            },
            {
                "key": "",
                "value": "26434"
            },
            {
                "key": "",
                "value": "root"
            },
            {
                "key": "",
                "value": "CMD"
            }
        ]
    },
    {
        "logId": 11,
        "logText": "<21>Aug 13 09:05:17 soc-32 systemd: lgent.service holdoff time over, scheduling restart.",
        "logField": [
            {
                "key": "",
                "value": "Aug 13 09:05:17"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "systemd"
            }
        ]
    },
    {
        "logId": 12,
        "logText": "<21>Jul 16 16:33:39 soc-32 systemd: Started Session 3405658 of user root.",
        "logField": [
            {
                "key": "",
                "value": "Jul 16 16:33:39"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "systemd"
            },
            {
                "key": "",
                "value": "3405658"
            },
            {
                "key": "",
                "value": "root"
            }
        ]
    },
    {
        "logId": 13,
        "logText": "<21>Jul 29 07:12:58 soc-32 sshd[7246]: Postponed publickey for root from 3.66.0.23 port 35052 ssh2 [preauth]",
        "logField": [
            {
                "key": "",
                "value": "Jul 29 07:12:58"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "sshd"
            },
            {
                "key": "",
                "value": "7246"
            },
            {
                "key": "",
                "value": "3.66.0.23"
            },
            {
                "key": "",
                "value": "35052"
            },
            {
                "key": "",
                "value": "ssh2"
            },
            {
                "key": "",
                "value": "preauth"
            }
        ]
    },
    {
        "logId": 14,
        "logText": "<21>Oct 28 10:11:01 soc-32 CROND[2100]: (root) CMD (/usr/bin/bash /data/soc/soc_upgrade_dir/scripts/check_status.sh &> /dev/null)",
        "logField": [
            {
                "key": "",
                "value": "Oct 28 10:11:01"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "CROND"
            },
            {
                "key": "",
                "value": "2100"
            },
            {
                "key": "",
                "value": "root"
            },
            {
                "key": "",
                "value": "CMD"
            }
        ]
    },
    {
        "logId": 15,
        "logText": "<21>Jul 29 16:57:28 soc-32 systemd: Started Client agent got collecting & sending logs & metrics..",
        "logField": [
            {
                "key": "",
                "value": "Jul 29 16:57:28"
            },
            {
                "key": "",
                "value": "soc-32"
            },
            {
                "key": "",
                "value": "systemd"
            }
        ]
    }
 ]

测试集生成 `DataGenerator.py`

import datetime
import json
import os
import re
from langchain_openai import ChatOpenAI  
from dotenv import load_dotenv
from crewai import Agent, Task, Process, Crew
load_dotenv(override=True)

# model config  
QWEN_MODEL_NAME = os.getenv("MODEL_NAME")
QWEN_API_BASE = os.getenv("OPENAI_API_BASE")
QWEN_API_KEY = os.getenv("OPENAI_API_KEY")
# EMBED_MODEL_NAME = os.getenv("EMBED_MODEL_NAME")
Temperature = os.getenv("Temperature")
max_tokens = os.getenv("max_tokens")

qwen = ChatOpenAI(
			model=QWEN_MODEL_NAME,
			openai_api_base=QWEN_API_BASE,
			openai_api_key=QWEN_API_KEY,
			temperature=Temperature,
			max_tokens=max_tokens,
			streaming=False,
            timeout=60
		)

log_generator = Agent(
    role="Log Info Generator",
    
    goal="Generate the same format of given log, make them belong to the same source.",

    backstory="""You are an experienced expert for log extraction and log generation, through scan log and extract keywords from given records, 
    can get enough features of log and generate target log info. You can return clean log info with the same structure and different contexts 
    and make the given logs and generated log from the same source.""",
    llm=qwen,
)

data_generation_task = Task(
    description= """You are given a log: {log}, you should generate a log with the same format and different context and belong to the same source.
    For example, your given example is:
    {{
        "logId": 4,
        "logText": "<21>Aug 12 08:11:56 soc-32 sshd[33101]: pam_unix(sshd:session): session closed for user root",
        "logField": [
            {{
                "key": "",
                "value": "Aug 12 08:11:56"
            }},
            {{
                "key": "",
                "value": "soc-32"
            }},
            {{
                "key": "",
                "value": "sshd"
            }},
            {{
                "key": "",
                "value": "33101"
            }}
        ]
    }},
    Your generated example is as follows:
    {{
        "logId": 4,
        "logText": "<21>Feb 12 23:11:44 cxx-Legion sshd[123456]: pam_unix(sshd:session): session closed for user cxx",
        "logField": [
            {{
                "key": "",
                "value": "Feb 12 23:11:44"
            }},
            {{
                "key": "",
                "value": "cxx-Legion"
            }},
            {{
                "key": "",
                "value": "sshd"
            }},
            {{
                "key": "",
                "value": "123456"
            }}
        ]
    }},
    Your generated result should only include the whole log  without any explanation or other texts.
    """,
    agent=log_generator,
    expected_output=
    """
    {
        "logId": ??,
        "logText": "????",
        "logField": [
            {
                "key": "",
                "value": "???"
            },
            {
                "key": "",
                "value": "???"
            },
            {
                "key": "",
                "value": "???"
            },
            {
                "key": "",
                "value": "???"
            },
            {
                "key": "",
                "value": "???"
            },
        ]
    }
    """
)

def get_generated_log(text):
    valid_json = text.replace("'", "\"")
    valid_json = text.replace("True", "true")
    try:
        data = json.loads(valid_json)
        return data
    except json.JSONDecodeError as e:   
        print(f"UnHandled JSON: {e}")
        try: 
            data = auto_escape_json(valid_json)
            return data
        except json.JSONDecodeError as e:   
            print(f"Error decoding JSON: {e}")
        

def auto_escape_json(json_str):
    try:
        # 使用正则表达式匹配 JSON 数据
        # 提取 logId, logText 和 logField
        # 假设 JSON 数据的结构是固定的
        # 解析 logId
        log_id_match = re.search(r'"logId":\s*(\d+),', json_str)
        log_id = int(log_id_match.group(1)) if log_id_match else None

        # 解析 logText
        log_text_match = re.search(r'"logText":\s*"([^"]+)",', json_str)
        log_text = log_text_match.group(1).replace('"', '\\"') if log_text_match else None

        # 解析 logField
        log_field_match = re.search(r'"logField":\s*(\[[\s\S]*?\])', json_str)
        log_field_json = log_field_match.group(1) if log_field_match else None

        # 解析 logField
        log_field = json.loads(log_field_json) if log_field_json else []
        for field in log_field:
            if hasattr(field, 'get'):
                field_value = field.get('value')
                if field_value:
                    field['value'] = field_value.replace('"', '\\"')

        # 重新生成 JSON 数据
        data = {
            "logId": log_id,
            "logText": log_text,
            "logField": log_field
        }
        return json.dumps(data, ensure_ascii=False, indent=4)
    except Exception as e:
        raise ValueError(f"无法自动修复 JSON 格式: {e}")

def generate_log_fileName():
    """
    根据当前时间生成日志文件路径
    Returns:
        str: 完整的日志文件路径
    """
    # 日志目录，根据自己项目修改
    log_dir = "src/LogParserX/log"  
    os.makedirs(log_dir, exist_ok=True)
    # 生成精确到秒的时间戳
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    # 返回完整日志文件路径
    return os.path.join(log_dir, f'crewai_{timestamp}.log')

def run(test_data, output_file):
    generated_list = []
    for item in test_data:
        single_crew = Crew(
            agents=[log_generator],
            tasks=[data_generation_task],
            process=Process.sequential,
            verbose=True,
            output_log_file=generate_log_fileName()
        )
        inputs = {
            "log": f"{item}",
        }
        result = single_crew.kickoff(inputs=inputs)
        print("C")
        print(40* "#")
        print(result)
        print(40* "#")
        # res = get_generated_log(str(result))
        res = str(result)
        generated_list.append(res)
    # print(generated_list)
    # with open(output_file, 'w', encoding='utf-8') as f:
    #     json.dump(generated_list, f, indent=4, ensure_ascii=False) 
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write('\n'.join(generated_list))
    print(f"Generated data saved to {output_file}!")
    # with open(output_file, 'w', encoding='utf-8') as f:
    #     f.write('\n'.join(generated_list))
    # print(f"Generated data saved to {output_file}!")

def launcher(s, e, log_path):
    json_data = json.load(open(log_path, "r", encoding="utf-8"))
    test_data = json_data[s:e]
    print(len(test_data))
    run(test_data=test_data, output_file="data/generated_data/class_4.txt")

if __name__ == '__main__':
    launcher(0, 100, "data/classified_data/class_4.json")

智能体学习 `MergeRegexController.py`

import datetime
import json
import os
from langchain_openai import ChatOpenAI  
from dotenv import load_dotenv
from crewai import Agent, Task, Process, Crew
load_dotenv(override=True)

call_logs = []
log_id_counter = 0

QWEN_MODEL_NAME = os.getenv("MODEL_NAME")
QWEN_API_BASE = os.getenv("OPENAI_API_BASE")
QWEN_API_KEY = os.getenv("OPENAI_API_KEY")
# EMBED_MODEL_NAME = os.getenv("EMBED_MODEL_NAME")
Temperature = os.getenv("Temperature")
max_tokens = os.getenv("max_tokens")
# print(f"QWEN_MODEL_NAME: {QWEN_MODEL_NAME}, QWEN_API_BASE: {QWEN_API_BASE}, QWEN_API_KEY: {QWEN_API_KEY}, Temperature: {Temperature}, max_tokens: {max_tokens}")
qwen = ChatOpenAI(
			model=QWEN_MODEL_NAME,
			openai_api_base=QWEN_API_BASE,
			openai_api_key=QWEN_API_KEY,
			temperature=Temperature,
			max_tokens=max_tokens,
			streaming=False,
            timeout=60
		)

pattern_checker = Agent(
    
    role="Regex Pattern Checker",
    
    goal="Check if the regular expression pattern is correct and precise for given logText and logField data",

    backstory="""You are a regular expression pattern checker with experience in regular expressions.
    You can check if the regular expression pattern is correct and precise for given logText and logField data.
    Correct and precise regular expression patterns should be applied to logText and get the same results as logField. 
    Try to make your regular expression pattern as precise as possible to cover all possible conditions as enough as possible.
    You can use any Python libraries and modules and check the correctness of your regular expression patterns through execution them.""",
    allow_code_execution=True,
    llm=qwen,
    memory=True,
)

pattern_check_task = Task(
    description= """Check if the regular expression pattern is correct and precise for given logText and logField data.
    Your logText: {logText}, Your logField: {logField}, Your pattern: {pattern}, your pattern should be correct and precise to match to the logText and get results as logField.
    Pay attention to the key-value pairs, the key and value should all come from the logText, allow key to be empty, but value should not be empty.
    Your pattern should be correct and precise to match to the logText and get results as logField (cover more items as possible).
    Here is an example of a regular expression pattern.
    You can reason step by step instead of completing only one regular expression for all conditions.
    Your logText: "<164>Nov  5 2021 11:34:18+08:00 ME60-1 %%01BRASAM/4/hwAllocUserIPFailAlarm (t):VS=Admin-VS-CID=0x81d80420-OID=1.3.6.1.4.1.2011.6.8.2.2.0.3;Fail to alloc IP address from domain. (DomainNo.=72,DomainName=vlan3260)"
    Your logField: 
    [
        {{
            "key": "",
            "value": "Nov  5 2021 11:34:18+08:00"
        }},
        {{
            "key": "",
            "value": "ME60-1"
        }},
        {{
            "key": "VS",
            "value": "Admin"
        }},
        {{
            "key": "VS-CID",
            "value": "0x81d80420"
        }},
        {{
            "key": "OID",
            "value": "1.3.6.1.4.1.2011.6.8.2.2.0.3"
        }},
        {{
            "key": "DomainNo.",
            "value": "72"
        }},
        {{
            "key": "DomainName",
            "value": "vlan3260"
        }}
    ]
    """,
    agent=pattern_checker,
    expected_output=
    """
    Optimized Pattern:
    date_p = r"\b[A-Za-z]{{3}}\s{1,2}\d{1,2}\s\d{4}\s\d{2}:\d{2}:\d{2}\b"
    date_p_ = r"\b([A-Za-z]+ \d{1,2} \d{4} \d{2}:\d{2}:\d{2})\b"
    date_p_2 = r"([A-Za-z]{3})\s+ (\d{1,2})\s+(\d{4})\s+(\d{2}):(\d{2}):(\d{2})([+-]\d{2}):(\d{2})"
    date_p_3 = r"(\d{4}-\d{1,2}-\d{1,2} \d{2}:\d{2}:\d{2}(?:[+-]\d{2}:\d{2})?)"
    Optimized Reasons:
    - This regex can face some false positives, such as "Nov 5 2021 11:34:18+08:00"
    - Fix some unmatched conditions, such as "Nov  5 2021 11:34:18+08:00", and why use optimized pattern can solve this problem.
    - This regex can face some false positives, such as "Nov 5 2021 11:34:18+08:00", ...
    ...
    Optimized Rate:
    Compared to the original pattern, the optimized pattern can cover X%, except for some conditions: XXX.
    """, 
    output_file="{output_file_p}",
)

code_generator = Agent(
    role="Regex Python Code Generator",
    
    goal="Generate precise regular expressions codes with Python",

    backstory="""You are a Python code generator with experience in regular expressions.
    You can generate corresponding python codes for regular expressions from labeled data.
    With given labeled data and standard answers, your generated codes can be semantical and precise.
    You are allowed to use any Python libraries and modules and check the correctness of your generated codes through execution them.""",
    llm=qwen,
    allow_code_execution=True,
    memory = True,
)

code_generation_task = Task(
    description="""Generate code based on verification results:
    Log sample: {logText}
    Target field: {logField},
    Python Code Template: {python_code},
    Read Report from Pattern Checker, and use the optimized pattern to generate Python codes.
    If the optimized pattern is not correct, you can modify it and re-run the code generation task.
    Execute the generated codes and check if the results match the logField.
    You should generate codes in Python that can match the logText to the logField with the verified pattern.
    You had better return clear codes instead of markdown format with starting and ending quotes.
    For example: ```python```""", # Explicitly reference upstream output
    agent=code_generator,
    context=[pattern_check_task], # Establish dependency chain
    expected_output =
    """Python function containing the following elements:
    - Use the optimized patterns
    - Complete all functions and variables with proper values
    - The codes can be executed and return the expected results
    - Use python format instead of markdown format for better readability
    - Only python codes are allowed, no markdown format is allowed

    For example(clean codes), your codes should be **strict** like this, main function only change log_text contents:
    import re
    import json
    from functools import lru_cache
    @lru_cache(maxsize=100)
    def _compile_regex(pattern: str, flags: int = 0) -> re.Pattern:
        return re.compile(pattern, flags)
    # use optimized pattern
    patterns = {
        "pattern_name": "",
        "date": r"\b[A-Za-z]{{3}}\s{1,2}\d{1,2}\s\d{2}:\d{2}:\d{2}\b",
        "hostname": r"(?<=:\d{2}) ([a-zA-Z0-9._-]+)",
        "pid": r"([a-zA-Z0-9_-]+)\[(\d+)\]",
        ...
    }
    # define functions like match_{pattern_name}
    def match_date(text):
        compiled_re = _compile_regex(patterns['date'])
        match = compiled_re.search(text)
        results = []
        if match:
            date = match.group(0)
            results.append({"key": "", "value": date})
            print("ISO Date Results:", results)
            return results
        return [] 
    # other functions
    ...
    def get_components(log_text):
        res = match_date(log_text)
        ...
        return res

    if __name__ == '__main__':
        log_text = {{logText}}
        res = get_components(log_text)
        json_data = json.dumps(res, ensure_ascii=False)
        print(json_data)
    """,
    output_file="{output_file}",
)


code_validater = Agent(
    role="Regex Python Code Validator",
    goal="""Validate the generated Python codes by executing them and checking the results, try to find ismatched context and give analysis for codes.
    Try to increase the macth rate of original codes by modifying the codes and re-run the validation task.""",
    backstory="""You are a Python code validator with experience in regular expressions. 
    You can validate the generated Python codes by executing them and checking the results.
    You can find ismatched context and give analysis for codes.
    You can modify the codes and re-run the validation task to increase the macth rate of original codes.""",
    llm=qwen,
    allow_code_execution=True,
    memory = True,
)

code_validation_task = Task(
    description="""Validate the generated Python codes by executing them and checking the results.
    You should execute the generated codes and check if the results match the logField.
    Pay attention to the key-value pairs, the key and value should all come from the logText, allow key to be empty, but value should not be empty.
    Do not try to assign type for key when key does not occur in logText!
    For example:
    logText = "2023-10-10 10:10:10 ABC ERROR: This is an error message"
    logField = [{{"key": "", "value": "2023-10-10 10:10:10"}}, {{"key": "", "value": "ABC"}}, {{"key": "", "value": "ERROR"}}]
    In this logField, three key is empty because they are not in logText. Date, hostname and level these types are pattern types.
    Your pattern should be correct and precise to match to the logText and get results as logField (cover more items as possible). 
    If the results do not match, you should modify the codes and re-run the validation task.
    If the results match, you can submit the codes to the code review team for review.
    """,
    agent=code_validater,
    context=[code_generation_task],
    expected_output="""A markdown report containing the following elements:
    - The generated codes are executed and return the expected results
    - The results match the logField 
    - The matche rate and comparison with the original codes are provided (must completely match, include key and value)
    Like this format:
    # Optimized Codes Analysis
    ## Optimized Codes
    ```python
    ...
    ```
    ## Output
    ```txt
    {"key": "", "value": ""}
    ```
    ## Comparison
    Optimized codes Matched Rate: X%
    Original codes Matched Rate: Y%
    In Optimized codes, [{"key": "", "value": ""},...] are matched, while ... are unmatched.
    In Original codes, [{"key": "", "value": ""},...] are matched, while ... are unmatched.
    """,
    output_file="{output_file_md}",
    )

def get_str(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        res = f.read()
    return res


def record_log(file_path, st):
    with open(file_path, "a", encoding="utf-8") as f:
        for i in st:
            f.write(str(i))
            f.write("\n")
    print(f"{file_path} recorded!")


def add_log(step, id, inputs, outputs):
    item = {
        "step": step,
        "logId": id,
        "inputs": inputs,
        "outputs": outputs
    }
    return item

def generate_log_fileName():
    """
    根据当前时间生成日志文件路径
    Returns:
        str: 完整的日志文件路径
    """
    # 日志目录，根据自己项目修改
    log_dir = "src/LogParserX/log"  
    os.makedirs(log_dir, exist_ok=True)
    # 生成精确到秒的时间戳
    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    # 返回完整日志文件路径
    return os.path.join(log_dir, f'crewai_{timestamp}.log')

def run(test_data, pattern, python_code, output_file, output_file_p, output_file_md):
    record_list = []
    step = 0
    for item in test_data:
        single_crew = Crew(
            agents=[pattern_checker, code_generator, code_validater],
            tasks=[pattern_check_task, code_generation_task, code_validation_task],
            process=Process.sequential,
            verbose=True,
            output_log_file=generate_log_fileName()
        )
        log_id = item["logId"]
        log_text = item["logText"]
        log_field = item["logField"]
        inputs = {
            "logText": f"{log_text}",
            "logField": f"{log_field}",
            "pattern": f"{pattern}",
            "python_code": f"{python_code}",
            "output_file": output_file.format(log_id),
            "output_file_p": output_file_p.format(log_id),
            "output_file_md": output_file_md.format(log_id),
        }
        result = single_crew.kickoff(inputs=inputs)

        print(40*"#")
        print(result)
        print(40*"#")
        
        item = add_log(step, log_id, inputs, str(result))
        step += 1
        record_list.append(item)

    record_log("src/LogParserX/trace/trace_{}.txt".
               format(datetime.datetime.now().strftime("%Y%m%d%H%M%S")), record_list)
    # print(record_list)
    
def launcher(S, E, class_path):
    python_tool = r"src/LogParserX/knowledge/faster_tool.py"
    python_pattern = r"src/LogParserX/knowledge/pattern.py"
    output_file = r"src/LogParserX/output/gen/codes/output_{}.py"
    output_file_p = r"src/LogParserX/output/gen/patterns/pattern_{}.md"
    output_file_md = r"src/LogParserX/output/gen/reports/report_{}.md"
    with open(python_tool, "r", encoding="utf-8")as f:
        python_code = f.read()
    with open(python_pattern, "r", encoding="utf-8")as f:
        pattern = f.read()
    data = json.load(open(class_path, "r", encoding="utf-8"))
    test_data= data[S:E]
    run(test_data, pattern, python_code, output_file, output_file_p, output_file_md)
    print("Done!")

if __name__ == '__main__':
    class_path = r"data/classified_data/class_2.json"
    launcher(S=0,E=50, class_path=class_path)

运行python代码 `Executor.py`

import io
import json
import re
import sys
import time

def get_clear_python_code(file_path, output_path):
    """
    从Markdown格式的文件中提取Python代码并保存为.py文件。
    
    参数:
        file_path (str): 输入的Markdown文件路径。
    """
    # 打开并读取Markdown文件
    with open(file_path, 'r', encoding='utf-8') as f:
        code = f.readlines()
    
    code = code[1:-1]

    # 保存为.py文件
    out = output_path 
    with open(out, 'w', encoding='utf-8') as f:
        f.write(''.join(code))  # 用空行分隔代码块

    print(f"Python代码已提取并保存为 {out}")

def add_main(logtext, path):
    main_code = f"""if __name__ == '__main__':
    log_text = "{logtext}"
    result = get_components(log_text)
    print(result)
    """
    with open(path, 'a', encoding='utf-8') as f:
        f.write(main_code)
    print(f"添加main函数成功，并保存为 {path}")
import subprocess
import sys
from pathlib import Path
from typing import Dict, Optional

def execute_python_code(file_path: str, timeout: int = 5) -> Dict[str, Optional[str]]:
    """
    执行Python代码并捕获输出及执行时间
    
    :param file_path: Python文件路径
    :param timeout: 超时时间（秒）
    :return: 包含执行结果和时间的字典 {
        "output": 标准输出,
        "error": 错误信息,
        "return_code": 返回码,
        "execution_time": 执行时间(秒)
    }
    """
    result = {
        "output": None,
        "error": None,
        "return_code": None,
        "execution_time": None
    }
    
    try:
        # 验证文件
        path = Path(file_path)
        if not path.exists():
            raise FileNotFoundError(f"文件不存在: {file_path}")
        if path.suffix.lower() != '.py':
            raise ValueError("仅支持.py文件")

        # 记录开始时间
        start_time = time.perf_counter()

        # 执行代码
        process = subprocess.run(
            [sys.executable, str(path)],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout,
            encoding='utf-8',
            errors='ignore'
        )

        # 计算耗时
        end_time = time.perf_counter()
        elapsed = round(end_time - start_time, 3)  # 保留3位小数

        result.update({
            "output": process.stdout.strip(),
            "error": process.stderr.strip(),
            "return_code": process.returncode,
            "execution_time": elapsed
        })
        

    except subprocess.TimeoutExpired as e:
        # 计算实际超时耗时
        elapsed = round(time.perf_counter() - start_time, 3)
        result.update({
            "error": f"执行超时（设置{timeout}秒，实际耗时{elapsed}秒）",
            "execution_time": elapsed
        })
    except Exception as e:
        # 计算异常发生时的耗时
        elapsed = round(time.perf_counter() - start_time, 3) if 'start_time' in locals() else 0.0
        result.update({
            "error": f"执行失败: {str(e)}",
            "execution_time": elapsed
        })
    
    return result

def get_all_files(dir_path: str, suffix: str = None) -> list:
    """
    获取目录下所有文件路径
    
    :param dir_path: 目录路径
    :param suffix: 文件后缀
    :return: 文件路径列表
    """
    path = Path(dir_path)
    if not path.exists():
        raise FileNotFoundError(f"目录不存在: {dir_path}")
    if not path.is_dir():
        raise NotADirectoryError(f"不是目录: {dir_path}")

    files = []
    for p in path.iterdir():
        if p.is_file():
            if suffix is None or p.suffix.lower() == suffix.lower():
                files.append(str(p))
        elif p.is_dir():
            files.extend(get_all_files(str(p), suffix))
    # return files
    f = []
    for i in files:
        i = i.replace("\\", "/")
        # i = i.replace("output", "update")
        f.append(i)
    return f

测试监测 `RegexChecker.py`

import ast
import json
import os
import sys
from Executor import execute_python_code
import re
from pathlib import Path

def extract_python_code_from_md(md_content: str) -> list:
    """使用正则表达式提取所有Python代码块"""
    pattern = r'```python\n(.*?)```'
    matches = re.findall(pattern, md_content, re.DOTALL)
    return [match.strip() for match in matches]

def get_all_reports(dir_path: str) -> list:
    """获取指定目录下的所有报告
    按文件名中的数字序号排序（report_0.md, report_1.md...）
    Args:
        dir_path: 要扫描的目录路径

    Returns:
        list: 按数字排序的完整文件路径列表
        
    Example:
        >>> get_all_reports("./reports")
        [
            '/path/report_0.md',
            '/path/report_1.md',
            '/path/report_2.md'
        ]
    """
    file_list = []
    pattern = re.compile(r'^report_(\d+)\.md$')  # 精确匹配文件名格式
    try:
        for filename in os.listdir(dir_path):
            # 组合完整路径并验证文件类型
            full_path = os.path.join(dir_path, filename)
            if not os.path.isfile(full_path):
                continue
            
            # 匹配文件名格式
            match = pattern.match(filename)
            if match:
                # 提取数字并存储元组（数字转为int类型用于排序）
                file_list.append((
                    int(match.group(1)),  # 数字部分
                    full_path            # 完整路径
                ))
        file_list.sort(key=lambda x: x[0])
        rename_lst = [item[1].replace("report_", "opt_") for item in file_list]
        rename_lst = [item.replace(".md", ".py") for item in rename_lst]
        rename_lst = [item.replace("gen/reports", "opt") for item in rename_lst]

        return [item[1] for item in file_list], rename_lst
    
    except FileNotFoundError:
        print(f"错误：目录不存在 {dir_path}")
        return [], []
    except PermissionError:
        print(f"错误：无权限访问目录 {dir_path}")
        return [], []
    except Exception as e:
        print(f"未知错误：{str(e)}")
        return [], []

class ExtractedCodes:
    def __init__(self):
        self.main_function = []
        self.libs = []
        self.func_fields = []
        self.param_fields = []
        self.testing_main_function = []
    
    def get_libs(self, code: str) -> list:
        """提取所有库导入代码"""
        pattern = r"import\s+([\w\.]+)"
        matches = re.findall(pattern, code)
        return matches

    def get_main_function(self, code: str) -> str:
        """提取main函数代码"""
        pattern = r"if __name__\s*==\s*['\"]__main__['\"]\s*:\s*\n((?:^\s+.*\n?)*)"
        match = re.search(pattern, code, flags=re.MULTILINE | re.IGNORECASE)
        if match:
            main_block = match.group(0)
            return main_block.strip()
        else:
            return ""
            
    def get_param_field(self, code: str) -> dict:
        # 找到 patterns = { ... } 中的内容，处理嵌套情况
        # pattern = r"patterns\s*=\s*\{(.*?)\}"
        depth = 0
        start_index = code.find("{")
        end_index = start_index
        for i in range(start_index + 1, len(code)):
            if code[i] == "{":
                depth += 1
            elif code[i] == "}":
                if depth == 0:
                    end_index = i
                    break
                depth -= 1

        patterns_str = code[start_index + 1:end_index]
        results = {}
        if patterns_str:
            key_value_pattern = r"'([^']+)'\s*:\s*r'([^']+)'"
            key_value_matches = re.findall(key_value_pattern, patterns_str)    
            for key, value in key_value_matches:
                results[key] = value
        return results

    def extract_functions(self, code_str: str) -> dict:
        """
        从Python代码字符串中提取特定格式函数，返回字典结构
        参数：
        code_str : 需要解析的Python代码字符串
        返回：
        {
            "_compile_regex": "@lru_cache(...)\ndef _compile_regex(...): ...",
            "extract_date": "@lru_cache(...)\ndef extract_date(...): ...",
            ...
        }
        """
        code_str = re.sub(r'\"\"\".*?\"\"\"', '', code_str, flags=re.DOTALL)  # 去除文档字符串
        code_str = re.sub(r'#.*', '', code_str)  # 去除单行注释
        # 使用正则表达式匹配目标函数结构
        # 匹配装饰器部分（如果有），函数定义以及函数体
        pattern = re.compile(
            r'(@lru_cache\(.*?\)\s+)?'  # 匹配装饰器部分（如果有）
            r'def\s+(?!get_components\b)(\w+)'  # 排除get_components的其他函数
            r'\(.*?\):'  # 匹配函数头，包括参数部分
            r'([\s\S]+?)(?=\n\s*def\s+|\n@lru_cache\(.*?\)\s*|$)',  # 捕获函数体，直到下一个函数定义或装饰器
            flags=re.DOTALL
        )
        functions = {}
        # 查找所有匹配项
        for match in re.finditer(pattern, code_str):
            decorator, func_name, body = match.groups()
            # 构造函数定义
            if decorator:
                func_def = f"{decorator}def {func_name}{body}"
            else:
                func_def = f"def {func_name}{body}"
            functions[func_name] = func_def.strip()
        return functions

    def rewrite_codes(self, log_text: str, code_str: str) -> str:
        self.main_function = self.get_main_function(code_str)
        if self.main_function:
            # 双重转义处理：{}和反斜杠
            escaped_log = log_text.replace('{', '{{').replace('}', '}}').replace('\\', '\\\\')
            
            # 动态匹配所有引号类型的正则
            pattern = r'(log_text\s*=\s*)(["\'])(.*?)(?<!\\)\2'
            
            # 构造替换模板
            replacement = rf'\1f"""{escaped_log}"""'
            
            new_main_function = re.sub(
                pattern,
                replacement,
                self.main_function,
                flags=re.DOTALL
            )
            return code_str.replace(self.main_function, new_main_function)
        return code_str

def is_perfect_match(original, test):
    """完全匹配：所有字段的key和value都正确且数量一致"""
    if len(original) != len(test):
        return False  # 字段数量不一致直接判定不匹配
    
    original_dict = {f['key']: f['value'] for f in original}
    test_dict = {f['key']: f['value'] for f in test}
    return original_dict == test_dict  # 字典比对自动校验key-value对

def has_any_match(original, test):
    """至少有一个字段的key和value都正确"""

    original_set = {(f['key'], f['value']) for f in original}
    test_set = {(f['key'], f['value']) for f in test}
    return len(original_set & test_set) > 0  # 集合交集判断

def calculate_coverage(original, testing):
    if original and testing:
        original_values = {item["value"] for item in original}
        testing_values = {item["value"] for item in testing}
        common = original_values & testing_values
        c = len(common) / len(original_values)
        return round(c, 2)
    else:
        return 0.0
    
def get_testing_result(opt_path, log_text, opt_code, obj):
    # log_text = "<21>Aug 13 09:08:09 soc-32 ntpdate[187386]: adjust time server 120.25.115.20 offset 0.002019 sec" 
    new_code = obj.rewrite_codes(log_text, opt_code)
    # print(new_code)
    new_code_path = opt_path.replace("opt", "test")
    with open(new_code_path, "w", encoding="utf-8") as f:
        f.write(new_code)
    result = execute_python_code(new_code_path)
    return result
                     

def get_json_dict(text):
    # 如果是json格式的标准字符串即可，直接打印出现单引号不好处理

    # 这里全是单引号需要修复的bug...
    # 将文本中的单引号替换为双引号 但是如果是字符串中的单引号则不替换
    # print(f"init data: {text}\n")
    # valid_json = text.replace("'", "\"")
    # # 去掉连续""
    # valid_json = re.sub(r'"value":\s*""(.*?)""', r'"value": "\1"', valid_json)
    # print(f"valid data: {valid_json}\n")
    # # 非法字符 None True
    # valid_json = valid_json.replace("None", "null").replace("True", "true")
    # data = json.loads(valid_json)
    # # 返回转换后的JSON字典
    # print(type(data))
    data = json.loads(text)
    return data

class TeeStream:
    # 初始化函数，传入文件路径和标准输出流
    def __init__(self, file_path, stdout):
        # 打开文件，以写入模式，编码为utf-8
        self.file = open(file_path, 'w', encoding='utf-8')
        # 保存标准输出流
        self.stdout = stdout

    # 写入函数，传入要写入的文本
    def write(self, text):
        # 将文本写入标准输出流
        self.stdout.write(text)
        # 将文本写入文件
        self.file.write(text)

    # 刷新函数，刷新标准输出流和文件
    def flush(self):
        # 刷新标准输出流
        self.stdout.flush()
        # 刷新文件
        self.file.flush()

    # 关闭函数，关闭文件
    def close(self):
        self.file.close()
        
def TestUnit(class_dataset_path, output_dir, tag):
    tee = TeeStream(f"src/LogParserX/output/{tag}.txt", sys.stdout)
    original_stdout = sys.stdout
    sys.stdout = tee

    with open(class_dataset_path, "r", encoding="utf-8") as f:
        data_set = json.load(f)
    testing_data = data_set[:50]
    scores = []
    match_rate = 0.0
    perfect_match_rate = 0.0
    # report -> /gen/report_0.md, rename -> /gen/opt_0.py, new_code -> /test/opt_0.py
    report_list, rename_list = get_all_reports(output_dir)
    for i, j in zip(report_list, rename_list):
        code_path = Path(i).read_text()
        codes = extract_python_code_from_md(code_path)
        with open(j, "w", encoding="utf-8") as f:
            f.write(codes[0])
        idx = i.split("\\")[-1].split("_")[1].replace(".md", "")
        idx = int(idx) % 100
        print(idx)
        score = 0.0
        testing_id = testing_data[idx]["logId"]
        testing_logText = testing_data[idx]["logText"]
        testing_logField = testing_data[idx]["logField"]
        obj = ExtractedCodes()
        gen_result = get_testing_result(j, testing_logText, codes[0], obj)
        gen_result = gen_result["output"]
        # print(f"gen_result = {gen_result}\n")
        gen_result = get_json_dict(gen_result)
        # print(gen_result)
        # 验证结果
        print(f"Testing ID: {testing_id}:")
        print(f"Testing LogText: {testing_logText}")
        print(f"Testing LogField: {testing_logField}")
        print(f"Generated LogField: {gen_result}")
        if is_perfect_match(testing_logField, gen_result):
            print(f"完全匹配！")
            score = 1.0
            perfect_match_rate += 1.0
            match_rate += 1.0
        elif has_any_match(testing_logField, gen_result):
            coverage = calculate_coverage(testing_logField, gen_result)
            score = coverage
            match_rate += 1.0
            print(f"至少有一个匹配！full_coverage: {coverage*100:.2f}%")
        else:
            print(f"完全不匹配！")

        scores.append(score)
    bad_len = len([i for i in scores if i < 0.7])
    official_score = 0.4 * match_rate / len(rename_list) + 0.6 * perfect_match_rate / len(rename_list)
    print(f"{70*'='}")
    print(f"My Scores (1 for full): {scores}")
    print(f"My Average Score: {round(sum(scores) / len(scores), 2)}")
    print(f"Match Rate:  {match_rate / len(rename_list)}")
    print(f"Perfect Match Rate: {perfect_match_rate / len(rename_list)}")
    print(f"Official Score (1 for full): {official_score}")
    print(f"Bad Case: {bad_len}")

    sys.stdout = original_stdout


def MultiTestUnit(class_dataset_path: str, output_dir: str):
    tee = TeeStream("src/LogParserX/output/result_multi.txt", sys.stdout)
    original_stdout = sys.stdout
    sys.stdout = tee

    with open(class_dataset_path, "r", encoding="utf-8") as f:
            data_set = json.load(f)
    testing_data = data_set[:50]
    # scores = []
    # match_rate = 0.0
    # perfect_match_rate = 0.0
    # report -> /gen/report_0.md, rename -> /gen/opt_0.py, new_code -> /test/opt_0.py
    report_list, rename_list = get_all_reports(output_dir)
    k = 0
    for i, j in zip(report_list, rename_list):
        code_path = Path(i).read_text()
        codes = extract_python_code_from_md(code_path)
        with open(j, "w", encoding="utf-8") as f:
            f.write(codes[0])
        scores = []
        match_rate = 0.0
        perfect_match_rate = 0.0
        for idx in range(0, len(rename_list)):
            score = 0.0
            testing_id = testing_data[idx]["logId"]
            testing_logText = testing_data[idx]["logText"]
            testing_logField = testing_data[idx]["logField"]
            obj = ExtractedCodes()
            gen_result = get_testing_result(j, testing_logText, codes[0], obj)
            gen_result = gen_result["output"]
            gen_result = get_json_dict(gen_result)
            # 验证结果
            # print(f"Testing ID: {testing_id}:")
            # print(f"Testing LogText: {testing_logText}")
            # print(f"Testing LogField: {testing_logField}")
            # print(f"Generated LogField: {gen_result}")
            if is_perfect_match(testing_logField, gen_result):
                # print(f"完全匹配！")
                score = 1.0
                perfect_match_rate += 1.0
                match_rate += 1.0
            elif has_any_match(testing_logField, gen_result):
                coverage = calculate_coverage(testing_logField, gen_result)
                score = coverage
                match_rate += 1.0
                # print(f"至少有一个匹配！full_coverage: {coverage:.2f}%")
            else:
                # print(f"完全不匹配！")
                pass
            scores.append(score)
        print(f"Index: {k} {70*'='}")
        official_score = 0.4 * match_rate / len(rename_list) + 0.6 * perfect_match_rate / len(rename_list)
        print(f"My Scores (1 for full): {scores}")
        print(f"My Average Score: {round(sum(scores) / len(scores), 2)}")
        print(f"Match Rate:  {match_rate / len(rename_list)}")
        print(f"Perfect Match Rate: {perfect_match_rate / len(rename_list)}")
        print(f"Official Score (1 for full): {official_score}")
        k+=1

    sys.stdout = original_stdout



def Selector(num, class_t, output_dir):
    if num == 1:    
        TestUnit(class_dataset_path=f"data/classified_data/{class_t}.json", output_dir=output_dir, tag="result_ori")
    elif num == 2:
        TestUnit(class_dataset_path=f"data/generated_data/{class_t}.json", output_dir=output_dir, tag="result")
    elif num == 3:
        MultiTestUnit(class_dataset_path=f"data/classified_data/{class_t}.json", output_dir=output_dir)
        MultiTestUnit(class_dataset_path=f"data/generated_data/{class_t}.json", output_dir=output_dir)

if __name__ == "__main__":
    # TestUnit
    class_dataset_path = "data/generated_data/class_2.json"
    # class_dataset_path = "data/classified_data/class_2.json"
    output_dir = "src/LogParserX/output/gen/reports"
    # TestUnit: for one code, testing corresponding log to see if it can match 1->1
    # TestUnit(class_dataset_path=class_dataset_path, output_dir=output_dir)
    # MultiTestUnit: for one code, testing num sample log to testing its coverage 1->N
    # MultiTestUnit(class_dataset_path=class_dataset_path, output_dir=output_dir)

    Selector(1, "class_2", output_dir)
    Selector(2, "class_2", output_dir)