YOLOv3 in Practice 5: Plotting P-R Curves with a Third-Party Tool


Hi everyone, I'm xiao p. In this post I will show you how to plot P-R curves with a third-party script.
If you are interested in object detection, you are welcome to join the object detection QQ group 813221712 to discuss and exchange ideas; please read the group announcement before joining!
Note: this tutorial follows the KITTI evaluation protocol and plots three curves, one for each of the Easy, Moderate, and Hard difficulty levels, as in the example figure below. Your detection results also have to be adjusted to the KITTI format. Enough preamble, let's get to the point!
[Example figure: P-R curves for the Easy, Moderate, and Hard levels]
1. Download the script
Script link: https://github.com/MarvinTeichmann/KittiBox
2. What the script files are
From the repository root, go down into /submodules/KittiObjective2/.
This directory contains six items in total; the ones that matter here are:
1) The val_gt folder holds the ground-truth label files. Open 000000.txt as an example of the format:
Pedestrian 0.00 1 0.30 883.68 144.15 937.35 259.01 1.90 0.42 1.04 5.06 1.43 12.42 0.68
Pedestrian 0.00 2 0.29 873.70 152.10 933.44 256.07 1.87 0.50 0.90 5.42 1.50 13.43 0.67
Car 0.00 0 1.74 444.29 171.04 504.95 225.82 1.86 1.57 3.83 -4.95 1.83 26.64 1.55
Pedestrian 0.00 0 -0.39 649.28 168.10 664.61 206.40 1.78 0.53 0.95 2.20 1.57 34.08 -0.33
Car 0.98 0 2.42 0.00 217.12 85.92 374.00 1.50 1.46 3.70 -5.12 1.85 4.13 1.56
Pedestrian 0.00 0 2.02 240.35 190.31 268.02 261.61 1.54 0.57 0.41 -7.92 1.94 15.95 1.57
DontCare -1 -1 -10 0.00 226.06 88.58 373.00 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 567.39 173.95 574.86 190.60 -1 -1 -1 -1000 -1000 -1000 -10
DontCare -1 -1 -10 727.58 165.75 737.08 192.75 -1 -1 -1 -1000 -1000 -1000
This is the KITTI annotation standard: the first column is the object class, and columns 5-8 are the bounding box's top-left and bottom-right coordinates (left, top, right, bottom). The remaining fields can be looked up in the KITTI documentation; a small parsing sketch for this layout is given after this file list.
2) The val_pred folder holds the prediction results. Each image's predictions are saved as a .txt file whose name must match the corresponding file in val_gt: all detections for one image go into a single .txt file, one bounding box per line. Open 000011.txt as an example:
Car 0 1 0 442.462500 173.177083 507.150000 225.260417 0 0 0 0 0 0 0 0.995087
Car 0 1 0 1.293750 189.453125 97.031250 375.651042 0 0 0 0 0 0 0 0.983480
The first column is the object class, columns 2-4 can be filled with arbitrary values, columns 5-8 are the bounding box (same layout as above), columns 9-15 can also be arbitrary, and the last column is the detection confidence, i.e. the probability that the box contains an object. Note that each prediction line has one more column than a label line; a sketch that writes detections in this format is also given after the list.
3) makefile: run make against it to build the evaluation tool.
4) evaluate_object.cpp: the C++ source of the evaluation and plotting tool.
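
As promised above, here is a minimal Python sketch (my own illustration, not part of KittiBox) that reads one ground-truth label file in the layout of val_gt/000000.txt; it extracts only the class name and the 2D box from columns 5-8, and the path is just an example.

```python
def parse_kitti_labels(path):
    """Read one KITTI-style label file and return (class, left, top, right, bottom) tuples."""
    objects = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8:
                continue  # skip empty or malformed lines
            cls = fields[0]                                      # column 1: class name
            left, top, right, bottom = map(float, fields[4:8])   # columns 5-8: 2D bounding box
            objects.append((cls, left, top, right, bottom))
    return objects

if __name__ == "__main__":
    # example path matching the folder layout used in this tutorial
    for obj in parse_kitti_labels("val_gt/000000.txt"):
        print(obj)
```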
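
And a companion sketch for the prediction side: it writes one image's detections in the 16-column format described for val_pred. The (class, (left, top, right, bottom), confidence) tuple structure is a hypothetical stand-in for your own detector's output, and the placeholder columns are simply filled with zeros.

```python
def write_kitti_predictions(path, detections):
    """Write one image's detections in the 16-column KITTI-style prediction format."""
    with open(path, "w") as f:
        for cls, (left, top, right, bottom), score in detections:
            fields = (
                [cls, "0", "1", "0"]                                # columns 1-4: class + filler
                + [f"{v:.6f}" for v in (left, top, right, bottom)]  # columns 5-8: bounding box
                + ["0"] * 7                                         # columns 9-15: placeholders
                + [f"{score:.6f}"]                                  # column 16: confidence
            )
            f.write(" ".join(fields) + "\n")

if __name__ == "__main__":
    # hypothetical detector output for image 000011
    dets = [("Car", (442.4625, 173.177083, 507.15, 225.260417), 0.995087)]
    write_kitti_predictions("val_pred/000011.txt", dets)
```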
3. Install the required environment and test the script
In the /submodules/KittiObjective2/ directory, run make clean and then make. You will usually be prompted to install the gnuplot tool, which takes a single command: sudo apt-get install gnuplot
Then run ./evaluate_object2 val_pred val_gt and check for errors; it normally runs without any problem.
4. Notes on preparing the data
1) Put the label files in the val_gt folder.
2) Put the detection results in the val_pred folder.
3) The file contents must follow the formats described above.
4) File naming: names must be six-digit numbers, e.g. 000000.txt and 000001.txt, and each label file must have a prediction file with the same name. Renaming is easy to script; just follow the naming of the original files, as in the sketch below.
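
As mentioned in item 4), a short script is enough for the renaming. Below is a minimal sketch that copies files into fresh val_gt / val_pred folders under six-digit names; it assumes the original label and detection files share the same names, so that sorting keeps each pair aligned, and the source folder names are hypothetical.

```python
import os
import shutil

def copy_with_kitti_names(src_folder, dst_folder):
    """Copy every .txt file into dst_folder under six-digit names (000000.txt, ...), in sorted order."""
    os.makedirs(dst_folder, exist_ok=True)
    names = sorted(n for n in os.listdir(src_folder) if n.endswith(".txt"))
    for idx, name in enumerate(names):
        new_name = f"{idx:06d}.txt"
        shutil.copy(os.path.join(src_folder, name), os.path.join(dst_folder, new_name))
        print(f"{name} -> {new_name}")

if __name__ == "__main__":
    # "my_labels" and "my_detections" are hypothetical source folders; adjust to your layout
    copy_with_kitti_names("my_labels", "val_gt")
    copy_with_kitti_names("my_detections", "val_pred")
```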
5. Plot the curves
First, change the following constants in evaluate_object.cpp:
const int32_t N_MAXIMAGES = 7518;
const int32_t N_TESTIMAGES = 481;
These values are the number of samples in your evaluation set; adjust them to match your data.
Then run make again and execute: ./evaluate_object2 /path/to/prediction /path/to/gt
The generated curves are placed in the plot/ subfolder inside your prediction results folder.
6. Join the group
Writing all this up was no small effort, so please make good use of it, fellow traveler!

