Table.Group系列_第4参数为全局的情况下,利用第5参数进行分组汇总

本文详细描述了如何使用Excel和PowerQuery实现根据筛选条件动态统计部门或职位的功能,包括构建筛选器、设置组合框控件、使用公式获取索引和内容,以及进行分组汇总和数据合并的过程。

原始数据:

  • 部门与职位存在于同一列中
    在这里插入图片描述

实现功能:

  • 根据筛选条件,可对部门或职位进行统计汇总
  • 第一列列名根据筛选自动变更,显示当前统计的维度
    在这里插入图片描述

实现方式:

1. 构建筛选器内容

  • 在任意空白单元格内输入需要筛选的内容
    在这里插入图片描述

2. 插入"组合框"控件,并进行相应设置

  • 从开发工具选项卡中插入"组合框"控件
    在这里插入图片描述
  • 右键"组合框控件",选择"设置控件格式"
  • 数据源区域 选取 第1步输入的筛选内容(红框区域)
  • 单元格链接 选取任意空白单元格,用于返回筛选内容的位置索引(绿色区域)
    在这里插入图片描述

3. 设置公式,获取筛选器所选择的值

= INDEX(Z1:Z2,W1)
  • X1 单元格为设置公式的单元格,结果如图
    在这里插入图片描述

4. 对筛选索引值(W1),筛选内容(X1)定义名称

  • 选中要设置的单元格,选择公式栏下的"定义名称"
  • 在名称中输入相应名称即可(此例W1定义为"idx",X1定义为"dept")
    在这里插入图片描述

5. 导入原始表
在这里插入图片描述
6.获取筛选索引值

  • 由于powerquery中,数据索引是从0开始,故将原先W1中的值-1
维度索引 = Excel.CurrentWorkbook(){[Name="idx"]}[Content][Column1]{0} - 1

在这里插入图片描述
7. 获取筛选内容

dept = Excel.CurrentWorkbook(){[Name="dept"]}[Content][Column1]{0}

在这里插入图片描述
8. 分组汇总
8.1 按筛选索引分组(主逻辑)

分组汇总 = Table.Group(更改的类型,
        "部门职位",                           //按部门职位字段分组
        {"xx",each _},                       //汇总后不做任何操作
        1,                                   // 全局模式
        (x,y) => Comparer.OrdinalIgnoreCase(Text.Split(x,"_"){维度索引} , Text.Split(y,"_"){维度索引}))
  • x代表当前行,y代表当前行后面的行
  • 即对部门职位列按"_"进行分组,按维度索引提取对应内容,对比每行内容是否一致,以确定最终分组
  • 图中已按职位进行了相应分组

在这里插入图片描述
8.2 分组汇总细节构建

分组汇总 = Table.Group(更改的类型,
        "部门职位",
        {"xx",each 
            Table.ReorderColumns(                                                       //数据位置重排
                #table({"部门职位","人数",dept},{{null,null,Text.Split(_[部门职位]{0},"_"){维度索引}}})  //拼接部门行
                & Table.AddColumn(_,dept,each null)                                    //增加统计维度列
                & #table({"部门职位","人数",dept},{{"小计",List.Sum(_[人数]),null}}),     //拼接小计行
                {dept,"部门职位","人数"}
            )
        },
        1,
        (x,y) => Comparer.OrdinalIgnoreCase(Text.Split(x,"_"){维度索引} , Text.Split(y,"_"){维度索引}))[xx]  //深化处理后的列
  • 通过字段添加,表合并,数据位置重排等方式构建细节汇总项目
    在这里插入图片描述
    9. 合并表数据
合并表 = Table.Combine(分组汇总)

在这里插入图片描述
10.发布上载至页面

data: lv_index type i, lv_plnnr type plnnr, lv_count type i. data: lv_conut_row type i. data: lalv_plnal type n length 2, lalv_plna2 type n length 2. if p_cb1 = 'X'. perform frm_merge_plan tables s_wer. "汇总扩展工厂 else. perform frm_merge_plan tables s_plan. endif. perform frm_expa_inspection. "处理组计数器问题 获取已存在的检验计划的最后一个计数器编号 move-corresponding gt_upload_1[] to gt_upload_temp[]. sort gt_upload_temp by plnnr ascending plnal descending . delete adjacent duplicates from gt_upload_temp comparing plnnr. sort gt_plan by plnnr. loop at gt_plan. lv_count = sy-tabix. if p_cb1 <> 'X'."针对已存在数据的扩展. if lv_count = 1. lv_plnnr = gt_plan-plnnr. else. if gt_plan-plnnr <> lv_plnnr. clear: lv_index,lalv_plnal. lv_plnnr = gt_plan-plnnr. else. lv_index = 1. endif. endif. else. if lv_count = 1. lv_index = 1. lv_plnnr = gt_plan-plnnr. read table gt_inspection with key plnnr = gt_plan-plnnr. if sy-subrc = 0. lalv_plnal = gt_inspection-plnal + 1. endif. else. if gt_plan-plnnr <> lv_plnnr. clear: lv_index,lalv_plnal. lv_plnnr = gt_plan-plnnr. endif. endif. endif. loop at gt_upload_1 where plnnr = gt_plan-plnnr. gt_alv-row = sy-tabix. move-corresponding gt_upload_1 to gt_alv. gt_alv-zplan = gt_alv-werks. "原检验计划工厂 gt_alv-werks = gt_plan-werks. "扩展工厂 read table gt_upload_c with key plnnr = gt_alv-plnnr plnal = gt_alv-plnal. if sy-subrc = 0. gt_alv-lifnr = gt_upload_c-lifnr. gt_alv-matnr = gt_upload_c-matnr. endif. gt_alv-plnal_f = gt_alv-plnal."原计划组计数器 if gt_alv-plnal_f <> ''. "组计数器 补零 自增+1 unpack gt_alv-plnal_f to gt_alv-plnal_f. endif. if lalv_plnal is not initial. gt_alv-plnal = lalv_plnal. endif. if gt_alv-plnal <> ''. "组计数器 补零 自增+1 unpack gt_alv-plnal to gt_alv-plnal. endif. if not gt_alv-verwmerkm is initial."主检验特性 补零 unpack gt_alv-verwmerkm to gt_alv-verwmerkm. endif. if gt_alv-arbpl is not initial. select single arbpl into gt_alv-arbpl from crhd where objid = gt_alv-arbpl. endif. if gt_alv-plnme is not initial. read table gt_t006a with key msehi = gt_alv-plnme spras = sy-langu. if sy-subrc = 0. gt_alv-plnme = gt_t006a-mseh3. endif. endif. if gt_alv-probemgeh is not initial. read table gt_t006a with key msehi = gt_alv-probemgeh spras = sy-langu. if sy-subrc = 0. gt_alv-probemgeh = gt_t006a-mseh3. endif. endif. gt_alv-datub = '99991231'. "获取字段“标准要求” if p_cb1 = 'X'. perform frm_get_text changing gt_alv. endif. "检查数据 perform frm_che_data changing gt_alv. perform frm_set_color changing gt_alv. at end of plnal. if p_cb1 <> 'X'."针对已存在数据的扩展 if lv_index = 1. lalv_plnal = lalv_plnal + 1. endif. else. if lv_index = 1. "针对相同的组计划 组计数器自增的开关 lalv_plnal = lalv_plnal + 1. endif. endif. endat. append gt_alv. clear gt_alv. endloop. if p_cb1 <> 'X'."针对已存在数据的扩展. if gt_plan-plnnr = lv_plnnr and lalv_plnal is initial."相同计划组 第二次循环的时候执行 read table gt_upload_temp with key plnnr = gt_plan-plnnr. if sy-subrc = 0. lalv_plnal = gt_upload_temp-plnal + 1. endif. endif. endif. endloop. *---》物料清单 loop at gt_upload_c. move-corresponding gt_upload_c to gt_upload_2. if gt_upload_2-matnr <> '' and gt_upload_2-matnr co '0123456789' . unpack gt_upload_2-matnr to gt_upload_2-matnr. elseif gt_upload_2-matnr na 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'. unpack gt_upload_2-matnr to gt_upload_2-matnr. endif. if gt_upload_2-plnal <> ''. "组计数器 补零 unpack gt_upload_2-plnal to gt_upload_2-plnal. endif. if not gt_upload_2-lifnr is initial. unpack gt_upload_2-lifnr to gt_upload_2-lifnr. endif. append gt_upload_2. clear gt_upload_2. endloop. "单独扩展 move-corresponding gt_alv[] to gt_alvk[]. sort gt_alvk by plnnr plnal werks. delete adjacent duplicates from gt_alvk comparing plnnr plnal werks. loop at gt_alvk. "显示扩展工厂字段 gt_alvk-row = sy-tabix. modify gt_alvk transporting row. endloop. "汇总检验计划行 clear gt_plnnr[]. move-corresponding gt_alvk[] to gt_plnnr[]. *-----------------------------------------------------------调用BAPI相关代码--------------------------------------------------------------------------- data : lv_toleranzun type cha_class_view-sollwert, "规范下限 lv_toleranzob type cha_class_view-sollwert. "规范上限 **CREATE-BAPI 中的任务清单抬头数据用于检验计划 data : task like table of bapi1191_tsk_c with header line . **CREATE-BAPI 中的工序用于检验批 data : operation like table of bapi1191_opr_c with header line. **检验计划 CREATE-BAPI 中的检验特性 data : insp_char like table of bapi1191_cha_c with header line. **CREATE_BAPI 中的物料任务清单用于检验计划 data : matnr_allc like table of bapi1191_mtk_c with header line . data : textallocation like table of bapi1191_txt_hdr_c with header line. data : text like table of bapi1012_txt_c with header line. **返回参数 data : retn like table of bapiret2 with header line. data : group like bapi1191_tsk_c-task_list_group . data : gp_cnter like bapi1191_tsk_c-group_counter . data:l_chk_error(1). data:begin of lt_qpmk occurs 0, zaehler like qpmk-zaehler, "主检验特性的工厂 mkmnr like qpmk-mkmnr, "主文件检验特性 version like qpmk-version, "主检验特性的版本号 steuerkz like qpmk-steuerkz, "检验特性/主检验特性的控制标识串 end of lt_qpmk. data:begin of lt_qpmz occurs 0, zaehler like qpmk-zaehler, "主检验特性的工厂 mkmnr like qpmk-mkmnr, "主文件检验特性 auswmenge1 like qpmz-auswmenge1, auswmgwrk1 like qpmz-auswmgwrk1, katalgart2 like qpmz-katalgart2, auswmenge2 like qpmz-auswmenge2, end of lt_qpmz. select zaehler mkmnr version steuerkz from qpmk into table lt_qpmk for all entries in gt_alv where zaehler = gt_alv-qpmk_werks and mkmnr = gt_alv-verwmerkm. select zaehler mkmnr auswmenge1 auswmgwrk1 katalgart2 auswmenge2 from qpmz into table lt_qpmz for all entries in gt_alv where zaehler = gt_alv-qpmk_werks and mkmnr = gt_alv-verwmerkm. sort lt_qpmk by zaehler mkmnr version descending. delete adjacent duplicates from lt_qpmk comparing zaehler mkmnr. read table gt_alv with key mtype = 'E'. if sy-subrc = 0. message s024 with '请先处理错误信息。' display like 'E'. else. loop at gt_plnnr. clear:task[],matnr_allc[],operation[],insp_char[],textallocation[],text[],l_chk_error. loop at gt_upload_2 where plnnr = gt_plnnr-plnnr and plnal = gt_plnnr-plnal_f."分配物料到任务清单 MAPL matnr_allc-task_list_group = gt_plnnr-plnnr ."检验计划组 matnr_allc-group_counter = gt_plnnr-plnal."组计数器 * matnr_allc-group_counter = gt_upload_2-plnal."组计数器 * matnr_allc-plant = gt_upload_2-werks_mt." 物料工厂 matnr_allc-plant = gt_plnnr-werks." 物料工厂 matnr_allc-material = gt_upload_2-matnr. matnr_allc-vendor = gt_upload_2-lifnr."供应商 matnr_allc-valid_from = sy-datum. append matnr_allc. clear matnr_allc. endloop. loop at gt_alv where plnnr = gt_plnnr-plnnr and plnal = gt_plnnr-plnal and werks = gt_plnnr-werks. translate gt_alv-werks to upper case. "任务清单抬头 PLKO task-plant = gt_alv-werks."工厂 task-task_list_group = gt_alv-plnnr ."检验计划组 task-group_counter = gt_alv-plnal."组计数器 task-description = gt_alv-ktext."组说明 task-task_list_usage = gt_alv-verwe."用途 if gt_alv-plnme is not initial. call function 'CONVERSION_EXIT_CUNIT_INPUT' "批量单位 exporting input = gt_alv-plnme language = sy-langu importing output = task-task_measure_unit exceptions unit_not_found = 1 others = 2. endif. task-ident_key = gt_alv-slwbez."检验点 task-valid_from = sy-datum."开始生效日期 task-task_list_status = '4'. task-dyn_modif_level = '3'. append task. clear task. operation-task_list_group = gt_alv-plnnr ."检验计划组 “工序作业 PLPO operation-group_counter = gt_alv-plnal."组计数器 operation-activity = gt_alv-vornr."工序 operation-description = gt_alv-ltxal."工序说明 operation-control_key = 'QM01'."OPERATION-STEUS."控制码 operation-work_cntr = gt_alv-arbpl."工作中心 没找到数据源 operation-plant = gt_alv-werks."工作中心的工厂 operation-valid_from = sy-datum. operation-denominator = 1."用于转换工艺路线和工序单位的分母 operation-nominator = 1. "用于转换任务清单和工序计量单位的计数器 operation-base_quantity = 1."基本数量 if gt_alv-slwbez is not initial. operation-insp_point_complt_flow_variant = '2'."检验点的评估方式 endif. if gt_alv-plnme is not initial. call function 'CONVERSION_EXIT_CUNIT_INPUT' "工序单位 exporting input = gt_alv-plnme language = sy-langu importing output = operation-operation_measure_unit exceptions unit_not_found = 1 others = 2. endif. append operation. clear operation. insp_char-task_list_group = gt_alv-plnnr ."检验计划组 “PLMK 检验计划特性 insp_char-group_counter = gt_alv-plnal."组计数器 insp_char-activity = gt_alv-vornr."工序 insp_char-inspchar = gt_alv-merknr."检验特性编号 insp_char-mstr_char = gt_alv-verwmerkm."主检验特性 insp_char-pmstr_char = '2010'."主检验特性的工厂 insp_char-char_descr = gt_alv-kurztext."检验特性的短文本 if gt_alv-meas_unit is not initial. select single msehi from t006a into insp_char-meas_unit where mseht = gt_alv-meas_unit and spras = sy-langu. endif. insp_char-dec_places = gt_alv-stellen."小数位 * lv_toleranzun = gt_alv-toleranzun. * lv_toleranzob = gt_alv-toleranzob. if gt_alv-toleranzun is not initial. clear lv_toleranzun. call function 'QSS0_FLTP_TO_CHAR_CONVERSION' exporting i_number_of_digits = gt_alv-stellen i_fltp_value = gt_alv-toleranzun i_value_not_initial_flag = 'X' i_screen_fieldlength = 16 importing e_char_field = lv_toleranzun. insp_char-lw_tol_lmt = lv_toleranzun."规范下限 insp_char-lw_pls_lmt = lv_toleranzun."实际下限 endif. if gt_alv-toleranzob is not initial. clear lv_toleranzob. call function 'QSS0_FLTP_TO_CHAR_CONVERSION' exporting i_number_of_digits = gt_alv-stellen i_fltp_value = gt_alv-toleranzob i_value_not_initial_flag = 'X' i_screen_fieldlength = 16 importing e_char_field = lv_toleranzob. insp_char-up_pls_lmt = lv_toleranzob."实际上限 insp_char-up_tol_lmt = lv_toleranzob."规范上限 endif. insp_char-method = gt_alv-pmtnr."检验方法 insp_char-pmethod = '2010'."检验方法的工厂 insp_char-smpl_procedure = gt_alv-stichprver."采样过程 if gt_alv-probemgeh is not initial. call function 'CONVERSION_EXIT_CUNIT_INPUT' "采样计量单位 exporting input = gt_alv-probemgeh language = sy-langu importing output = insp_char-smpl_unit exceptions unit_not_found = 1 others = 2. else. if gt_alv-plnme is not initial. call function 'CONVERSION_EXIT_CUNIT_INPUT' exporting input = gt_alv-plnme language = sy-langu importing output = insp_char-smpl_unit exceptions unit_not_found = 1 others = 2. endif. endif. if gt_alv-pruefeinh is initial."基本采样数 insp_char-smpl_quant = 1. else. insp_char-smpl_quant = gt_alv-pruefeinh. endif. insp_char-dyn_modif_rule = gt_alv-qdynregel."动态修改规则 insp_char-infofield1 = gt_alv-infofield1."字段信息 1 insp_char-infofield2 = gt_alv-infofield2."字段信息 2 insp_char-infofield3 = gt_alv-infofield3."字段信息 3 insp_char-res_org = gt_alv-res_org."结果数据源 insp_char-valid_from = sy-datum."有效日期 read table lt_qpmz with key zaehler = '2010' mkmnr = gt_alv-verwmerkm. if sy-subrc = 0. insp_char-sel_set1 = lt_qpmz-auswmenge1. insp_char-psel_set1 = lt_qpmz-auswmgwrk1. insp_char-cat_type2 = lt_qpmz-katalgart2. insp_char-code_group2 = lt_qpmz-auswmenge2. endif. read table lt_qpmk with key zaehler = '2010' mkmnr = gt_alv-verwmerkm. if sy-subrc = 0. data: p_steuerkz type qpmk-steuerkz. clear p_steuerkz. p_steuerkz = lt_qpmk-steuerkz. "检验特性/主检验特性的控制标识串 insp_char-quantitative_ind = p_steuerkz+0(1)."数量特性 insp_char-cha_master_import_modus = 'N'."参考主文件检验特性的方式 insp_char-meas_value_confirm_ind = p_steuerkz+1(1)."必须记录计量值 insp_char-attribute_required_ind = p_steuerkz+2(1)."要求特性属性的参考 if gt_alv-toleranzob is not initial. insp_char-up_tol_lmt_ind = 'X'."规范上限标识 endif. if gt_alv-toleranzun is not initial. insp_char-lw_tol_lmt_ind = 'X'."规范下限标识 endif. insp_char-target_val_check_ind = p_steuerkz+5(1)."检查目标值 insp_char-scope_ind = p_steuerkz+6(1)."检验范围 insp_char-long_term_insp_ind = gt_alv-long_term_insp_ind."长期检验 insp_char-result_recording_type = p_steuerkz+8(1)."记录类型 insp_char-docu_requ = p_steuerkz+9(1)."检验结果的必需凭证 insp_char-confirmation_category = gt_alv-confirmation_category ."特性类别 insp_char-add_sample_quantity = p_steuerkz+12(1)."附加采样 insp_char-formula_ind = p_steuerkz+14(1)."计算的特性 insp_char-sampling_procedure_ind = p_steuerkz+15(1)."要求采样过程 insp_char-qscore_and_share_relevant = p_steuerkz+16(1)."质量记分和废品份额的相关特性 insp_char-defect_no_confirmation = p_steuerkz+18(1)."记录缺陷数 insp_char-destructive_insp_ind = gt_alv-destructive_insp_ind."是否破坏性采样 insp_char-insp_tool_ind = p_steuerkz+21(1)."记录缺陷数 insp_char-auto_defct_recording = p_steuerkz+22(1)."缺陷记录已自动调用 insp_char-change_documents_req = p_steuerkz+23(1)."在结果记录期间创建更改凭证 insp_char-spc_ind = p_steuerkz+24(1)."SPC 特性 insp_char-print_ind = p_steuerkz+25(1)."打印 endif. append insp_char. clear insp_char. endloop. textallocation-object_type = '10'. "固定值不用管 textallocation-valid_from = sy-datum. textallocation-task_list_group = gt_alv-plnnr ."检验计划组 textallocation-group_counter = gt_alv-plnal."组计数器 textallocation-langu = sy-langu. textallocation-line_from = '1'. textallocation-line_to = '2'. append textallocation. clear textallocation. text-format_col = '1'. text-text_line = gt_alv-ktext. append text. clear text. text-format_col = '2'. text-text_line = gt_alv-text_line. append text. clear text. sort task by task_list_group group_counter . delete adjacent duplicates from task comparing task_list_group group_counter . sort operation by task_list_group group_counter activity . delete adjacent duplicates from operation comparing task_list_group group_counter activity . "创建 data: sf. data:l_message type string. call function 'BAPI_INSPECTIONPLAN_CREATE' importing group = group groupcounter = gp_cnter tables task = task materialtaskallocation = matnr_allc operation = operation inspcharacteristic = insp_char text = text textallocation = textallocation return = retn. read table retn with key type = 'E' id = 'BAPI'. if sy-subrc = 0. clear l_message. loop at retn where type eq 'E' or type eq 'A'. call function 'MESSAGE_TEXT_BUILD' exporting msgid = retn-id msgnr = retn-number msgv1 = retn-message_v1 msgv2 = retn-message_v2 msgv3 = retn-message_v3 msgv4 = retn-message_v4 importing message_text_output = retn-message. concatenate l_message retn-message into l_message. endloop. gt_plnnr-msage = l_message. gt_plnnr-mtype = 'E'. else. call function 'BAPI_TRANSACTION_COMMIT'. call function 'MESSAGE_TEXT_BUILD' exporting msgid = retn-id msgnr = retn-number msgv1 = retn-message_v1 msgv2 = retn-message_v2 msgv3 = retn-message_v3 msgv4 = retn-message_v4 importing message_text_output = gt_plnnr-msage. gt_plnnr-mtype = 'S'. gt_plnnr-msage = '组:' && gt_plnnr-plnnr && '计数器' && gt_plnnr-plnal && '创建成功!'. endif. modify gt_plnnr. refresh : task,matnr_allc,operation,insp_char,retn. clear: task, matnr_allc,operation,insp_char,retn,group,gp_cnter. endloop. "回写消息 if p_cb1 = 'X'. loop at gt_alvk . read table gt_plnnr with key plnnr = gt_alvk-plnnr plnal = gt_alvk-plnal werks = gt_alvk-werks. if sy-subrc = 0. gt_alvk-msage = gt_plnnr-msage. gt_alvk-mtype = gt_plnnr-mtype. perform frm_set_color changing gt_alv. modify gt_alvk transporting msage mtype. endif. endloop. else. loop at gt_alv . read table gt_plnnr with key plnnr = gt_alv-plnnr plnal = gt_alv-plnal werks = gt_alv-werks. if sy-subrc = 0. gt_alv-msage = gt_plnnr-msage. gt_alv-mtype = gt_plnnr-mtype. perform frm_set_color changing gt_alv. modify gt_alv transporting msage mtype. endif. endloop. endif. endif. 代码中有一些perform的函数没有具体的代码,补全代码使其能运行并解释组计数器的原理
最新发布
11-12
function problem2_complete_optimization() % 问题2:完整的优化解决方案 fprintf('=== 问题2:基于BMI分组确定最佳NIPT时点(完整优化方案) ===\n'); % 步骤1: 数据准备和高级预处理 fprintf('=== 数据准备和高级预处理 ===\n'); data = readtable('筛查后数据.xlsx'); % 提取男胎数据并添加更多特征 male_idx = ~isnan(data{:,22}) & data{:,22} > 0; male_data = data(male_idx, :); % 提取关键变量 bmi = male_data{:,11}; gestational_week = convert_gestational_week(male_data{:,10}); y_concentration = male_data{:,22}; age = male_data{:,3}; weight = male_data{:,5}; height = male_data{:,4}; pregnancy_count = male_data{:,9}; % 计算派生特征 bmi_squared = bmi.^2; age_bmi_interaction = age .* bmi; week_bmi_interaction = gestational_week .* bmi; fprintf('男胎样本数: %d\n', sum(male_idx)); fprintf('BMI范围: [%.2f, %.2f]\n', min(bmi), max(bmi)); fprintf('孕周范围: [%.1f, %.1f]周\n', min(gestational_week), max(gestational_week)); % 步骤2: 特征选择和高级模型训练 fprintf('\n=== 特征选择和高级模型训练 ===\n'); % 准备特征矩阵 X = [gestational_week, bmi, age, weight, height, pregnancy_count, ... bmi_squared, age_bmi_interaction, week_bmi_interaction]; feature_names = {'孕周', 'BMI', '年龄', '体重', '身高', '怀孕次数', ... 'BMI平方', '年龄×BMI', '孕周×BMI'}; y = y_concentration; % 移除缺失值 valid_idx = all(~isnan(X), 2) & ~isnan(y); X = X(valid_idx, :); y = y(valid_idx); % 特征重要性分析 [selected_features, feature_importance] = select_important_features(X, y, feature_names); X_selected = X(:, selected_features); % 数据标准化 [X_std, X_mean, X_std_val] = zscore(X_selected); [y_std, y_mean, y_std_val] = zscore(y); % 使用贝叶斯优化训练集成模型 fprintf('训练优化模型...\n'); [optimal_model, optimization_results] = train_optimized_model(X_std, y_std); % 模型评估 y_pred = predict(optimal_model, X_std); r2 = 1 - sum((y_std - y_pred).^2) / sum((y_std - mean(y_std)).^2); rmse = sqrt(mean((y_std - y_pred).^2)); fprintf('优化模型性能: R² = %.4f, RMSE = %.4f\n', r2, rmse); % 步骤3: 智能BMI分组策略 fprintf('\n=== 智能BMI分组策略 ===\n'); % 使用多种方法确定最佳分组 [bmi_boundaries, grouping_quality] = determine_optimal_bmi_groups(bmi, y_concentration); n_groups = length(bmi_boundaries) - 1; fprintf('最佳BMI分组边界: %s\n', mat2str(bmi_boundaries')); fprintf('分组质量指标: %.4f\n', grouping_quality); fprintf('分组数量: %d\n', n_groups); % 步骤4: 多目标风险优化 fprintf('\n=== 多目标风险优化 ===\n'); % 定义多目标风险函数 risk_params = struct(); risk_params.weights = [0.6, 0.3, 0.1]; % [检测失败, 延迟风险, 个体差异] risk_params.age_threshold = 35; risk_params.bmi_penalty = 0.05; % 为每个分组进行多目标优化 optimization_results = cell(n_groups, 1); for group = 1:n_groups fprintf('优化组%d...\n', group); optimization_results{group} = optimize_group_risk(... group, bmi_boundaries, bmi, age, optimal_model, ... X_mean, X_std_val, y_mean, y_std_val, risk_params); end % 步骤5: 鲁棒性分析和误差传播 fprintf('\n=== 鲁棒性分析和误差传播 ===\n'); robustness_results = perform_robustness_analysis(... X_selected, y, bmi_boundaries, optimal_model, risk_params); % 步骤6: 临床可行性验证 fprintf('\n=== 临床可行性验证 ===\n'); validation_results = clinical_validation(... optimization_results, robustness_results, bmi_boundaries); % 步骤7: 综合结果展示 fprintf('\n=== 最终优化结果 ===\n'); display_final_results(optimization_results, robustness_results, ... validation_results, bmi_boundaries); % 步骤8: 高级可视化 fprintf('\n=== 生成高级可视化 ===\n'); create_comprehensive_visualization(... bmi, gestational_week, y_concentration, age, ... bmi_boundaries, optimization_results, robustness_results, ... optimal_model, X_mean, X_std_val, y_mean, y_std_val, risk_params); % 保存完整结果 save_comprehensive_results(... optimization_results, robustness_results, validation_results, ... optimal_model, bmi_boundaries, feature_importance); fprintf('\n=== 优化完成 ===\n'); end function [selected_idx, importance] = select_important_features(X, y, feature_names) % 特征选择使用随机森林重要性 % 训练随机森林 rf_model = TreeBagger(100, X, y, 'Method', 'regression', ... 'OOBPredictorImportance', 'on'); % 获取特征重要性 importance = rf_model.OOBPermutedPredictorDeltaError; [sorted_importance, sorted_idx] = sort(importance, 'descend'); % 选择重要性高于平均值的特征 importance_threshold = mean(importance); selected_idx = find(importance > importance_threshold); fprintf('特征重要性排名:\n'); for i = 1:length(sorted_idx) fprintf('%d. %s: %.4f\n', i, feature_names{sorted_idx(i)}, sorted_importance(i)); end fprintf('选择的特征: %s\n', strjoin(feature_names(selected_idx), ', ')); end function [model, results] = train_optimized_model(X, y) % 使用贝叶斯优化训练模型 rng(42); % 可重复性 % 定义超参数空间 params = [ optimizableVariable('NumTrees', [50, 200], 'Type', 'integer'),... optimizableVariable('MaxDepth', [3, 15], 'Type', 'integer'),... optimizableVariable('MinLeafSize', [1, 20], 'Type', 'integer'),... optimizableVariable('LearnRate', [0.01, 0.2], 'Transform', 'log') ]; % 贝叶斯优化函数 objective_func = @(params) bayesopt_objective(params, X, y); % 运行优化 results = bayesopt(objective_func, params, ... 'MaxObjectiveEvaluations', 50, ... 'UseParallel', false, ... 'Verbose', 0); % 最佳参数 best_params = results.XAtMinObjective; % 训练最终模型 model = fitrensemble(X, y, ... 'Method', 'LSBoost', ... 'NumLearningCycles', best_params.NumTrees, ... 'Learners', templateTree('MaxNumSplits', best_params.MaxDepth, ... 'MinLeafSize', best_params.MinLeafSize), ... 'LearnRate', best_params.LearnRate); end function loss = bayesopt_objective(params, X, y) % 贝叶斯优化目标函数 cv = cvpartition(size(X,1), 'KFold', 5); rmse_values = zeros(cv.NumTestSets, 1); for fold = 1:cv.NumTestSets train_idx = cv.training(fold); test_idx = cv.test(fold); model = fitrensemble(X(train_idx,:), y(train_idx), ... 'Method', 'LSBoost', ... 'NumLearningCycles', params.NumTrees, ... 'Learners', templateTree('MaxNumSplits', params.MaxDepth, ... 'MinLeafSize', params.MinLeafSize), ... 'LearnRate', params.LearnRate, ... 'Verbose', 0); y_pred = predict(model, X(test_idx,:)); rmse_values(fold) = sqrt(mean((y(test_idx) - y_pred).^2)); end loss = mean(rmse_values); end function [boundaries, quality] = determine_optimal_bmi_groups(bmi, y_concentration) % 确定最佳BMI分组 % 多种分组方法 methods = {'percentile', 'kmeans', 'jenks'}; qualities = zeros(length(methods), 1); all_boundaries = cell(length(methods), 1); for i = 1:length(methods) switch methods{i} case 'percentile' boundaries = prctile(bmi, [0, 25, 50, 75, 100]); quality = calculate_grouping_quality(bmi, y_concentration, boundaries); case 'kmeans' [~, centers] = kmeans(bmi, 4); boundaries = [min(bmi); sort(centers); max(bmi)]; quality = calculate_grouping_quality(bmi, y_concentration, boundaries); case 'jenks' % 简化的Jenks自然断点法 boundaries = simplified_jenks(bmi, 4); quality = calculate_grouping_quality(bmi, y_concentration, boundaries); end qualities(i) = quality; all_boundaries{i} = boundaries; end % 选择最佳分组 [best_quality, best_idx] = max(qualities); boundaries = all_boundaries{best_idx}; quality = best_quality; fprintf('最佳分组方法: %s (质量: %.4f)\n', methods{best_idx}, quality); end function quality = calculate_grouping_quality(bmi, y_concentration, boundaries) % 计算分组质量 n_groups = length(boundaries) - 1; group_means = zeros(n_groups, 1); group_vars = zeros(n_groups, 1); for group = 1:n_groups idx = bmi >= boundaries(group) & bmi <= boundaries(group+1); group_means(group) = mean(y_concentration(idx)); group_vars(group) = var(y_concentration(idx)); end % 组间方差 / 组内方差 quality = var(group_means) / mean(group_vars); end function boundaries = simplified_jenks(data, n_groups) % 简化的Jenks自然断点法 sorted_data = sort(data); n = length(sorted_data); % 初始化边界 boundaries = linspace(1, n, n_groups + 1); boundaries = round(boundaries); boundaries = sorted_data(boundaries)'; % 简单优化 for iter = 1:100 improved = false; for i = 2:n_groups current_boundary = boundaries(i); # 尝试移动边界 for delta = -5:5 if delta == 0, continue; end new_boundary_idx = find(sorted_data == current_boundary, 1) + delta; if new_boundary_idx < 1 || new_boundary_idx > n continue; end new_boundaries = boundaries; new_boundaries(i) = sorted_data(new_boundary_idx); current_quality = calculate_grouping_quality(data, data, boundaries); new_quality = calculate_grouping_quality(data, data, new_boundaries); if new_quality > current_quality boundaries = new_boundaries; improved = true; break; end end end if ~improved break; end end end function result = optimize_group_risk(group, boundaries, bmi, age, model, ... X_mean, X_std_val, y_mean, y_std_val, params) % 为单个分组进行风险优化 % 分组统计 group_idx = find(bmi >= boundaries(group) & bmi <= boundaries(group+1)); group_bmi = mean(bmi(group_idx)); group_age = mean(age(group_idx)); group_size = length(group_idx); % 定义多目标风险函数 multi_risk_func = @(week) multi_objective_risk(... week, group_bmi, group_age, model, ... X_mean, X_std_val, y_mean, y_std_val, params); % 使用多起点优化 week_range = linspace(10, 25, 50); risks = arrayfun(multi_risk_func, week_range); [min_risk, min_idx] = min(risks); initial_guess = week_range(min_idx); % 精细优化 options = optimset('Display', 'off', 'TolX', 0.01, 'TolFun', 1e-6); [optimal_week, optimal_risk] = fminbnd(multi_risk_func, 10, 25, options); % 计算各个风险分量 [~, risk_components] = multi_risk_func(optimal_week); result = struct(); result.group = group; result.bmi_range = [boundaries(group), boundaries(group+1)]; result.optimal_week = optimal_week; result.optimal_risk = optimal_risk; result.risk_components = risk_components; result.group_size = group_size; result.group_bmi = group_bmi; result.group_age = group_age; end function [total_risk, components] = multi_objective_risk(week, bmi_val, age_val, model, ... X_mean, X_std_val, y_mean, y_std_val, params) % 多目标风险函数 % 预测浓度 X_pred = [week, bmi_val, age_val, week*bmi_val]; X_pred_std = (X_pred - X_mean) ./ X_std_val; y_pred = predict(model, X_pred_std); y_original = y_pred * y_std_val + y_mean; % 各个风险分量 components = struct(); % 1. 检测失败风险 components.failure_risk = 1 / (1 + exp(20*(y_original - 0.04))); % 2. 延迟风险 if week <= 12 components.delay_risk = 0; elseif week <= 20 components.delay_risk = ((week - 12) / 8)^2; else components.delay_risk = 0.25 + ((week - 20) / 5)^1.5; end % 3. 个体差异风险 components.individual_risk = max(0, (age_val - params.age_threshold) / 20) + ... params.bmi_penalty * max(0, bmi_val - 30); % 加权总风险 total_risk = params.weights(1) * components.failure_risk + ... params.weights(2) * components.delay_risk + ... params.weights(3) * components.individual_risk; end function results = perform_robustness_analysis(X, y, boundaries, model, risk_params) % 鲁棒性分析 n_groups = length(boundaries) - 1; n_simulations = 100; results = struct(); results.optimal_weeks = zeros(n_groups, n_simulations); results.risks = zeros(n_groups, n_simulations); parfor sim = 1:n_simulations % 添加噪声和重采样 noise_level = 0.01 + 0.04 * rand(); % 1-5%的噪声 X_noisy = X + noise_level * randn(size(X)); y_noisy = y + noise_level * randn(size(y)); % 重新训练简化模型 [X_std, X_mean, X_std_val] = zscore(X_noisy); [y_std, y_mean, y_std_val] = zscore(y_noisy); simple_model = fitrensemble(X_std, y_std, ... 'Method', 'LSBoost', 'NumLearningCycles', 50, 'LearnRate', 0.1); % 为每个分组优化 for group = 1:n_groups bmi_center = mean(boundaries(group:group+1)); age_center = mean(X_noisy(:,3)); % 假设第三列是年龄 risk_func = @(week) multi_objective_risk(... week, bmi_center, age_center, simple_model, ... X_mean, X_std_val, y_mean, y_std_val, risk_params); % 网格搜索 week_test = linspace(10, 25, 30); risks = arrayfun(risk_func, week_test); [min_risk, min_idx] = min(risks); results.optimal_weeks(group, sim) = week_test(min_idx); results.risks(group, sim) = min_risk; end end % 计算统计量 for group = 1:n_groups results.mean_week(group) = mean(results.optimal_weeks(group, :)); results.std_week(group) = std(results.optimal_weeks(group, :)); results.ci_week(group, :) = prctile(results.optimal_weeks(group, :), [2.5, 97.5]); end end function results = clinical_validation(opt_results, robustness_results, boundaries) % 临床可行性验证 n_groups = length(boundaries) - 1; results = struct(); for group = 1:n_groups opt_week = opt_results{group}.optimal_week; robustness_ci = robustness_results.ci_week(group, :); % 检查临床可行性 results(group).feasible = (opt_week >= 10) && (opt_week <= 25); results(group).robust = (robustness_ci(2) - robustness_ci(1)) < 2; % CI宽度小于2周 results(group).practical = (round(opt_week * 2) / 2) == opt_week; % 可半周为单位 % 风险评估 if opt_week < 12 results(group).risk_level = '低'; elseif opt_week < 20 results(group).risk_level = '中'; else results(group).risk_level = '高'; end end end function display_final_results(opt_results, robustness_results, validation_results, boundaries) % 显示最终结果 fprintf('\n=== 最终优化结果汇总 ===\n'); fprintf('组别\tBMI范围\t\t最佳时点\t风险值\t\t95%% CI\t\t\t样本数\t可行性\n'); fprintf('----\t--------\t--------\t--------\t----------------\t--------\t--------\n'); for group = 1:length(opt_results) result = opt_results{group}; valid = validation_results(group); robust = robustness_results.ci_week(group, :); feasibility = '通过'; if ~valid.feasible || ~valid.robust feasibility = '需调整'; end fprintf('%d\t[%.1f,%.1f]\t%.1f周\t%.4f\t[%.1f,%.1f]\t%d\t%s\n', ... group, result.bmi_range(1), result.bmi_range(2), ... result.optimal_week, result.optimal_risk, ... robust(1), robust(2), result.group_size, feasibility); end end function gestational_week_numeric = convert_gestational_week(week_data) % 孕周转换函数 gestational_week_numeric = zeros(size(week_data)); for i = 1:length(week_data) if iscell(week_data) week_str = week_data{i}; else week_str = char(week_data(i)); end if contains(week_str, 'w+') parts = strsplit(week_str, {'w', '+'}); weeks = str2double(parts{1}); days = str2double(parts{end}); gestational_week_numeric(i) = weeks + days/7; elseif contains(week_str, 'w') weeks = str2double(regexp(week_str, '\d+', 'match', 'once')); gestational_week_numeric(i) = weeks; else gestational_week_numeric(i) = str2double(week_str); end end end % 运行完整优化解决方案 problem2_complete_optimization(); 优化上述代码
09-07
# power/power_sync.py import json import os import re import logging import sys from pathlib import Path from shutil import copy2 from datetime import datetime from utils import resource_path from typing import Dict, List, Tuple, Any # ------------------------------- # 日志配置 # ------------------------------- PROJECT_ROOT = Path(__file__).parent.parent.resolve() LOG_DIR = PROJECT_ROOT / "output" / "log" LOG_DIR.mkdir(parents=True, exist_ok=True) LOG_FILE = LOG_DIR / f"power_sync_{datetime.now().strftime('%Y%m%d_%H%M%S')}.log" class PowerTableSynchronizer: def __init__(self, c_file_path=None, dry_run=False, config_path="config/config.json"): self.logger = logging.getLogger(__name__) # === Step 1: 使用 resource_path 解析所有路径 === self.config_file_path = resource_path(config_path) logging.info(f"配置文件: {self.config_file_path}") if not os.path.exists(self.config_file_path): raise FileNotFoundError(f"配置文件不存在: {self.config_file_path}") try: with open(self.config_file_path, 'r', encoding='utf-8') as f: self.config = json.load(f) print(f"配置文件已加载: {self.config_file_path}") except json.JSONDecodeError as e: raise ValueError(f"配置文件格式错误,JSON 解析失败: {self.config_file_path}") from e except Exception as e: raise RuntimeError(f"读取配置文件时发生未知错误: {e}") from e self.dry_run = dry_run # === Step 2: 目标 C 文件处理 === if c_file_path is None: if "target_c_file" not in self.config: raise KeyError("config 文件缺少 'target_c_file' 字段") internal_c_path = self.config["target_c_file"] logging.info(f"使用内置 C 文件: {internal_c_path}") self.c_file_path =resource_path(internal_c_path) self._is_internal_c_file = True else: self.c_file_path = Path(c_file_path) self._is_internal_c_file = False if not self.c_file_path.exists(): raise FileNotFoundError(f"找不到 C 源文件: {self.c_file_path}") # === Step 3: 初始化数据容器 === self.locale_enums = {} # enum_name -> {"macros": [macro], "values": {macro: idx}} self.power_tables = {} # table_name -> [lines] self.table_pending_appends = {} # table_name -> List[str] # === Step 4: 加载锚点标记 === for marker_key in ["STR_POWER_LOCALE_ENUM", "END_POWER_LOCALE_ENUM", "STR_POWER_TABLE", "END_POWER_TABLE"]: if marker_key not in self.config: raise KeyError(f"config 文件缺少 '{marker_key}' 字段") self.start_enum_marker = self.config["STR_POWER_LOCALE_ENUM"] self.end_enum_marker = self.config["END_POWER_LOCALE_ENUM"] self.start_table_marker = self.config["STR_POWER_TABLE"] self.end_table_marker = self.config["END_POWER_TABLE"] # === Step 5: 功率表文件 === gen_file = PROJECT_ROOT / "output" / "tx_limit_table.c" if not gen_file.exists(): self.logger.error(f"❌ 找不到生成文件: {gen_file}") raise FileNotFoundError(f"请先运行 excel_to_clm.py 生成 tx_limit_table.c: {gen_file}") try: self.power = gen_file.read_text(encoding='utf-8') except Exception as e: self.logger.error(f"❌ 读取 {gen_file} 失败: {e}") raise # 加载 locale_targets 配置 if "locale_targets" not in self.config: raise KeyError("config 文件缺少 'locale_targets' 字段") required_keys = {"enum", "table", "suffix"} for i, item in enumerate(self.config["locale_targets"]): if not isinstance(item, dict) or not required_keys.issubset(item.keys()): raise ValueError(f"locale_targets[{i}] 缺少必要字段 {required_keys}: {item}") self.locale_targets = self.config["locale_targets"] self.logger.info(f"已加载 {len(self.locale_targets)} 个 Locale 映射目标") def offset_to_lineno(self, content: str, offset: int) -> int: """将字符偏移量转换为行号(从1开始)""" return content.count('\n', 0, offset) + 1 def parse_c_power_definitions(self): """解析 C 文件中的 enum locale_xxx_idx 和 static const unsigned char locales_xxx[]""" content = self.c_file_path.read_text(encoding='utf-8') # --- 解析 ENUM 区域 --- try: enum_start_idx = content.find(self.start_enum_marker) enum_end_idx = content.find(self.end_enum_marker) if enum_start_idx == -1 or enum_end_idx == -1: raise ValueError("未找到 LOCALE ENUM 标记块") enum_block = content[enum_start_idx:enum_end_idx] start_line = self.offset_to_lineno(content, enum_start_idx) end_line = self.offset_to_lineno(content, enum_end_idx) self.logger.info(f"找到 ENUM 标记范围:第 {start_line} 行 → 第 {end_line} 行") enum_pattern = re.compile( r'(enum\s+locale_[a-zA-Z0-9_]+(?:_[a-zA-Z0-9_]+)*_idx\s*\{)([^}]*)\}\s*;', re.DOTALL | re.IGNORECASE ) for match in enum_pattern.finditer(enum_block): enum_decl = match.group(0) self.logger.info(f" 解析枚举声明: {enum_decl}") enum_name_match = re.search(r'locale_[\w\d_]+_idx', enum_decl, re.IGNORECASE) if not enum_name_match: continue enum_name = enum_name_match.group(0).lower() body = match.group(2) # 在 parse_c_power_definitions() 中 body_no_comment = re.sub(r'//.*|/\*.*?\*/', '', body, flags=re.DOTALL) # 只提取 = 数字 的宏 valid_assignments = re.findall( r'(LOCALE_[A-Za-z0-9_]+)\s*=\s*(-?\b\d+\b)', body_no_comment ) macro_list = [m[0] for m in valid_assignments] value_map = {m: int(v) for m, v in valid_assignments} self.locale_enums[enum_name] = { "macros": macro_list, "values": value_map, "raw_body": body } self.logger.info(f" 解析枚举 {enum_name}: {len(macro_list)} 个宏") except Exception as e: self.logger.error(f"解析 ENUM 失败: {e}", exc_info=True) # --- 解析 TABLE 区域 --- try: table_start_idx = content.find(self.start_table_marker) table_end_idx = content.find(self.end_table_marker) if table_start_idx == -1 or table_end_idx == -1: raise ValueError("未找到 POWER TABLE 标记块") table_block = content[table_start_idx:table_end_idx] start_line = self.offset_to_lineno(content, table_start_idx) end_line = self.offset_to_lineno(content, table_end_idx) self.logger.info(f"找到 TABLE 标记范围:第 {start_line} 行 → 第 {end_line} 行") self.logger.info(" 提取的 table_block 示例:") self.logger.info(table_block[:500]) self.logger.info(table_block[-500:]) # === 增强解析 TABLE:按 /* Locale X */ 分块提取 === array_matches = list(re.finditer( r'static\s+const\s+unsigned\s+char\s+([\w\d_]+)\s*\[\s*\]\s*=\s*\{', table_block, re.IGNORECASE )) if not array_matches: self.logger.warning("未在 TABLE 区域找到任何 power table 数组定义") else: for match in array_matches: table_name = match.group(1) start_pos = match.start() brace_start = table_block.find('{', match.end()) if brace_start == -1: continue # 手动匹配大括号内容(支持嵌套) depth = 0 body_content = "" i = brace_start + 1 while i < len(table_block): c = table_block[i] if c == '{': depth += 1 elif c == '}': if depth == 0: break depth -= 1 body_content += c i += 1 # 逐行解析 body_content,按 /* Locale X */ 分块 lines = body_content.splitlines() entries = [] # 存储每一块: {'locale_tag': 'a_359', 'lines': [...]} current_block = [] current_locale = None for line in lines: stripped = line.strip() # 检查是否是新的 Locale 注释 comment_match = re.match(r'/\*\s*Locale\s+([A-Za-z0-9_-]+)\s*\([^)]+\)\s*\*/', stripped, re.IGNORECASE) if comment_match: # 保存上一个 block if current_locale and current_block: entries.append({ 'locale_tag': current_locale.lower(), 'lines': [ln.rstrip(',').rstrip() for ln in current_block] }) # 开始新 block raw_name = comment_match.group(1) # 如 A-359 normalized = raw_name.replace('-', '_').upper() # → A_359 current_locale = normalized current_block = [] continue # 忽略空行、纯注释行 clean_line = re.sub(r'/\*.*?\*/|//.*', '', stripped).strip() if clean_line: current_block.append(stripped) # 保留原始格式(含缩进和逗号) # 保存最后一个 block if current_locale and current_block: entries.append({ 'locale_tag': current_locale.lower(), 'lines': [ln.rstrip(',').rstrip() for ln in current_block] }) self.power_tables[table_name] = entries self.logger.info(f" 解析数组 {table_name}: {len(entries)} 个 Locale 数据块") except Exception as e: self.logger.error(f"解析 TABLE 失败: {e}", exc_info=True) def validate_and_repair(self): modified = False changes = [] all_locale_data = self.extract_all_raw_locale_data() for target in self.locale_targets: enum_name = target["enum"] table_name = target["table"] suffix = target["suffix"] # 关键修改:使用 assigned_locale,而不是 used_locales[idx] if "assigned_locale" not in target: raise KeyError(f"locale_targets 缺少 'assigned_locale': {target}") locale = target["assigned_locale"] macro_name = f"LOCALE_{suffix}_IDX_{locale.replace('-', '_').upper()}" if locale not in all_locale_data: raise RuntimeError(f"❌ 在 tx_limit_table.c 中找不到 Locale 数据: {locale}") data_lines = all_locale_data[locale] # --- 处理 ENUM --- if enum_name not in self.locale_enums: self.logger.warning(f"未找到枚举定义: {enum_name}") continue enum_data = self.locale_enums[enum_name] macros = enum_data["macros"] values = enum_data["values"] next_idx = self._get_next_enum_index(enum_name) if macro_name not in macros: macros.append(macro_name) values[macro_name] = next_idx changes.append(f"ENUM + {macro_name} = {next_idx}") modified = True if "pending_updates" not in enum_data: enum_data["pending_updates"] = [] enum_data["pending_updates"].append((macro_name, next_idx)) # --- 处理 TABLE --- if table_name not in self.power_tables: self.logger.warning(f"未找到 power table 数组: {table_name}") continue current_entries = self.power_tables[table_name] # 已经是 [{'locale_tag': ..., 'lines': [...]}, ...] # 归一化目标 locale 名称:us → US → us(小写用于比较) target_locale_normalized = target["assigned_locale"].replace('-', '_').upper().lower() # 检查是否已存在 already_exists = any( entry['locale_tag'] == target_locale_normalized for entry in current_entries ) if already_exists: self.logger.debug(f"Locale '{target['assigned_locale']}' 已存在于 {table_name},跳过") continue cleaned_new_lines = [ re.sub(r'/\*.*?\*/|//.*', '', ln).strip().rstrip(',') for ln in data_lines if re.sub(r'/\*.*?\*/|//.*', '', ln).strip() ] # 添加到内存结构(保持块结构) current_entries.append({ 'locale_tag': target_locale_normalized, 'lines': cleaned_new_lines }) changes.append(f"TABLE + {len(cleaned_new_lines)} 行 → {table_name}") modified = True # 记录待写入的数据(含原始行,用于生成注释) if table_name not in self.table_pending_appends: self.table_pending_appends[table_name] = [] self.table_pending_appends[table_name].append({ 'locale_tag': target["assigned_locale"], # 原始名(如 us) 'data_lines': data_lines # 原始带注释/缩进的行 }) if changes: self.logger.info(f"共需添加 {len(changes)} 项:\n" + "\n".join(f" → {ch}" for ch in changes)) return modified def _get_next_enum_index(self, enum_name): """基于已解析的 values 获取下一个可用索引""" if enum_name not in self.locale_enums: self.logger.warning(f"未找到枚举定义: {enum_name}") return 0 value_map = self.locale_enums[enum_name]["values"] # ✅ 直接使用已解析的数据 if not value_map: return 0 # 只考虑非负数(排除 CLM_LOC_NONE=-1, CLM_LOC_SAME=-2 等保留值) used_indices = [v for v in value_map.values() if v >= 0] if used_indices: next_idx = max(used_indices) + 1 else: next_idx = 0 # 没有有效数值时从 0 开始 return next_idx def extract_all_raw_locale_data(self) -> Dict[str, List[str]]: """ 从 output/tx_limit_table.c 中提取所有 /* Locale XXX */ 后面的数据块(直到下一个 Locale 或 EOF) 使用逐行解析,避免正则不匹配问题 """ #self.logger.info(f"📄 正在解析文件: {self.power}") #self.logger.info(f"🔍 前 300 字符:\n{self.power[:300]}") lines = self.power.splitlines() locale_data = {} current_locale = None current_block = [] for i, line in enumerate(lines): stripped = line.strip() # 检查是否是新的 Locale 标记 match = re.match(r'/\*\s*Locale\s+([A-Za-z0-9_]+)\s*\*/', stripped, re.IGNORECASE) if match: # 保存上一个 block if current_locale: # 清理 block:去空行、去注释、去尾逗号 cleaned = [ re.sub(r'/\*.*?\*/|//.*', '', ln).strip().rstrip(',') for ln in current_block if re.sub(r'/\*.*?\*/|//.*', '', ln).strip() ] locale_data[current_locale] = cleaned self.logger.debug(f" 已提取 Locale {current_locale},共 {len(cleaned)} 行") # 开始新 block current_locale = match.group(1) current_block = [] self.logger.debug(f" 发现 Locale: {current_locale}") continue # 收集当前 locale 的内容 if current_locale is not None: current_block.append(line) # 别忘了最后一个 block if current_locale: cleaned = [ re.sub(r'/\*.*?\*/|//.*', '', ln).strip().rstrip(',') for ln in current_block if re.sub(r'/\*.*?\*/|//.*', '', ln).strip() ] locale_data[current_locale] = cleaned self.logger.debug(f" 已提取最后 Locale {current_locale},共 {len(cleaned)} 行") self.logger.info(f" 成功提取 {len(locale_data)} 个 Locale 数据块: {list(locale_data.keys())}") return locale_data def _write_back_in_blocks(self): """将修改后的 enum 和 table 块写回原 C 文件,基于锚点 block 精准更新""" self.logger.info("正在写回修改后的数据...") if self.dry_run: self.logger.info("DRY-RUN: 跳过写入文件") return try: content = self.c_file_path.read_text(encoding='utf-8') # === Step 1: 查找所有锚点位置 === enum_start = content.find(self.start_enum_marker) enum_end = content.find(self.end_enum_marker) table_start = content.find(self.start_table_marker) table_end = content.find(self.end_table_marker) if -1 in (enum_start, enum_end, table_start, table_end): missing = [] if enum_start == -1: missing.append(f"起始 ENUM: {self.start_enum_marker}") if enum_end == -1: missing.append(f"结束 ENUM: {self.end_enum_marker}") if table_start == -1: missing.append(f"起始 TABLE: {self.start_table_marker}") if table_end == -1: missing.append(f"结束 TABLE: {self.end_table_marker}") raise ValueError(f"未找到锚点标记: {missing}") # === Step 2: 定义总修改范围 [header][block][footer] === enum_block=content[enum_start:enum_end] table_block=content[table_start:table_end] self.logger.info(f" 修改枚举范围: 第 {self.offset_to_lineno(content, enum_start)} 行 → " f"{self.offset_to_lineno(content, enum_end)} 行") self.logger.info(f" 修改数组范围: 第 {self.offset_to_lineno(content, table_start)} 行 → " f"{self.offset_to_lineno(content, table_end)} 行") replacements = [] # (start_in_block, end_in_block, replacement) def remove_comments(text): text = re.sub(r'//.*$', '', text, flags=re.MULTILINE) text = re.sub(r'/\*.*?\*/', '', text, flags=re.DOTALL) return text.strip() # === Step 3: 更新 ENUMs - 使用 pending_updates 中记录的所有宏 === for target in self.locale_targets: enum_name_key = target["enum"] enum_data = self.locale_enums.get(enum_name_key) if not enum_data or "pending_updates" not in enum_data: continue insertions = enum_data["pending_updates"] if not insertions: continue pattern = re.compile( rf'(enum\s+{re.escape(enum_name_key)}\s*\{{)([^}}]*)\}}\s*;', re.DOTALL | re.IGNORECASE ) match = pattern.search(enum_block) if not match: self.logger.warning(f"未找到枚举: {enum_name_key}") continue header_part = match.group(1) body_content = match.group(2) full_start = match.start() full_end = match.end() lines = [ln for ln in body_content.split('\n') if ln.strip()] last_line = lines[-1] if lines else "" clean_last = remove_comments(last_line) indent_match = re.match(r'^(\s*)', last_line) line_indent = indent_match.group(1) if indent_match else " " expanded_last = last_line.expandtabs(4) first_macro_match = re.search(r'LOCALE_[A-Z0-9_]+', clean_last) if first_macro_match: raw_before = last_line[:first_macro_match.start()] expanded_before = raw_before.expandtabs(4) target_macro_col = len(expanded_before) else: target_macro_col = len(line_indent.replace('\t', ' ')) eq_match = re.search(r'=\s*\d+', clean_last) if eq_match and first_macro_match: eq_abs_start = first_macro_match.start() + eq_match.start() raw_eq_part = last_line[:eq_abs_start] expanded_eq_part = raw_eq_part.expandtabs(4) target_eq_col = len(expanded_eq_part) else: target_eq_col = target_macro_col + 30 # 合理默认值 new_body = body_content.rstrip() if not new_body.endswith(','): new_body += ',' for macro_name, next_idx in insertions: # 遍历所有 pending 插入项 current_visual_len = len(macro_name.replace('\t', ' ')) padding_to_eq = max(1, target_eq_col - target_macro_col - current_visual_len) formatted_macro = f"{macro_name}{' ' * padding_to_eq}= {next_idx}" visible_macros = len(re.findall(r'LOCALE_[A-Z0-9_]+', clean_last)) MAX_PER_LINE = 4 if visible_macros < MAX_PER_LINE and last_line.strip(): insertion = f" {formatted_macro}," updated_last = last_line.rstrip() + insertion new_body = body_content.rsplit(last_line, 1)[0] + updated_last last_line = updated_last # 更新 last_line 用于下一次判断 clean_last = remove_comments(last_line) else: raw_indent_len = len(line_indent.replace('\t', ' ')) leading_spaces = max(0, target_macro_col - raw_indent_len) prefix_padding = ' ' * leading_spaces new_line = f"\n{line_indent}{prefix_padding}{formatted_macro}," new_body += new_line last_line = new_line.strip() clean_last = remove_comments(last_line) new_enum = f"{header_part}{new_body}\n}};" # 再计算全局偏移并添加替换任务 full_start = enum_start + match.start() # 全局起始位置 full_end = enum_start + match.end() # 全局结束位置 replacements.append((full_start, full_end, new_enum)) self.logger.debug(f" 插入 ENUM: {dict(insertions)}") # 清除标记防止重复插入 enum_data.pop("pending_updates", None) # === Step 4: 更新 TABLEs —— 动态表名(追加模式)=== seen = set() table_names = [] for target in self.locale_targets: name = target["table"] if name not in seen: table_names.append(name) seen.add(name) for table_name in table_names: if table_name not in self.power_tables: self.logger.info(f" 没有需要更新的表: {table_name}") continue # 查找 assigned_locale source_locale = None for tgt in self.locale_targets: if tgt["table"] == table_name: source_locale = tgt["assigned_locale"] break if not source_locale: self.logger.warning(f" 未在 locale_targets 中找到映射: {table_name}") continue # 使用 pending_appends 判断是否需要追加 if table_name not in self.table_pending_appends: self.logger.debug(f" 跳过无待追加数据的表: {table_name}") continue data_to_insert = self.table_pending_appends[table_name] if not data_to_insert: continue pattern = re.compile( rf'(\b{re.escape(table_name)}\s*\[\s*\]\s*=\s*\{{)(.*?)(\}}\s*;\s*)', re.DOTALL | re.IGNORECASE ) match = pattern.search(table_block) if not match: self.logger.warning(f" 未找到数组定义: {table_name}") continue header_part = match.group(1) body_content = match.group(2) footer_part = match.group(3) lines = [ln for ln in body_content.split('\n') if ln.strip()] last_line = lines[-1] if lines else "" indent_match = re.match(r'^(\s*)', last_line) line_indent = indent_match.group(1) if indent_match else " " expanded_last = last_line.expandtabs(4) first_struct_match = re.search(r'\{', remove_comments(last_line)) if first_struct_match: raw_before = last_line[:first_struct_match.start()] expanded_before = raw_before.expandtabs(4) target_struct_col = len(expanded_before) else: target_struct_col = len(line_indent.replace('\t', ' ')) raw_indent_len = len(line_indent.replace('\t', ' ')) leading_spaces = max(0, target_struct_col - raw_indent_len) prefix_padding = ' ' * leading_spaces new_block = "" for item in data_to_insert: item_clean = item.rstrip(',').strip() new_block += f"\n{line_indent}{prefix_padding}{item_clean}," new_body = body_content.rstrip() + new_block full_start = table_start + match.start() full_end = table_start + match.end() new_table = f"{header_part}{new_body}\n{footer_part}" replacements.append((full_start, full_end, new_table)) self.logger.info(f"【追加】{len(data_to_insert)} 行到 {table_name}") # 写完就清除,防止重复追加 self.table_pending_appends.pop(table_name, None) # === Step 5: 应用所有替换(倒序,避免偏移错乱)=== if not replacements: self.logger.info("无任何修改需要写入") return # 按全局偏移从后往前排序,防止前面修改影响后面位置 replacements.sort(key=lambda x: x[0], reverse=True) final_content = content # 从原内容开始 for start, end, r in replacements: self.logger.info(f"替换 [{start}:{end}] → 新内容:\n{r[:150]}...") final_content = final_content[:start] + r + final_content[end:] # 可选:检查是否真的变了 if content == final_content: self.logger.info(" 文件内容未发生变化,无需写入") return backup_path = self.c_file_path.with_suffix('.c.bak') copy2(self.c_file_path, backup_path) self.logger.info(f" 已备份 → {backup_path}") self.c_file_path.write_text(final_content, encoding='utf-8') self.logger.info(f" 成功写回 C 文件: {self.c_file_path}") self.logger.info(f" 共更新 {len(replacements)} 个区块") except Exception as e: self.logger.error(f" 写回文件失败: {e}", exc_info=True) raise def run(self): self.logger.info("开始同步 POWER LOCALE 定义...") try: self.parse_c_power_definitions() was_modified = self.validate_and_repair() if was_modified: if self.dry_run: self.logger.info("预览模式:检测到变更,但不会写入文件") else: self._write_back_in_blocks() # 关键:执行写入操作 self.logger.info("同步完成:已成功更新 C 文件") else: self.logger.info("所有 Locale 已存在,无需修改") return was_modified except Exception as e: self.logger.error(f"同步失败: {e}", exc_info=True) raise def main(): logging.basicConfig( level=logging.INFO, format='%(asctime)s [%(levelname)s] %(name)s: %(message)s', handlers=[ logging.FileHandler(LOG_FILE, encoding='utf-8'), logging.StreamHandler(sys.stdout) ], force=True ) logger = logging.getLogger(__name__) # 固定配置 c_file_path = "input/wlc_clm_data_6726b0.c" dry_run = False log_level = "INFO" config_path = "config/config.json" logging.getLogger().setLevel(log_level) print(f"开始同步 POWER LOCALE 定义...") print(f"C 源文件: {c_file_path}") if dry_run: print("启用 dry-run 模式:仅预览变更,不修改文件") try: sync = PowerTableSynchronizer( c_file_path=None, dry_run=dry_run, config_path=config_path, ) sync.run() print("同步完成!") print(f"详细日志已保存至: {LOG_FILE}") except FileNotFoundError as e: logger.error(f"文件未找到: {e}") print("请检查文件路径是否正确。") sys.exit(1) except PermissionError as e: logger.error(f"权限错误: {e}") print("无法读取或写入文件,请检查权限。") sys.exit(1) except Exception as e: logger.error(f"程序异常退出: {e}", exc_info=True) sys.exit(1) if __name__ == '__main__': main()
10-24
import os import sys import json import gc import time import concurrent.futures import traceback import numpy as np import librosa import torch import psutil import noisereduce as nr from typing import List, Dict, Tuple, Optional, Any from pydub import AudioSegment, effects from pydub.silence import split_on_silence from modelscope.pipelines import pipeline from modelscope.utils.constant import Tasks from transformers import AutoModelForSequenceClassification, AutoTokenizer from PyQt5.QtWidgets import (QApplication, QMainWindow, QWidget, QVBoxLayout, QHBoxLayout, QPushButton, QLabel, QLineEdit, QTextEdit, QFileDialog, QProgressBar, QGroupBox, QMessageBox, QListWidget, QSplitter, QTabWidget, QTableWidget, QTableWidgetItem, QHeaderView, QAction, QMenu, QToolBar, QComboBox, QSpinBox, QDialog, QDialogButtonBox) from PyQt5.QtCore import QThread, pyqtSignal, Qt from PyQt5.QtGui import QFont, QColor, QIcon from collections import deque import logging import shutil import subprocess import tempfile # 配置日志 logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s') logger = logging.getLogger("DialectQA") # ====================== 工具函数 ====================== def check_ffmpeg_available() -> Tuple[bool, str]: """检查ffmpeg是否可用并返回检查结果和说明""" if not shutil.which("ffmpeg"): return False, "系统中未找到ffmpeg,请安装并添加到PATH" try: result = subprocess.run( ["ffmpeg", "-version"], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, timeout=3 ) if "ffmpeg version" in result.stdout: return True, "FFmpeg已正确安装并可用" return False, "FFmpeg可执行但返回异常输出" except (subprocess.TimeoutExpired, FileNotFoundError): return False, "FFmpeg执行失败" except Exception as e: return False, f"FFmpeg检查出错: {str(e)}" def is_gpu_available() -> bool: """检查GPU是否可用""" return torch.cuda.is_available() and torch.cuda.device_count() > 0 # ====================== 增强型资源监控器 ====================== class EnhancedResourceMonitor: def __init__(self): self.gpu_available = is_gpu_available() self.history_size = 60 # 保留60秒历史数据 self.cpu_history = deque(maxlen=self.history_size) self.gpu_history = deque(maxlen=self.history_size) self.last_check_time = time.time() def __del__(self): """析构时释放资源""" if self.gpu_available: torch.cuda.empty_cache() def memory_percent(self) -> Dict[str, float]: """获取当前内存使用百分比""" try: result = {"cpu": psutil.virtual_memory().percent} if self.gpu_available: allocated = torch.cuda.memory_allocated() / (1024 ** 3) reserved = torch.cuda.memory_reserved() / (1024 ** 3) total = torch.cuda.get_device_properties(0).total_memory / (1024 ** 3) gpu_usage = (allocated + reserved) / total * 100 if total > 0 else 0 result["gpu"] = gpu_usage else: result["gpu"] = 0.0 current_time = time.time() if current_time - self.last_check_time >= 1.0: self.cpu_history.append(result["cpu"]) if self.gpu_available: self.gpu_history.append(result["gpu"]) self.last_check_time = current_time return result except Exception as e: logger.error(f"内存监控失败: {str(e)}") return {"cpu": 0, "gpu": 0} def get_usage_trend(self) -> Dict[str, float]: """获取内存使用趋势(移动平均值)""" if not self.cpu_history: return {"cpu": 0, "gpu": 0} cpu_avg = sum(self.cpu_history) / len(self.cpu_history) gpu_avg = sum(self.gpu_history) / len(self.gpu_history) if self.gpu_available and self.gpu_history else 0 return {"cpu": cpu_avg, "gpu": gpu_avg} def is_under_heavy_load(self, threshold: float = 85.0) -> bool: """检查系统是否处于高负载状态""" current = self.memory_percent() trend = self.get_usage_trend() return any([ current["cpu"] > threshold, current["gpu"] > threshold, trend["cpu"] > threshold, trend["gpu"] > threshold ]) # ====================== 方言处理器(增强版) ====================== class EnhancedDialectProcessor: KEYWORDS = { "opening": ("您好", "很高兴为您服务", "请问有什么可以帮您", "麻烦您喽", "请问搞哪样", "有咋个可以帮您", "多谢喽", "你好", "早上好", "下午好", "晚上好"), "closing": ("感谢来电", "祝您生活愉快", "再见", "搞归一喽", "麻烦您喽", "再见喽", "慢走喽", "谢谢", "拜拜"), "forbidden": ("不知道", "没办法", "你投诉吧", "随便你", "搞不成", "没得法", "随便你喽", "你投诉吧喽", "我不懂", "自己看"), "salutation": ("先生", "女士", "小姐", "老师", "师傅", "哥", "姐", "兄弟", "妹儿", "老板", "同志"), "reassurance": ("非常抱歉", "请不要着急", "我们会尽快处理", "理解您的心情", "实在对不住", "莫急哈", "马上帮您整", "理解您得很", "不好意思", "请您谅解", "我们会尽快解决") } # 扩展贵州方言到普通话的映射 _DIALECT_ITEMS = ( ("恼火得很", "非常生气"), ("鬼火戳", "很愤怒"), ("搞不成", "无法完成"), ("没得", "没有"), ("搞哪样嘛", "做什么呢"), ("归一喽", "完成了"), ("咋个", "怎么"), ("克哪点", "去哪里"), ("麻烦您喽", "麻烦您了"), ("多谢喽", "多谢了"), ("憨包", "傻瓜"), ("归一", "结束"), ("板扎", "很好"), ("鬼火冒", "非常生气"), ("背时", "倒霉"), ("吃豁皮", "占便宜"), ("扯拐", "出问题"), ("打脑壳", "头疼"), ("二天", "以后"), ("鬼火绿", "非常生气"), ("哈数", "规矩"), ("经事", "耐用"), ("抠脑壳", "思考"), ("拉稀摆带", "不靠谱"), ("马起脸", "板着脸"), ("哦豁", "哎呀"), ("皮坨", "拳头"), ("千翻", "顽皮"), ("日鼓鼓", "生气"), ("煞角", "结束"), ("舔肥", "巴结"), ("弯酸", "刁难"), ("歪得很", "凶"), ("悬掉掉", "危险"), ("妖艳儿", "炫耀"), ("渣渣", "垃圾") ) class TrieNode: __slots__ = ('children', 'is_end', 'value') def __init__(self): self.children = {} self.is_end = False self.value = "" # 类加载时直接构建Trie树 _trie_root = TrieNode() for dialect, standard in sorted(_DIALECT_ITEMS, key=lambda x: len(x[0]), reverse=True): node = _trie_root for char in dialect: if char not in node.children: node.children[char] = EnhancedDialectProcessor.TrieNode() node = node.children[char] node.is_end = True node.value = standard @classmethod def preprocess_text(cls, texts: List[str]) -> List[str]: """使用预构建的Trie树进行方言转换""" return [cls._process_single_text(text) for text in texts] @classmethod def _process_single_text(cls, text: str) -> str: """处理单个文本的核心逻辑""" result = [] i = 0 n = len(text) while i < n: node = cls._trie_root j = i last_match = None # 查找最长匹配 while j < n and text[j] in node.children: node = node.children[text[j]] j += 1 if node.is_end: last_match = (j, node.value) if last_match: end_index, replacement = last_match result.append(replacement) i = end_index else: result.append(text[i]) i += 1 return ''.join(result) # ====================== 系统配置管理器 ====================== class ConfigManager: __slots__ = ('config', 'dirty') _instance = None _DEFAULT_CONFIG = { "model_paths": { "asr": "D:/models/ASR-models/iic/speech_paraformer-large-vad-punc-spk_asr_nat-zh-cn", "sentiment": "D:/models/distilbert-base-multilingual-cased-sentiments-student" }, "sample_rate": 16000, "silence_thresh": -40, "min_silence_len": 1000, "max_concurrent": 1, "max_audio_duration": 3600, "enable_fp16": True, "enable_quantization": True, "max_sentiment_batch_size": 16 } def __new__(cls): if cls._instance is None: cls._instance = super().__new__(cls) cls._instance.dirty = False cls._instance.config = cls._DEFAULT_CONFIG.copy() cls._instance.load_config() return cls._instance def load_config(self): """加载配置文件""" try: if os.path.exists("config.json"): with open("config.json", "r", encoding="utf-8") as f: file_config = json.load(f) # 深度合并配置 for key, value in file_config.items(): if key in self.config and isinstance(self.config[key], dict) and isinstance(value, dict): self.config[key].update(value) else: self.config[key] = value except json.JSONDecodeError: logger.warning("配置文件格式错误,部分使用默认配置") except Exception as e: logger.error(f"加载配置失败: {str(e)},部分使用默认配置") def save_config(self, force=False): """延迟保存机制:仅当配置变化时保存""" if not force and not self.dirty: return try: with open("config.json", "w", encoding="utf-8") as f: json.dump(self.config, f, indent=2, ensure_ascii=False) self.dirty = False except Exception as e: logger.error(f"保存配置失败: {str(e)}") def get(self, key: str, default=None): return self.config.get(key, default) def set(self, key: str, value, immediate_save=False): self.config[key] = value self.dirty = True if immediate_save: self.save_config(force=True) def check_model_paths(self) -> Tuple[bool, List[str]]: errors = [] model_paths = self.get("model_paths", {}) for model_name, path in model_paths.items(): if not path: errors.append(f"{model_name}模型路径未设置") elif not os.path.exists(path): errors.append(f"{model_name}模型路径不存在: {path}") elif not os.path.isdir(path): errors.append(f"{model_name}模型路径不是有效的目录: {path}") return len(errors) == 0, errors def __del__(self): """析构时自动保存未持久化的更改""" if self.dirty: self.save_config(force=True) # ====================== 增强型音频处理器 ====================== class EnhancedAudioProcessor: SUPPORTED_FORMATS = ('.mp3', '.wav', '.amr', '.m4a') MAX_SEGMENT_DURATION = 5 * 60 * 1000 # 5分钟分段限制 ENHANCEMENT_CONFIG = { 'noise_sample_duration': 0.5, # 噪声采样时长(秒) 'telephone_filter_range': (300, 3400), # 电话频段范围(Hz) 'compression_threshold': -25.0, # 压缩阈值(dBFS) 'compression_ratio': 3.0 # 压缩比 } def __init__(self): self._noise_profile = None self._sample_rate = ConfigManager().get("sample_rate", 16000) @staticmethod def check_dependencies(): try: # 尝试导入所需库 import librosa import noisereduce return True, "依赖检查通过" except ImportError as e: return False, f"缺少依赖库: {str(e)}" def process_audio(self, input_path: str, temp_dir: str) -> Optional[List[str]]: """处理音频文件并返回分段文件路径列表""" if not self._validate_input(input_path, temp_dir): return None try: # 使用临时目录处理音频 with tempfile.TemporaryDirectory() as process_dir: audio = self._load_audio(input_path) if audio is None: return None # 基础预处理 audio = self._basic_preprocessing(audio) # 音频增强处理 audio = self._enhance_audio(audio) # 分段并保存 return self._segment_audio(audio, input_path, temp_dir or process_dir) except Exception as e: logger.error(f"音频处理失败: {str(e)}", exc_info=True) return None def _validate_input(self, input_path: str, temp_dir: str) -> bool: """验证输入参数有效性""" ffmpeg_available, ffmpeg_msg = check_ffmpeg_available() if not ffmpeg_available: logger.error(f"ffmpeg错误: {ffmpeg_msg}") return False deps_ok, deps_msg = self.check_dependencies() if not deps_ok: logger.error(f"依赖错误: {deps_msg}") return False os.makedirs(temp_dir, exist_ok=True) ext = os.path.splitext(input_path)[1].lower() if ext not in self.SUPPORTED_FORMATS: logger.error(f"不支持的音频格式: {ext}") return False if not os.path.exists(input_path): logger.error(f"文件不存在: {input_path}") return False return True def _load_audio(self, input_path: str) -> Optional[AudioSegment]: """加载音频文件""" try: return AudioSegment.from_file(input_path) except Exception as e: logger.error(f"无法加载音频文件: {str(e)}") return None def _basic_preprocessing(self, audio: AudioSegment) -> AudioSegment: """基础预处理:统一采样率和通道数""" # 确保音频为单声道 if audio.channels > 1: audio = audio.set_channels(1) # 统一采样率 if audio.frame_rate != self._sample_rate: audio = audio.set_frame_rate(self._sample_rate) return audio def _enhance_audio(self, audio: AudioSegment) -> AudioSegment: """执行音频增强处理流水线""" self._analyze_noise_profile(audio) audio = self._extract_main_voice(audio) audio = self._enhance_telephone_quality(audio) return self._normalize_audio(audio) def _analyze_noise_profile(self, audio: AudioSegment): """分析噪声样本以创建噪声剖面""" try: samples = np.array(audio.get_array_of_samples()) sr = audio.frame_rate noise_duration = int(sr * self.ENHANCEMENT_CONFIG['noise_sample_duration']) self._noise_profile = samples[:min(noise_duration, len(samples))].astype(np.float32) except Exception as e: logger.warning(f"噪声分析失败: {str(e)}") self._noise_profile = None def _extract_main_voice(self, audio: AudioSegment) -> AudioSegment: """从音频中提取主要人声""" if self._noise_profile is None: logger.warning("无噪声样本可用,跳过说话人提取") return audio try: samples = np.array(audio.get_array_of_samples()) sr = audio.frame_rate reduced_noise = nr.reduce_noise( y=samples.astype(np.float32), sr=sr, y_noise=self._noise_profile, prop_decrease=0.8 ) return AudioSegment( reduced_noise.astype(np.int16).tobytes(), frame_rate=sr, sample_width=2, channels=1 ) except Exception as e: logger.warning(f"降噪处理失败: {str(e)}") return audio def _enhance_telephone_quality(self, audio: AudioSegment) -> AudioSegment: """增强电话语音质量(带通滤波)""" try: low, high = self.ENHANCEMENT_CONFIG['telephone_filter_range'] return audio.low_pass_filter(high).high_pass_filter(low) except Exception as e: logger.warning(f"电话质量增强失败: {str(e)}") return audio def _normalize_audio(self, audio: AudioSegment) -> AudioSegment: """音频归一化处理""" try: # 动态范围压缩 audio = effects.compress_dynamic_range( audio, threshold=self.ENHANCEMENT_CONFIG['compression_threshold'], ratio=self.ENHANCEMENT_CONFIG['compression_ratio'] ) # 标准化音量 return effects.normalize(audio) except Exception as e: logger.warning(f"音频标准化失败: {str(e)}") return audio def _segment_audio(self, audio: AudioSegment, input_path: str, output_dir: str) -> List[str]: """根据静音分割音频""" min_silence_len = ConfigManager().get("min_silence_len", 1000) silence_thresh = ConfigManager().get("silence_thresh", -40) try: segments = split_on_silence( audio, min_silence_len=min_silence_len, silence_thresh=silence_thresh, keep_silence=500 ) # 确保分段不超过5分钟 merged_segments = [] current_segment = AudioSegment.silent(duration=0, frame_rate=self._sample_rate) for seg in segments: if len(current_segment) + len(seg) <= self.MAX_SEGMENT_DURATION: current_segment += seg else: merged_segments.append(current_segment) current_segment = seg if len(current_segment) > 0: merged_segments.append(current_segment) # 保存分段 output_files = [] base_name = os.path.splitext(os.path.basename(input_path))[0] for i, seg in enumerate(merged_segments): output_file = os.path.join(output_dir, f"{base_name}_segment_{i + 1}.wav") seg.export(output_file, format="wav") output_files.append(output_file) return output_files except Exception as e: logger.error(f"音频分割失败: {str(e)}") return [] # ====================== ASR处理器 ====================== class ASRProcessor: def __init__(self): self.config = ConfigManager() self._asr_pipeline = None self._gpu_available = is_gpu_available() self._initialize_pipeline() def _initialize_pipeline(self): """初始化ASR管道""" model_path = self.config.get("model_paths", {}).get("asr") if not model_path: logger.error("未配置ASR模型路径") return try: device = "gpu" if self._gpu_available else "cpu" self._asr_pipeline = pipeline( task=Tasks.auto_speech_recognition, model=model_path, device=device ) logger.info(f"ASR模型初始化完成,使用设备: {device}") except Exception as e: logger.error(f"ASR模型初始化失败: {str(e)}") self._asr_pipeline = None def transcribe(self, audio_path: str) -> Optional[str]: """转录单个音频文件""" if not self._asr_pipeline: logger.error("ASR管道未初始化") return None try: result = self._asr_pipeline(audio_path) return result.get('text', '') except Exception as e: logger.error(f"音频转录失败: {str(e)}") return None def batch_transcribe(self, audio_files: List[str]) -> List[Optional[str]]: """批量转录音频文件""" if not self._asr_pipeline: logger.error("ASR管道未初始化") return [None] * len(audio_files) results = [] for audio_file in audio_files: results.append(self.transcribe(audio_file)) # 转录后立即释放内存 torch.cuda.empty_cache() if self._gpu_available else gc.collect() return results # ====================== 情感分析器 ====================== class SentimentAnalyzer: def __init__(self): self.config = ConfigManager() self._tokenizer = None self._model = None self._gpu_available = is_gpu_available() self._initialize_model() def _initialize_model(self): """初始化情感分析模型""" model_path = self.config.get("model_paths", {}).get("sentiment") if not model_path: logger.error("未配置情感分析模型路径") return try: self._tokenizer = AutoTokenizer.from_pretrained(model_path) self._model = AutoModelForSequenceClassification.from_pretrained(model_path) if self._gpu_available: self._model = self._model.cuda() logger.info("情感分析模型初始化完成") except Exception as e: logger.error(f"情感分析模型初始化失败: {str(e)}") self._tokenizer = None self._model = None def analyze(self, texts: List[str]) -> List[Dict[str, float]]: """分析文本情感""" if not self._model or not self._tokenizer: logger.error("情感分析模型未初始化") return [{"positive": 0.0, "negative": 0.0, "neutral": 0.0}] * len(texts) try: # 分批处理 batch_size = self.config.get("max_sentiment_batch_size", 16) results = [] for i in range(0, len(texts), batch_size): batch = texts[i:i + batch_size] inputs = self._tokenizer( batch, padding=True, truncation=True, max_length=128, return_tensors="pt" ) if self._gpu_available: inputs = {k: v.cuda() for k, v in inputs.items()} with torch.no_grad(): outputs = self._model(**inputs) # 获取概率分布 probs = torch.nn.functional.softmax(outputs.logits, dim=-1).cpu().numpy() # 转换为字典格式 for j in range(probs.shape[0]): results.append({ "negative": float(probs[j][0]), "neutral": float(probs[j][1]), "positive": float(probs[j][2]) }) return results except Exception as e: logger.error(f"情感分析失败: {str(e)}") return [{"positive": 0.0, "negative": 0.0, "neutral": 0.0}] * len(texts) # ====================== 核心处理线程 ====================== class ProcessingThread(QThread): progress = pyqtSignal(int, str) finished = pyqtSignal(dict) error = pyqtSignal(str) def __init__(self, audio_path: str): super().__init__() self.audio_path = audio_path self.resource_monitor = EnhancedResourceMonitor() self._stop_requested = False def run(self): """处理流程主函数""" try: # 1. 初始化配置 config = ConfigManager() ok, errors = config.check_model_paths() if not ok: self.error.emit(f"模型路径配置错误: {'; '.join(errors)}") return # 2. 创建临时目录 temp_dir = tempfile.mkdtemp(prefix="dialectqa_") self.progress.emit(10, "创建临时目录完成") # 3. 预处理音频 audio_processor = EnhancedAudioProcessor() segments = audio_processor.process_audio(self.audio_path, temp_dir) if not segments: self.error.emit("音频预处理失败") return self.progress.emit(30, f"音频预处理完成,生成{len(segments)}个分段") # 4. ASR转录 asr = ASRProcessor() transcripts = asr.batch_transcribe(segments) if not any(transcripts): self.error.emit("ASR转录失败") return self.progress.emit(50, f"转录完成,总计{len(''.join(transcripts))}字") # 5. 方言预处理 transcripts = EnhancedDialectProcessor.preprocess_text(transcripts) self.progress.emit(60, "方言转换完成") # 6. 情感分析 sentiment = SentimentAnalyzer() sentiments = sentiment.analyze(transcripts) self.progress.emit(80, "情感分析完成") # 7. 关键字检测 keywords_stats = self._analyze_keywords(transcripts) self.progress.emit(90, "关键字检测完成") # 8. 结果汇总 result = { "audio_path": self.audio_path, "segments": segments, "transcripts": transcripts, "sentiments": sentiments, "keywords": keywords_stats } # 9. 清理资源 gc.collect() if self._gpu_available: torch.cuda.empty_cache() self.finished.emit(result) self.progress.emit(100, "处理完成") except Exception as e: self.error.emit(f"处理失败: {str(e)}\n{traceback.format_exc()}") finally: # 延迟清理临时目录(实际应用中可能需要保留结果) pass def _analyze_keywords(self, transcripts: List[str]) -> Dict[str, int]: """分析关键字出现频率""" stats = {category: 0 for category in EnhancedDialectProcessor.KEYWORDS} full_text = "".join(transcripts) for category, keywords in EnhancedDialectProcessor.KEYWORDS.items(): for kw in keywords: stats[category] += full_text.count(kw) return stats def stop(self): """请求停止处理""" self._stop_requested = True self.terminate() # ====================== 主界面 ====================== class DialectQAAnalyzer(QMainWindow): def __init__(self): super().__init__() self.setWindowTitle("方言客服语音质量分析系统") self.setGeometry(100, 100, 1200, 800) self.setWindowIcon(QIcon("icon.png")) # 初始化状态 self.audio_path = "" self.processing_thread = None self.results = None self._init_ui() self.check_dependencies() self.show() def _init_ui(self): """初始化用户界面""" # 创建主布局 main_widget = QWidget(self) main_layout = QVBoxLayout(main_widget) # 创建选项卡 tab_widget = QTabWidget() main_layout.addWidget(tab_widget) # 创建输入选项卡 input_tab = QWidget() input_layout = QVBoxLayout(input_tab) tab_widget.addTab(input_tab, "输入") # 音频选择区域 audio_group = QGroupBox("音频文件") audio_layout = QHBoxLayout(audio_group) self.audio_path_edit = QLineEdit() self.audio_path_edit.setReadOnly(True) audio_layout.addWidget(self.audio_path_edit, 4) browse_btn = QPushButton("浏览...") browse_btn.clicked.connect(self.select_audio) audio_layout.addWidget(browse_btn, 1) input_layout.addWidget(audio_group) # 进度区域 progress_group = QGroupBox("处理进度") progress_layout = QVBoxLayout(progress_group) self.progress_bar = QProgressBar() self.progress_bar.setRange(0, 100) self.progress_text = QLabel("准备就绪") progress_layout.addWidget(self.progress_bar) progress_layout.addWidget(self.progress_text) input_layout.addWidget(progress_group) # 操作按钮 button_layout = QHBoxLayout() self.start_btn = QPushButton("开始分析") self.start_btn.clicked.connect(self.start_processing) self.start_btn.setEnabled(False) self.stop_btn = QPushButton("停止分析") self.stop_btn.clicked.connect(self.stop_processing) self.stop_btn.setEnabled(False) button_layout.addWidget(self.start_btn) button_layout.addWidget(self.stop_btn) input_layout.addLayout(button_layout) # 结果预览区域 preview_group = QGroupBox("预览") preview_layout = QVBoxLayout(preview_group) self.preview_text = QTextEdit() self.preview_text.setReadOnly(True) preview_layout.addWidget(self.preview_text) input_layout.addWidget(preview_group) # 结果选项卡 result_tab = QWidget() result_layout = QVBoxLayout(result_tab) tab_widget.addTab(result_tab, "详细结果") # 结果表格 result_group = QGroupBox("分析明细") result_layout = QVBoxLayout(result_group) self.results_table = QTableWidget() self.results_table.setColumnCount(5) self.results_table.setHorizontalHeaderLabels(["分段", "文本内容", "积极", "中性", "消极"]) self.results_table.horizontalHeader().setSectionResizeMode(QHeaderView.Stretch) result_layout.addWidget(self.results_table) result_layout.addWidget(result_group) # 关键字统计 keywords_group = QGroupBox("关键字统计") keywords_layout = QVBoxLayout(keywords_group) self.keywords_table = QTableWidget() self.keywords_table.setColumnCount(2) self.keywords_table.setHorizontalHeaderLabels(["类别", "出现次数"]) self.keywords_table.horizontalHeader().setSectionResizeMode(QHeaderView.Stretch) keywords_layout.addWidget(self.keywords_table) result_layout.addWidget(keywords_group) # 状态栏 self.statusBar().showMessage("就绪") # 设置中心控件 self.setCentralWidget(main_widget) def check_dependencies(self): """检查系统依赖""" # 检查GPU if not is_gpu_available(): self.statusBar().showMessage("警告: 未检测到GPU,将使用CPU模式运行", 10000) # 检查FFmpeg ffmpeg_ok, ffmpeg_msg = check_ffmpeg_available() if not ffmpeg_ok: QMessageBox.warning(self, "依赖缺失", ffmpeg_msg) # 检查模型路径 config = ConfigManager() ok, errors = config.check_model_paths() if not ok: QMessageBox.warning(self, "配置错误", "\n".join(errors)) def select_audio(self): """选择音频文件""" file_path, _ = QFileDialog.getOpenFileName( self, "选择音频文件", "", "音频文件 (*.mp3 *.wav *.amr *.m4a)" ) if file_path: self.audio_path = file_path self.audio_path_edit.setText(file_path) self.start_btn.setEnabled(True) self.preview_text.setText(f"已选择文件: {file_path}") def start_processing(self): """开始处理音频""" if not self.audio_path: QMessageBox.warning(self, "错误", "请先选择音频文件") return # 禁用UI按钮 self.start_btn.setEnabled(False) self.stop_btn.setEnabled(True) self.preview_text.clear() # 创建处理线程 self.processing_thread = ProcessingThread(self.audio_path) self.processing_thread.progress.connect(self.update_progress) self.processing_thread.finished.connect(self.on_processing_finished) self.processing_thread.error.connect(self.on_processing_error) self.processing_thread.start() self.statusBar().showMessage("处理中...") def stop_processing(self): """停止处理""" if self.processing_thread and self.processing_thread.isRunning(): self.processing_thread.stop() self.stop_btn.setEnabled(False) self.statusBar().showMessage("已停止处理") def update_progress(self, value: int, message: str): """更新进度""" self.progress_bar.setValue(value) self.progress_text.setText(message) self.preview_text.append(message) def on_processing_finished(self, result: dict): """处理完成事件""" self.results = result self.stop_btn.setEnabled(False) self.start_btn.setEnabled(True) self.statusBar().showMessage("处理完成") # 更新结果表格 self.update_results_table() # 显示成功消息 QMessageBox.information(self, "完成", f"分析完成!\n音频时长: {self.calculate_audio_duration()}秒\n总字数: {len(''.join(result['transcripts']))}字") def on_processing_error(self, error: str): """处理错误事件""" self.stop_btn.setEnabled(False) self.start_btn.setEnabled(True) self.statusBar().showMessage("处理失败") # 显示错误详情 error_dialog = QDialog(self) error_dialog.setWindowTitle("处理错误") layout = QVBoxLayout() text_edit = QTextEdit() text_edit.setPlainText(error) text_edit.setReadOnly(True) layout.addWidget(text_edit) buttons = QDialogButtonBox(QDialogButtonBox.Ok) buttons.accepted.connect(error_dialog.accept) layout.addWidget(buttons) error_dialog.setLayout(layout) error_dialog.exec() def update_results_table(self): """更新结果表格""" if not self.results: return # 更新分段结果表格 segments = self.results.get("segments", []) transcripts = self.results.get("transcripts", []) sentiments = self.results.get("sentiments", []) self.results_table.setRowCount(len(segments)) for i in range(len(segments)): # 分段编号 self.results_table.setItem(i, 0, QTableWidgetItem(f"分段 {i + 1}")) # 文本内容 self.results_table.setItem(i, 1, QTableWidgetItem(transcripts[i])) # 情感分析结果 if i < len(sentiments): sentiment = sentiments[i] self.results_table.setItem(i, 2, QTableWidgetItem(f"{sentiment['positive'] * 100:.1f}%")) self.results_table.setItem(i, 3, QTableWidgetItem(f"{sentiment['neutral'] * 100:.1f}%")) self.results_table.setItem(i, 4, QTableWidgetItem(f"{sentiment['negative'] * 100:.1f}%")) # 更新关键字统计表格 keywords = self.results.get("keywords", {}) self.keywords_table.setRowCount(len(keywords)) for i, (category, count) in enumerate(keywords.items()): # 类别名称 self.keywords_table.setItem(i, 0, QTableWidgetItem(self._translate_category(category))) # 出现次数 self.keywords_table.setItem(i, 1, QTableWidgetItem(str(count))) # 根据次数设置颜色 if count > 0: for j in range(2): self.keywords_table.item(i, j).setBackground(QColor(255, 230, 230)) def _translate_category(self, category: str) -> str: """翻译关键字类别名称""" translations = { "opening": "开场白", "closing": "结束语", "forbidden": "禁用语", "salutation": "称呼语", "reassurance": "安抚语" } return translations.get(category, category) def calculate_audio_duration(self) -> float: """计算音频总时长(秒)""" if not self.audio_path or not os.path.exists(self.audio_path): return 0.0 try: audio = AudioSegment.from_file(self.audio_path) return len(audio) / 1000.0 # 转换为秒 except: return 0.0 # ====================== 主程序入口 ====================== @staticmethod def main(): # 启用高分屏支持 os.environ["QT_ENABLE_HIGHDPI_SCALING"] = "1" QApplication.setHighDpiScaleFactorRoundingPolicy(Qt.HighDpiScaleFactorRoundingPolicy.PassThrough) app = QApplication(sys.argv) app.setFont(QFont("Microsoft YaHei UI", 9)) # 设置默认字体 # 创建主窗口 window = DialectQAAnalyzer() window.show() # 检查资源 monitor = EnhancedResourceMonitor() if monitor.is_under_heavy_load(): QMessageBox.warning(window, "系统警告", "当前系统资源负载较高,性能可能受影响") # 运行应用 sys.exit(app.exec_()) if __name__ == "__main__": try: DialectQAAnalyzer.main() # 调用静态方法 except Exception as e: error_msg = f"致命错误: {str(e)}\n{traceback.format_exc()}" logger.critical(error_msg) # 创建临时错误报告 temp_file = os.path.join(os.getcwd(), "crash_report.txt") with open(temp_file, "w", encoding="utf-8") as f: f.write(error_msg) # 显示错误对话框 app = QApplication(sys.argv) msg_box = QMessageBox() msg_box.setIcon(QMessageBox.Critical) msg_box.setWindowTitle("系统崩溃") msg_box.setText("程序遇到致命错误,已终止运行") msg_box.setInformativeText(f"错误报告已保存到: {temp_file}") msg_box.exec() 运行以上代码时错先错误提示: 未解析的引用 'EnhancedDialectProcessor':164
09-09
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值