Encoding fixes, glob, logging, replace, re.sub — 2016.05.30 Review

This post describes a Python workflow for processing credit reports: a multi-table join locates the credit records tied to a given point in time, and two strategies are then used to strip sensitive information — exact matching with regular expressions, or querying the parsed values from the database and deleting them from the text.

Yesterday's task was to write a script that strips certain sensitive fields from credit reports. The job splits into two steps. Step one is finding each customer's credit report from before a given point in time. A join across several tables yields the filename that was parsed at the time, but when many files end up in one folder, names can collide, so the report that gets moved may not be the one that actually backed that loan application. Recovering exactly the report used for the original loan is therefore hard to do reliably; worse, one customer may have three loans and thus three credit pulls — would we need all three copies? Given these problems, full precision isn't required. Everything has a time cost; as with any estimate, sometimes an approximation is good enough and exact computation isn't worth it.
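As a rough sketch of step one (the table and column names here are hypothetical — the real schema isn't shown in this post), the join to collect filenames before a cutoff date might look like:

import psycopg2

# Hypothetical schema: loans(customer_id, apply_time, report_id)
# and reports(report_id, filename). Adjust to the real tables.
conn = psycopg2.connect("dbname=credit user=me")
cur = conn.cursor()
cur.execute("""
    SELECT r.filename
    FROM loans l
    JOIN reports r ON r.report_id = l.report_id
    WHERE l.apply_time < %s
""", ('2016-05-01',))
filenames = [row[0] for row in cur.fetchall()]  # one tuple per record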
Step two is removing the sensitive fields from the reports we found. I came up with two methods. The first is a pure regular-expression approach: if the regex is written exactly right (matching neither too much nor too little, which will take several rounds of debugging), it should be the best method, but the text structures vary, and the real difficulties only surface while writing and debugging. The hard part is matching Chinese text mixed with lots of \n, \t and other special characters, which may need strip() to clean up; the regex is writable, just fairly complex and in need of repeated tuning. The second method is to query the parsed values from the database and delete those strings from the original HTML. But because of the move problem described in step one, this can leave small omissions — the file may not be the exact report from back then — and it can also leave fragments behind, e.g. an earlier, longer match deletes the main body and stray text like 重庆市 (Chongqing) survives. A third method also occurred to me, built on my earlier parsing program: for each HTML file found, rerun the old parser, locate the corresponding fields via a config file, and delete them with string.replace(). That should solve the problem fairly cleanly, but it is expensive — it drags in the whole old parsing pipeline!
For now the problem is solved with the second method.
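A minimal sketch of that second method (the parsed_fields table and its columns are assumptions for illustration, and cur/report_id carry over from the query above): fetch the values that were parsed out of this report, then delete each one from the HTML.

import io

# Hypothetical table parsed_fields(report_id, value) holding the strings
# extracted from this report at parse time.
cur.execute("SELECT value FROM parsed_fields WHERE report_id = %s", (report_id,))
sensitive_values = [row[0] for row in cur.fetchall()]

with io.open(html_path, encoding='utf-8') as f:
    html = f.read()
for value in sensitive_values:
    html = html.replace(value, '')   # delete each parsed string
with io.open(html_path, 'w', encoding='utf-8') as f:
    f.write(html)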
Pasting HTML source code into Excel actually gets parsed — I couldn't believe my eyes! I looked it up: a single cell can hold 2^15 = 32,767 characters. I didn't dig into how that relates to bytes; either way, a cell can't hold the HTML source data.
I also came across a module that works like the Windows file search: glob.
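For example, globbing for report files (the directory and pattern here are made up):

import glob

# Wildcard matching, similar to searching for files in Windows Explorer
for path in glob.glob(r'C:\reports\*.html'):
    print(path)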
Besides info, warning and debug, the logging module also has error and critical.
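All five levels in one quick sketch:

import logging

logging.basicConfig(level=logging.DEBUG)   # show everything from DEBUG up
logging.debug('lowest severity')
logging.info('progress message')
logging.warning('something looks off')
logging.error('an operation failed')
logging.critical('highest severity')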
To delete a substring in Python, use replacement: string.replace('substring to remove', 'replacement'). For pattern-based replacement there is the re module: re.sub(pattern, replacement, string) — note the pattern comes first and the string to process last.
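Both in action (deleting a substring means replacing it with the empty string):

import re

s = 'phone: 138-0000-0000, name: test'
print(s.replace('test', ''))                    # literal replacement
print(re.sub(r'\d{3}-\d{4}-\d{4}', '***', s))   # pattern replacement: pattern, repl, string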
Finally, I dug into the encoding question again and found that my understanding from last Friday was wrong. What yesterday's digging showed: string literals take on whatever encoding the source file itself is saved in. The coding: utf-8 line at the top is a declaration to the interpreter, telling it what encoding to assume. If the two disagree — say the file is written in UTF-8 but the header says #coding:cp936 — the interpreter won't complain as long as it can still make sense of the source (the check is roughly: decode the whole file, comments included, as cp936; if the bytes still map to cp936 characters, even as mojibake, it passes, but if they hit undecodable sequences that would turn into ? placeholders, compilation fails). In other words, once the code compiles, the encoding of the output is determined by the encoding the file was actually written in. Also: a filename written to Windows must be GBK-encoded — the program does not convert it automatically just because you issued an instruction to write such a filename (that is exactly where I went wrong last week). A GBK string concatenated with a UTF-8 string is a mixed-encoding string, and formatting a unicode variable into a byte string turns the whole result into unicode!
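A small sketch of the two gotchas (this is Python 2 str/unicode behavior, as in the post; Python 3 separates bytes and str outright, and the GBK filesystem assumption holds on a Chinese-locale Windows):

# -*- coding: utf-8 -*-
# Python 2: literals carry the raw bytes of whatever encoding the file is saved in.
name_utf8 = '报告.html'                               # UTF-8 bytes (file saved as UTF-8)
name_gbk = name_utf8.decode('utf-8').encode('gbk')    # convert explicitly for Windows
open(name_gbk, 'w').close()                           # filename bytes must be GBK here

u = u'中文'
s = '%s' % u    # formatting a unicode into a str promotes the whole result to unicode
print(type(s))  # <type 'unicode'>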
The result of psycopg2's cur.fetchall() is two-dimensional: a list of tuples, where each tuple is one record and the tuple's items are the individual columns!
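So iterating looks like this (reusing the cursor from above; the query is illustrative):

cur.execute("SELECT report_id, filename FROM reports")
rows = cur.fetchall()                 # e.g. [(1, 'a.html'), (2, 'b.html')]
for report_id, filename in rows:      # one tuple per record, columns unpacked
    print(report_id, filename)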
That's about it!
