Chapter 15.1 The require Function

  Lua provides a higher-level function called require to load modules, but this function assumes only a minimal notion of what a module is.

To require, a module is simply a chunk of code that defines some values (functions, or tables containing functions).

To load a module, we simply call require "modname". The function first checks the table package.loaded to see whether modname is already loaded.

If it is, require returns the value stored at package.loaded[modname].

Otherwise, it tries to find a loader for the module.

require first searches for a Lua file. If it finds one, it loads the file with loadfile; the result is a function, which we call the loader.

If it cannot find a Lua file, it searches for a C library with that name. If it finds one, it loads the library with package.loadlib, looking for a function named luaopen_modname; in this case, that C function is the loader.

Whenever it finds a loader, require calls the loader with two arguments: the module name and an extra value that depends on how it got the loader (when the loader comes from a file, this extra value is the file name).

If the loader returns a non-nil value, require assigns that value to package.loaded[modname]; otherwise, it stores true there. In either case, require returns the final value of package.loaded[modname]. If there is any error loading or running the module, or if it cannot find any loader for the module, require raises an error.

If we want to force require to load a module twice, we can erase its entry from package.loaded:

package.loaded[modname] = nil
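This caching behavior can be demonstrated without any files on disk by registering a loader in package.preload (the table that require's first searcher consults). The module name demo here is made up for the example:

```lua
-- Sketch of require's caching: "demo" is a hypothetical module
-- whose loader we register directly in package.preload.
local runs = 0
package.preload["demo"] = function()
  runs = runs + 1              -- counts how many times the loader runs
  return { loads = runs }
end

local m1 = require "demo"
local m2 = require "demo"      -- cached: the loader does not run again
print(m1 == m2, m1.loads)      --> true    1

package.loaded["demo"] = nil   -- erase the cache entry...
local m3 = require "demo"      -- ...so require runs the loader again
print(m3 == m1, m3.loads)      --> false   2
```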

Sometimes we need to rename a module to avoid a name clash, for instance when we need to load different versions of the same module. For a module written in Lua, renaming the file is enough.

For a C library, however, we cannot edit the name of its luaopen_* function inside the binary. To allow such renamings, require uses a small trick:

if the module name contains a hyphen ("-"), require drops part of the name at the hyphen when building the luaopen_* function name,

because a hyphen is not a legal character in a C function name.

In Lua 5.3, for a module named mod-hello, require assumes its open function is luaopen_mod, not luaopen_mod_hello. Therefore, if we need two versions of the same module mod, we can rename one copy mod-hello and the other mod-world; both still export luaopen_mod.

When we call

m1 = require "mod-hello"

require finds the file mod-hello and, inside that file, looks for the function luaopen_mod.

Note: Lua 5.2 and Lua 5.3 behave in exactly opposite ways here: 5.2 removes the part before the "-", while 5.3 removes the part after it.

In Lua 5.3, given a module name such as a.b.c-v2.1, the open-function name is luaopen_a_b_c: everything after the hyphen is ignored, and the remaining dots become underscores.
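The Lua 5.3 rule can be sketched in a few lines of string manipulation (openfunc_name is our own helper, not part of the standard library):

```lua
-- Sketch of how Lua 5.3 derives the luaopen_* name: the hyphen
-- and everything after it are dropped, then dots become underscores.
local function openfunc_name(modname)
  local root = modname:gsub("%-.*$", "")     -- strip "-v2.1" and the like
  return "luaopen_" .. root:gsub("%.", "_")  -- dots are not legal in C names
end

print(openfunc_name("a.b.c-v2.1"))  --> luaopen_a_b_c
print(openfunc_name("mod-hello"))   --> luaopen_mod
```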

Path Search

  When searching for a Lua file, the path that require uses is a list of templates, each of which specifies a candidate file name for a module.

Each template may contain a question mark ('?'); require replaces the '?' with the module name and then checks whether a file with the resulting name exists.

If it does not, require moves on to the next template. The templates are separated by semicolons (';'):

?;?.lua;c:\Windows\?;/usr/local/lua/?/?.lua

Given this path, the call require "mod" will try the following files:

mod
mod.lua
c:\Windows\mod
/usr/local/lua/mod/mod.lua

require handles only the ';' (as the template separator) and the '?'; everything else, such as directory separators and file extensions, is defined by the path itself.
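The substitution can be sketched as follows (candidates is a made-up helper; the real require additionally converts dots in the module name into directory separators before substituting):

```lua
-- Split a path on ";" and replace each "?" with the module name.
local function candidates(path, modname)
  local names = {}
  for template in path:gmatch("[^;]+") do
    -- parentheses keep only gsub's first return value
    names[#names + 1] = (template:gsub("%?", modname))
  end
  return names
end

for _, name in ipairs(candidates("?;?.lua;/usr/local/lua/?/?.lua", "mod")) do
  print(name)
end
--> mod
--> mod.lua
--> /usr/local/lua/mod/mod.lua
```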

When looking for Lua files, require uses the path stored in the variable package.path. Lua initializes this variable from the environment variable LUA_PATH_5_3.

If that variable is not defined, Lua tries LUA_PATH. If neither is defined, Lua uses the compiled-in default path from luaconf.h.

Any occurrence of ";;" in the environment variable's value is replaced by that default path.

For instance, if we set LUA_PATH_5_3 to "mydir/?.lua;;", the final path will be "mydir/?.lua" followed by the default path.

require searches for C libraries in the same way, but using the variable package.cpath, which is initialized from LUA_CPATH_5_3 and LUA_CPATH.

A typical value on UNIX is:

./?.so;/usr/local/lib/lua/5.3/?.so

The function package.searchpath encodes all these rules for searching libraries. It takes a module name and a path.

The path is a string of semicolon-separated templates, and searchpath looks for the given name following the rules described above.

It returns the name of the first file that it can open in read mode (closing the file right away), or nil plus an error message if no such file exists.

The error message lists all the file names it tried to open.
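For example (the module name below is deliberately bogus, so the call is expected to fail and return the error listing):

```lua
-- package.searchpath follows the same ';' / '?' rules as require.
local path = "?.lua;./lib/?.lua"
local name, err = package.searchpath("surely_missing_mod", path)
print(name)  --> nil (no such file exists)
print(err)   -- lists every candidate file name that was tried
```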

Searchers

  The table that controls how require loads modules is package.searchers.

Each entry in this table is a searcher function. When looking for a module, require calls each searcher in turn, passing the module name as its only argument.

A searcher that succeeds returns a function (the module's loader) plus an extra value to be passed to that loader; a searcher that fails returns nil, possibly with an error message.

Lua initializes this table with four searcher functions:

1>  Looks for a loader in the package.preload table; if the module name is not a key there, this searcher finds nothing.

2>  Looks for a loader as a Lua library, using the paths stored in package.path.

3>  Looks for a loader as a C library, using the paths stored in package.cpath.

4>  The all-in-one loader: it searches the C path for the root name of the module. For "a.b.c", it looks for a library a; if it finds one, it looks inside that library for the open function of the submodule, namely luaopen_a_b_c.

For example, when requiring "hello.core", this searcher looks for the library hello and, inside it, for the function luaopen_hello_core.

With this facility, several C submodules can be packed into a single library, each keeping its original open-function name.
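A custom searcher can also be prepended to that table. Here is a minimal sketch in which the module name greeting and the searcher itself are invented for the example:

```lua
-- A searcher either returns a loader for the module (plus an extra
-- value to pass to that loader), or a failure message.
local function my_searcher(modname)
  if modname ~= "greeting" then
    return "\n\tno in-memory module '" .. modname .. "'"
  end
  local function loader(name, extra)   -- extra is "in-memory" below
    return { hello = function() return "hello from " .. name end }
  end
  return loader, "in-memory"
end

-- The table is package.searchers since Lua 5.2
-- (it was called package.loaders in Lua 5.1).
table.insert(package.searchers, 1, my_searcher)

local g = require "greeting"
print(g.hello())  --> hello from greeting
```

Because the searcher's result is cached in package.loaded like any other module, a second require "greeting" returns the same table without calling the searcher again.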

 The material above comes from 《Lua程序设计(第二版)》 and Programming in Lua, Third Edition.

Reposted from: https://www.cnblogs.com/daiker/p/5865709.html

sub_type=sub_type, block_name=block_name, old=None, new=tb_target_item, belong_to=belong_to) setattr(diff_obj, 'head_type', head_type) ls_tb_add.append(diff_obj) # 如果是表格、行或列,处理单元格中的内容 if sub_type in ('row', 'col', 'table'): cells_list = tb_target_item.cells if sub_type != 'table' else [ cell for row in tb_target_item.rows for cell in row.cells] ls_tb_add.extend(process_graphic_objects('add', cells_list)) return ls_tb_delete, ls_tb_add def transpose_table_rows(self, rows): """ 将表格的行进行转置操作,即将行转换为列,列转换为行。 Args: rows (list): 原始表格的行列表,每个元素是一个RowObject对象 Returns: list: 返回转置后的行列表,每个元素是一个RowObject对象 """ # 创建新的行对象列表,数量等于原始表格的最大列数 max_cell_count = 0 for row in rows: if len(row.cells) > max_cell_count: max_cell_count = len(row.cells) t_rows = [RowObject() for _ in range(max_cell_count)] # 遍历原始表格的每一行 for row in rows: # 遍历每一行的单元格 for idx, cell in enumerate(row.cells): # 将单元格添加到转置后的对应行中 t_rows[idx].cells.append(cell) # 为转置后的每一行设置属性 for row in t_rows: # 设置行的坐标为第一个单元格的坐标 row.coordinate = row.cells[0].coordinate # 设置行的数据ID为第一个单元格的数据ID row.data_id = row.cells[0].data_id # 设置行的布局为第一个单元格的布局 row.layout = row.cells[0].layout # 如果第一个单元格有列索引,则设置行的列索引 if isinstance(row.cells[0].col_index, int): row.col_index = row.cells[0].col_index # 如果第一个单元格有行索引,则设置行的行索引 if isinstance(row.cells[0].row_index, int): row.row_index = row.cells[0].row_index # 返回转置后的行列表 return t_rows def find_differences(self, array1: list, array2: list, old_items, new_items, diff_mode='normal'): if isinstance(array1, list): array1 = np.array(array1) if isinstance(array2, list): array2 = np.array(array2) # 确保两个ndarray的shape相同 if array1.shape != array2.shape: raise ValueError("两个ndarray的shape必须相同") diff_type = diff_mode if diff_type == 'normal': # 计算行差异数 row_diff_count = np.sum(~np.all(array1 == array2, axis=1)) # 计算列差异数 col_diff_count = np.sum(~np.all(array1 == array2, axis=0)) # 根据差异数选择差异类型 diff_type = 'col' if col_diff_count < row_diff_count else 'row' # 找出所有行和列的差异项 row_diffs, col_diffs, cell_diffs, cell_diff_points, 
cell_diff_values, row_diff_idx, col_diff_idx, graphic_diffs, picture_diffs = [ [] for _ in range(9)] if diff_type == 'row': for i in range(array1.shape[0]): if not np.all(array1[i] == array2[i]): # if np.sum(array1[i] != array2[i]) < settings.T_MERGE_MULTI_CELL_UPDATE_TO_ROW_UPDATE_MIN_CELL_COUNT: # 如果该行只有一个数值不一致,则将这个差异项改为单元格的差异项 for j in range(array1.shape[1]): if array1[i, j] == array2[i, j]: continue diff_point, diff_values, graphic_diff, picture_diff = self.get_cell_not_equal_attrs( old_items[i][j], new_items[i][j], settings.CELL_COMPARE_ATTRS) if diff_point: cell_diffs.append((i, j)) cell_diff_points.append(' '.join(diff_point)) cell_diff_values.append(diff_values) if graphic_diff: graphic_diffs.append(graphic_diff) if picture_diff: picture_diffs.append(picture_diff) # else: # row_diffs.append(i) # row_diff_idx.append(np.where(array1[i] != array2[i])[0].tolist()) else: for j in range(array1.shape[1]): if not np.all(array1[:, j] == array2[:, j]): # if np.sum(array1[:, j] != array2[:,j]) < settings.T_MERGE_MULTI_CELL_UPDATE_TO_ROW_UPDATE_MIN_CELL_COUNT: # 如果该列只有一个数值不一致,则将这个差异项改为单元格的差异项 for i in range(array1.shape[0]): if array1[i, j] == array2[i, j]: continue diff_point, diff_values, graphic_diff, picture_diff = self.get_cell_not_equal_attrs( old_items[i][j], new_items[i][j], settings.CELL_COMPARE_ATTRS) if diff_point: cell_diffs.append((i, j)) cell_diff_points.append(' '.join(diff_point)) cell_diff_values.append(diff_values) if graphic_diff: graphic_diffs.append(graphic_diff) if picture_diff: picture_diffs.append(picture_diff) # else: # col_diffs.append(j) # col_diff_idx.append(np.where(array1[:, j] != array2[:, j])[0].tolist()) # 返回所有差异类型对应的单元格索引 return diff_type, row_diffs, col_diffs, cell_diffs, cell_diff_points, cell_diff_values, row_diff_idx, col_diff_idx, graphic_diffs, picture_diffs @staticmethod def get_cell_chars(cell_obj): chars = [] for text_obj in cell_obj.content: if not is_instance_of(text_obj, TextObject): continue 
chars.extend(text_obj.get_chars()) return chars def _compare_cell_diff(self, base, target, data_type, block_name=''): """ 对比单元格图像的方法 """ result = [] if data_type == 'graphic': cp_obj = GraphicComparer(self._base_block_mapping, self._target_block_mapping, self._path_base, self._path_target) else: cp_obj = PictureComparer(self._base_block_mapping, self._target_block_mapping, self._path_base, self._path_target) item_result = cp_obj.compare(block_name, base, target, 'cell', True) result.extend(item_result['add']) result.extend(item_result['delete']) result.extend(item_result['update']) return result @staticmethod def _get_graphic_picture_obj(old_cell, new_cell): base_graphic = [] target_graphic = [] base_picture = [] target_picture = [] for base_item in old_cell.content: if is_instance_of(base_item, GraphicObject): base_graphic.append(base_item) elif is_instance_of(base_item, PictureObject): base_picture.append(base_item) for new_item in new_cell.content: if is_instance_of(new_item, GraphicObject): target_graphic.append(new_item) elif is_instance_of(new_item, PictureObject): target_picture.append(new_item) return [(base_graphic, target_graphic), (base_picture, target_picture)] def _get_cell_graphic_picture_diff(self, old_cell, new_cell): """ 对比单元格图形图像的方法 """ graphic_diff = [] picture_diff = [] block = old_cell while not isinstance(block, DocumentBlockObject) and block and hasattr(block, 'layout'): block = block.layout.parent_ref block_name = block.name if block else '' graphic_obj, picture_obj = self._get_graphic_picture_obj(old_cell, new_cell) if graphic_obj[0] or graphic_obj[1]: graphic_diff = self._compare_cell_diff(graphic_obj[0], graphic_obj[1], 'graphic', block_name) if picture_obj[0] or picture_obj[1]: picture_diff = self._compare_cell_diff(picture_obj[0], picture_obj[1], 'picture', block_name) return graphic_diff, picture_diff def get_cell_not_equal_attrs(self, old_cell, new_cell, compare_attrs): diff_attrs = [] diff_values = [] if getattr(old_cell, 
'auto_number', None) and getattr(new_cell, 'auto_number', None): return [], [], [], [] if old_cell.text != new_cell.text: diff_attrs.append('text') diff_values.append((old_cell.text, new_cell.text)) else: # 直接在对象上取值的属性 # direct_attr = ['style.background_color', 'border.border_top.border_style', 'style.background_color', # 'border.border_bottom.border_style', 'border.border_left.border_style', # 'border.border_right.border_style', 'style.background_style'] direct_attr = ['style.background_color', 'style.background_style'] attrs, values = self.get_not_equal_attrs(old_cell, new_cell, direct_attr) diff_attrs.extend(attrs) diff_values.extend(values) for old_char, new_char in zip_longest(self.get_cell_chars(old_cell), self.get_cell_chars(new_cell), fillvalue=None): if old_char is None or new_char is None: diff_attrs.append('text') diff_values.append((str(old_char), str(new_char))) else: attrs, values = self.get_not_equal_attrs(old_char, new_char, compare_attrs) diff_attrs.extend(attrs) diff_values.extend(values) # 单元格增加图形图像的比较 graphic_diff, picture_diff = self._get_cell_graphic_picture_diff(old_cell, new_cell) unique_diff_attrs = list(set(diff_attrs)) unique_not_equal_values = [diff_values[diff_attrs.index(v)] for v in unique_diff_attrs] return unique_diff_attrs, unique_not_equal_values, graphic_diff, picture_diff def get_cell_content_list(self, cell_obj_lists, with_attr=False): content_lists = [] processed_merged_ranges = set() for cell_obj_list in cell_obj_lists: row_content_list = [] for cell_obj in cell_obj_list: # 检查是否是合并单元格且已经处理过 if hasattr(cell_obj, 'merged_ranges') and cell_obj.merged_ranges: # 创建一个基于合并范围和内容的唯一键 merged_key = (tuple(cell_obj.merged_ranges), str(getattr(cell_obj, 'text', ''))) if merged_key in processed_merged_ranges: # 如果已经处理过,设置为空字符串 row_content_list.append('') # 直接添加空字符串到结果中 continue else: # 如果是合并单元格但未处理过,标记为已处理 processed_merged_ranges.add(merged_key) cell_contents = [f'text:{cell_obj.text}'] if with_attr: attr_list = settings.CELL_COMPARE_ATTRS 
for attr in attr_list: if attr == "text": continue attr_value = self.get_nest_attr(cell_obj, attr) if attr_value not in (None, ''): cell_contents.append(f'{attr}:{str(attr_value)}') row_content_list.append('🙉'.join(cell_contents)) content_lists.append(row_content_list) return content_lists def get_nest_attr(self, obj, nest_attr): if is_instance_of(obj, CellObject): result_attr = [] # 特殊处理单元格背景色 # if nest_attr in ('style.background_color', 'border.border_top.border_style', # 'border.border_bottom.border_style', 'border.border_left.border_style', # 'border.border_right.border_style', 'style.background_style'): if nest_attr in ('style.background_color', 'style.background_style'): return self.get_target_attr(obj, nest_attr) if nest_attr == 'font_background_color': nest_attr = 'style.background_color' for item_obj in obj.content: if is_instance_of(item_obj, GraphicObject) and nest_attr == 'graphic': for item_attr in settings.PICTURE_COMPARE_ATTRS: run_attr = self.get_target_attr(item_obj, item_attr) if run_attr and str(run_attr) not in result_attr: result_attr.append(str(run_attr)) graphic_text_obj = getattr(item_obj, 'text_obj', None) if graphic_text_obj and graphic_text_obj.text: text_attr_list = [] for text_attr in settings.TEXT_COMPARE_ATTRS: for run_obj in graphic_text_obj.runs: attr_val = self.get_target_attr(run_obj, text_attr) if attr_val and str(attr_val) not in text_attr_list: text_attr_list.append(str(attr_val)) if text_attr_list: result_attr.extend(text_attr_list) elif is_instance_of(item_obj, PictureObject) and nest_attr == 'picture': for item_attr in settings.PICTURE_COMPARE_ATTRS: run_attr = self.get_target_attr(item_obj, item_attr) if run_attr and str(run_attr) not in result_attr: result_attr.append(str(run_attr)) else: if is_instance_of(item_obj, TextObject): for run_obj in item_obj.runs: run_attr = self.get_target_attr(run_obj, nest_attr) if run_attr and str(run_attr) not in result_attr: result_attr.append(str(run_attr)) elif is_instance_of(item_obj, 
RunObject): run_attr = self.get_target_attr(item_obj, nest_attr) if run_attr and str(run_attr) not in result_attr: result_attr.append(str(run_attr)) return "".join(result_attr) elif is_instance_of(obj, TextObject): result_attr = [] for run_obj in obj.runs: run_attr = self.get_target_attr(run_obj, nest_attr) if run_attr and str(run_attr) not in result_attr: result_attr.append(str(run_attr)) else: return self.get_target_attr(obj, nest_attr) @staticmethod def get_target_attr(obj, nest_attr): nest_attrs = nest_attr.split('.') attr_str = nest_attrs.pop(0) base_attr = getattr(obj, attr_str, None) while base_attr and nest_attrs: attr_str = nest_attrs.pop(0) base_attr = getattr(base_attr, attr_str, None) return base_attr # @staticmethod # def merge_cells_to_row(cell_list): # row_obj = RowObject() # # if cell_list: # row_obj.cells = cell_list # row_obj.coordinate = cell_list[0].coordinate # # 对cell_obj的layout.parent_ref进行判断,有值在进行赋值 # if cell_list[0].layout.parent_ref: # row_obj.layout = cell_list[0].layout.parent_ref.layout # row_obj.style = cell_list[0].layout.parent_ref.style # row_obj.border = cell_list[0].layout.parent_ref.border # row_obj.row_index = cell_list[0].row_index # row_obj.data_id = cell_list[0].data_id # return row_obj @staticmethod def align_table_col(base_table, target_table): base_max_col_count = max([len(row.cells) for row in base_table.rows]) target_max_col_count = max([len(row.cells) for row in target_table.rows]) for base_row in base_table.rows: if len(base_row.cells) != base_max_col_count: # 匹配行的列数不一致,补齐缺失的cell add_col_count = abs(len(base_row.cells) - base_max_col_count) base_row.cells.extend([CellObject() for _ in range(add_col_count)]) for target_row in target_table.rows: if len(target_row.cells) != target_max_col_count: # 匹配行的列数不一致,补齐缺失的cell add_col_count = abs(len(target_row.cells) - target_max_col_count) target_row.cells.extend([CellObject() for _ in range(add_col_count)]) def cell_update_after(self, update_cells): """ 单元格的变更后处理, 
只有在content和merged_ranges都一样的情况下才过滤重复项 :return: """ if not update_cells: return update_cells result = [] custom_cells_merged_ranges = list() seen_cells = defaultdict(list) def normalize_content(cell): """标准化单元格内容用于比较""" if not cell: return "" # 获取文本内容并标准化 content_text = str(cell.text) if hasattr(cell, 'text') else "" # 标准化换行符 normalized = content_text.strip().replace('\r\n', '\n').replace('\r', '\n') return normalized def get_cell_key(item): """生成用于比较的键""" old_cell = getattr(item, 'old', None) new_cell = getattr(item, 'new', None) # 获取内容键 old_content = normalize_content(old_cell) new_content = normalize_content(new_cell) content_key = f"{old_content}|{new_content}" # 获取合并范围键 old_range = tuple(old_cell.merged_ranges) if old_cell and hasattr(old_cell, 'merged_ranges') and old_cell.merged_ranges else () new_range = tuple(new_cell.merged_ranges) if new_cell and hasattr(new_cell, 'merged_ranges') and new_cell.merged_ranges else () range_key = f"{old_range}|{new_range}" return f"{content_key}||{range_key}" def get_is_custom_cell(cell_obj): for c_obj in cell_obj.get_heads(): if c_obj.text == settings.SPECIAL_CELL_CONTENT3: return True for item in update_cells: # 如果不是单元格更新或者没有old/new对象,直接添加到结果中 if (item.type != 'update' or item.data_type != 'table' or item.sub_type != 'cell' or (not item.old or not item.old.merged_ranges) and (not item.new or not item.new.merged_ranges)): result.append(item) continue current_old_range = getattr(item.old, 'merged_ranges', []) if item.old else [] current_new_range = getattr(item.new, 'merged_ranges', []) if item.new else [] # 特殊定制的表格累加处理 if item.old and get_is_custom_cell(item.old): if current_old_range not in custom_cells_merged_ranges: custom_cells_merged_ranges.add(current_old_range) result.append(item) else: existing_idx = custom_cells_merged_ranges.index(current_old_range) result[existing_idx].old.text += item.old.text result[existing_idx].old.content.extend(item.old.content) elif item.new and get_is_custom_cell(item.new): if 
current_new_range not in custom_cells_merged_ranges: custom_cells_merged_ranges.append(current_new_range) result.append(item) else: existing_idx = custom_cells_merged_ranges.index(current_new_range) result[existing_idx].new.text += item.new.text result[existing_idx].new.content.extend(item.new.content) # 检查是否只有单侧有合并范围, 如果只有单侧有合并范围,则不视为重复 # elif len(current_old_range) <4 or len(current_new_range)<4: # result.append(item) else: # 处理普通单元格 - 进行去重检查 # 生成用于比较的键 cell_key = get_cell_key(item) # 检查是否已经存在相同的键 is_duplicate = False for existing_idx in seen_cells[cell_key]: existing_item = result[existing_idx] # 获取当前和已存在项目的合并范围 existing_old_range = getattr(existing_item.old, 'merged_ranges', []) if existing_item.old else [] existing_new_range = getattr(existing_item.new, 'merged_ranges', []) if existing_item.new else [] # 只有当merged_ranges完全相同时才认为是重复 if (current_old_range == existing_old_range and current_new_range == existing_new_range): is_duplicate = True break if not is_duplicate: seen_cells[cell_key].append(len(result)) result.append(item) # 如果是重复项,则忽略(不添加到结果中) return result def row_del_add_after(self, part, category='add'): """ 根据 category 参数处理新增或删除的行对象,判断行中的单元格是否有 merged_ranges 属性。 如果行中的任意一个单元格没有 merged_ranges 属性,则添加到结果列表中。 同时过滤具有相同 merged_ranges 的重复行对象,仅保留第一个出现的行。 注意:对于两列(行)中至少共享一个合并单元格,同时两列(行)内容完全相同,依然有可能会被误删除 解决方案:需要解析提供所有的单元格范围,之后综合计算整列(行)的范围进行判断, 若整列(行)都是因合并单元格而造成的冗余则进行过滤,否则(如只共享一(多)个合并单元格)则保留 :param part: 行对象列表 :param category: 操作类型,'add' 或 'delete' :return: 处理后的结果列表 """ # 使用列表来保存拼接后的列表 result = [] if not part: return part if category not in ('add', 'delete'): return part # 根据 category 决定处理新增还是删除的行对象 merged_rows = [] for row in part: # 检查是否是行对象;(会有PictureObject和GraphicObject)如不是则直接加入结果中 if category == 'add' and not is_instance_of(row.new, RowObject): result.append(row) continue if category == 'delete' and not is_instance_of(row.old, RowObject): result.append(row) continue # 获取要检查的单元格列表 cells = row.new.cells if category == 'add' else row.old.cells # 检查行中的每个单元格是否有 
merged_ranges 属性 has_merged_ranges = any(hasattr(cell, 'merged_ranges') for cell in cells) # 如果行中的任意一个单元格没有 merged_ranges 属性,则添加到结果中 if not has_merged_ranges: result.append(row) else: merged_rows.append(row) # 处理具有 merged_ranges 的行,过滤重复项 if merged_rows: seen_contents = defaultdict(list) def remove_timestamp(text): return re.sub(r'\d{4}[-/]\d{2}[-/]\d{2}.*?(?=\t|\n|$)', '', text) for index, row in enumerate(merged_rows): # 获取当前行的内容 content = getattr(row, 'new_content' if category == 'add' else 'old_content', None) if content: #确保在处理可能包含非UTF-8编码字符的文本时不会出现解码错误 if isinstance(content, bytes): content = content.decode('utf-8', errors='replace') elif not isinstance(content, str): content = str(content) # 标准化 content cleaned_content = remove_timestamp(content) normalized_content = cleaned_content.strip().replace('\r\n', '\n').replace('\r', '\n') seen_contents[normalized_content].append(index) duplicates_row = {item: indices for item, indices in seen_contents.items() if len(indices) > 1} removed_rows_indices = [] for _, indices in duplicates_row.items(): seen_merged_ranges = set() for i in indices: # 获取要检查的单元格列表 cells = merged_rows[i].new.cells if category == 'add' else merged_rows[i].old.cells for cell in cells: if cell.merged_ranges: merged_range_tuple = tuple(cell.merged_ranges) if merged_range_tuple not in seen_merged_ranges: seen_merged_ranges.add(merged_range_tuple) break else: removed_rows_indices.append(i) break # 添加未被移除的行到结果中 for index, row in enumerate(merged_rows): if index not in removed_rows_indices: result.append(row) return result def pre_process_require(self, old_table, new_table): base_resources = [old_table] target_resources = [new_table] changes_dict = {} # 存储变更信息的字典 # 合并表格并编号(0=变更前,1=变更后) for table_idx, table in enumerate(base_resources + target_resources): col_list = table.get_col_list(col_name=settings.SPECIAL_COLUMN) #'要求廃止' # 在循环外部初始化计数器 be_counter = 1 af_counter = 1 if col_list: for row_index, cell in enumerate(col_list): cell_text = getattr(cell, 
'text', '') if cell_text == settings.SPECIAL_CELL_CONTENT2: #'レ' if table_idx < len(base_resources): table_key = f"be_{be_counter:02d}" be_counter += 1 else: table_key = f"af_{af_counter:02d}" af_counter += 1 # 获取列索引 col_index = cell.col_index # 存储变更位置信息 changes_dict[table_key] = (row_index, col_index) if any(changes_dict): # 分离变更前和变更后的数据 be_changes = {k: v for k, v in changes_dict.items() if k.startswith('be_')} af_changes = {k: v for k, v in changes_dict.items() if k.startswith('af_')} # 处理变更前的数据 # 记录需要清理的key(分前后表) be_clear_keys = [] af_clear_keys = [] if be_changes: for table_key, (row_idx, col_idx) in be_changes.items(): next_col_idx = col_idx + 1 if next_col_idx < len(base_resources[0].rows[row_idx].cells): if self._tbl_find_unique(base_resources[0], target_resources[0], row_idx, col_idx + 1): be_clear_keys.append(table_key) # 处理变更后的数据 if af_changes: for table_key, (row_idx, col_idx) in af_changes.items(): next_col_idx = col_idx + 1 if next_col_idx < len(target_resources[0].rows[row_idx].cells): if self._tbl_find_unique(target_resources[0], base_resources[0], row_idx, col_idx + 1): af_clear_keys.append(table_key) # 统一清理(分表操作) for table_key in be_clear_keys: row_idx, col_idx = changes_dict[table_key] # 清理前表(base_resources) for cell in base_resources[0].rows[row_idx].cells: cell.text = '' cell.content = [] cell.style = StyleObject() for table_key in af_clear_keys: row_idx, col_idx = changes_dict[table_key] # 清理后表(target_resources) for cell in target_resources[0].rows[row_idx].cells: cell.text = '' cell.content = [] cell.style = StyleObject() return base_resources[0], target_resources[0] @staticmethod def _tbl_find_unique(base_tbl, target_tbl, row_idx, col_idx): """校验指定单元格是否满足唯一性条件: 1. 在 base_tbl 对应列中不存在相同值 2. 
在 target_tbl 当前列中唯一(排除自己) 返回:是否需要清理(True/False) """ if not target_tbl or not base_tbl: return False target_rows = target_tbl.rows if row_idx >= len(target_rows) or col_idx >= len(target_rows[row_idx].cells): return False compare_text = str(target_rows[row_idx].cells[col_idx].text) # 条件1:检查 base_tbl 对应列是否存在相同值 if base_tbl.rows and col_idx < len(base_tbl.rows[0].cells): for row in base_tbl.rows: if col_idx < len(row.cells) and str(row.cells[col_idx].text) == compare_text: return True # 需要清理 # 条件2:检查 target_tbl 当前列是否有重复(排除自己) for i, row in enumerate(target_rows): if i == row_idx: continue if col_idx < len(row.cells) and str(row.cells[col_idx].text) == compare_text: return True # 需要清理 return False # 无需清理 新增的表格匹配的设置表头这段代码写的忽视了表格结构 2025-12-04 14:48:57 - [MainThread] - ERROR - (compare_parser.py:80) $ compare) ::: compare table is error:list indices must be integers or slices, not str Traceback (most recent call last): File "D:\Venv\RRM_env\lib\site-packages\kotei_omc\compare_parser.py", line 78, in compare plugin_dr.extend_result(plugin_instance.compare_each_block(CustomPreDiffStrategyMiddleware)) File "D:\Venv\RRM_env\lib\site-packages\kotei_omc\comparers\base_comparer.py", line 88, in compare_each_block block_result = self.compare(block_name, ls_base, ls_target, belong_to) File "D:\Venv\RRM_env\lib\site-packages\kotei_omc\comparers\table_comparer.py", line 49, in compare part_delete, part_add, part_update = self.compare_table(block_name, old_table, new_table,belong_to=belong_to) File "D:\Venv\RRM_env\lib\site-packages\kotei_omc\comparers\table_comparer.py", line 197, in compare_table col_matched = CustomTableStrategyMiddleware(self._path_base).match_row(del_cols, add_cols,is_col=True, File "D:\Venv\RRM_env\lib\site-packages\kotei_omc\middlewares\table_middlewares.py", line 605, in match_row return self.strategy.match_rows(base_rows, target_rows, is_col, head_indexes) File "D:\Venv\RRM_env\lib\site-packages\kotei_omc\middlewares\table_strategy.py", line 201, in 
match_rows matched_strategies = self.registry.get_matched_row_strategies(base_rows, table) File "D:\Venv\RRM_env\lib\site-packages\kotei_omc\middlewares\table_strategy.py", line 87, in get_matched_row_strategies if table and condition(table): File "D:\Venv\RRM_env\lib\site-packages\kotei_omc\middlewares\table_middlewares.py", line 862, in _is_header_rules_valid heads = base_table.get_heads() File "D:\Venv\RRM_env\lib\site-packages\kotei_omp\data\table.py", line 322, in get_heads heads.append(self.rows[index].cells) TypeError: list indices must be integers or slices, not str 再观察一下整段代码,重构一下这段设置表头的代码: # === 设置默认表头类型 === DEFAULT_HEADER = 'horizontal' if not hasattr(old_table, 'head_type') or old_table.head_type is None: old_table.head_type = DEFAULT_HEADER if not hasattr(new_table, 'head_type') or new_table.head_type is None: new_table.head_type = DEFAULT_HEADER # === 表头有效性检查函数 === def is_valid_header(row): if not row or not row.cells: return False total_cells = len(row.cells) empty_cells = sum(1 for cell in row.cells if (cell.text is None or str(cell.text).strip() == '') and not cell.content) return (empty_cells / total_cells) < 0.1 # === 关键修复:设置实际表头行 === # 设置默认表头行为第一行 old_table.header_row_idx = 0 new_table.header_row_idx = 0 # 检查并确认表头行 if old_table.head_type == 'horizontal' and old_table.rows: # 检查第一行是否有效 if not is_valid_header(old_table.rows[0]): logger.warning(f"第一行无效表头 in old table {old_table.data_id}") # 尝试查找后续有效行作为表头 for idx in range(1, min(3, len(old_table.rows))): # 最多检查前3行 if is_valid_header(old_table.rows[idx]): old_table.header_row_idx = idx logger.info(f"设置第{idx + 1}行为表头 in old table") break else: old_table.head_type = None # 未找到有效表头 logger.warning(f"未找到有效表头 in old table {old_table.data_id}") # 新表同样处理 if new_table.head_type == 'horizontal' and new_table.rows: if not is_valid_header(new_table.rows[0]): logger.warning(f"第一行无效表头 in new table {new_table.data_id}") for idx in range(1, min(3, len(new_table.rows))): if is_valid_header(new_table.rows[idx]): 
new_table.header_row_idx = idx logger.info(f"设置第{idx + 1}行为表头 in new table") break else: new_table.head_type = None logger.warning(f"未找到有效表头 in new table {new_table.data_id}") # === 设置表头内容 === # 从确定的表头行提取表头内容 if old_table.head_type == 'horizontal' and old_table.rows: header_row = old_table.rows[old_table.header_row_idx] old_table.head_list = [cell.text for cell in header_row.cells] if new_table.head_type == 'horizontal' and new_table.rows: header_row = new_table.rows[new_table.header_row_idx] new_table.head_list = [cell.text for cell in header_row.cells]
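The TypeError in the traceback above means `get_heads` received a string where a row index was expected (for example a header label that leaked into the index list). A minimal defensive sketch of that idea follows; the names `normalize_head_indexes` and `get_heads` here are hypothetical stand-ins, not the library's actual implementation:

```python
def normalize_head_indexes(head_indexes):
    """Keep only values usable as list indices, coercing digit strings to int."""
    normalized = []
    for idx in head_indexes:
        if isinstance(idx, int):
            normalized.append(idx)
        elif isinstance(idx, str) and idx.isdigit():
            # A numeric string such as '1' can still be used as an index
            normalized.append(int(idx))
        # Anything else (e.g. a header label string) is silently skipped
    return normalized


def get_heads(rows, head_indexes):
    """Collect header rows by index, ignoring out-of-range entries."""
    heads = []
    for index in normalize_head_indexes(head_indexes):
        if 0 <= index < len(rows):
            heads.append(rows[index])
    return heads
```

With this guard, a list such as `[0, '1', 'header_label']` no longer raises `TypeError: list indices must be integers or slices, not str`; the label entry is simply dropped.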