The prefix "p" for attribute "p:sessionFactory-ref" associated with an element type "beans:bean" is not bound

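This article describes how to configure a cache manager in a Spring Boot application and integrate Ehcache. It covers a sample XML configuration file, the structure of the Ehcache configuration file, and how to set different caching policies.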

Error message (the leading "Description Resource Path Location Type" is only the column header row copied from the IDE's Problems view):

The prefix "p" for attribute "p:sessionFactory-ref" associated with an element type "beans:bean" is not bound

Fix: declare the p namespace on the root <beans> element of the XML configuration file by adding:

xmlns:p="http://www.springframework.org/schema/p"
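
The error is an XML namespace error rather than a Spring-specific one: any namespace-aware parser rejects an attribute whose prefix has no matching xmlns declaration. The following is a minimal sketch that reproduces this with the JDK's built-in parser only. The class name `UnboundPrefixDemo` and the shortened bean definitions are illustrative assumptions, not part of the original configuration, and the exact wording of the message depends on the JDK and locale.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows that the "prefix is not bound" error
// disappears once xmlns:p is declared on the root element.
class UnboundPrefixDemo {
    static final String BEANS_NS = "xmlns=\"http://www.springframework.org/schema/beans\"";
    static final String P_NS = "xmlns:p=\"http://www.springframework.org/schema/p\"";
    static final String BEAN =
        "<bean id=\"ehcacheCacheManager\" p:cacheManager-ref=\"cacheManager\"/>";

    // Root element without the p namespace declaration (reproduces the error).
    static final String WITHOUT_P = "<beans " + BEANS_NS + ">" + BEAN + "</beans>";
    // Root element with the p namespace declaration (the fix).
    static final String WITH_P = "<beans " + BEANS_NS + " " + P_NS + ">" + BEAN + "</beans>";

    /** Returns the parser's error message, or null if the document parses cleanly. */
    static String parseError(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // prefixes are only checked in namespace-aware mode
        DocumentBuilder builder = factory.newDocumentBuilder();
        builder.setErrorHandler(new DefaultHandler()); // rethrow fatal errors instead of printing them
        try {
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("without xmlns:p -> " + parseError(WITHOUT_P));
        System.out.println("with xmlns:p    -> " + parseError(WITH_P));
    }
}
```

Running it should report a parse error for the first document and none for the second, which is the same effect as adding the declaration to the Spring configuration file.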

 

The complete cache configuration file:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:p="http://www.springframework.org/schema/p"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:cache="http://www.springframework.org/schema/cache"
       xsi:schemaLocation="
           http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
           http://www.springframework.org/schema/cache http://www.springframework.org/schema/cache/spring-cache.xsd">

    <!-- Enable annotation-driven caching, backed by the cache manager below -->
    <cache:annotation-driven cache-manager="ehcacheCacheManager" />

    <!-- Spring CacheManager adapter; p:cacheManager-ref needs the xmlns:p declaration above -->
    <bean id="ehcacheCacheManager" class="org.springframework.cache.ehcache.EhCacheCacheManager"
          p:cacheManager-ref="cacheManager" />

    <!-- Factory that builds the native Ehcache CacheManager from ehcache.xml -->
    <bean id="cacheManager"
          class="org.springframework.cache.ehcache.EhCacheManagerFactoryBean">
        <property name="configLocation" value="config/ehcache.xml" />
    </bean>
</beans>


The corresponding ehcache.xml:

 

 

 

<ehcache xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:noNamespaceSchemaLocation="http://ehcache.org/ehcache.xsd">

    <!-- Directory used when elements overflow to disk -->
    <diskStore path="java.io.tmpdir"/>

    <!-- Note: with eternal="true", elements never expire, so the
         timeToIdleSeconds/timeToLiveSeconds values below are ignored. -->
    <defaultCache
            maxElementsInMemory="10000"
            eternal="true"
            timeToIdleSeconds="120"
            timeToLiveSeconds="120"
            overflowToDisk="true"
            maxElementsOnDisk="10000000"
            diskPersistent="false"
            diskExpiryThreadIntervalSeconds="120"
            memoryStoreEvictionPolicy="LRU"
            />

    <!-- Named cache, referenced from code by the name "paymentCache" -->
    <cache name="paymentCache"
            maxElementsInMemory="10000"
            eternal="true"
            overflowToDisk="true"
            maxElementsOnDisk="10000000"
            diskPersistent="false"
            diskExpiryThreadIntervalSeconds="120"
            memoryStoreEvictionPolicy="LRU">
    </cache>
</ehcache>
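
Both caches above use memoryStoreEvictionPolicy="LRU" together with maxElementsInMemory: once the in-memory limit is reached, the least recently used element is the first to leave memory. The sketch below illustrates only that idea, using the JDK's `LinkedHashMap`. It is not how Ehcache is implemented, the class name `LruSketch` is an assumption made for this example, and unlike the configuration above (overflowToDisk="true") it simply drops the evicted entry instead of moving it to disk.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of LRU eviction with a fixed in-memory limit.
class LruSketch<K, V> extends LinkedHashMap<K, V> {
    private final int maxElementsInMemory;

    LruSketch(int maxElementsInMemory) {
        // accessOrder = true: entries are ordered from least to most recently accessed
        super(16, 0.75f, true);
        this.maxElementsInMemory = maxElementsInMemory;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each put; evict the least recently used entry when over the limit
        return size() > maxElementsInMemory;
    }

    public static void main(String[] args) {
        LruSketch<String, String> cache = new LruSketch<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // "a" becomes the most recently used entry
        cache.put("c", "3"); // limit exceeded, so "b" (least recently used) is evicted
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```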

 

 

 

 

 
