基于PPStructureV2实现pdf的表格识别

WIZERS

已于 2024-11-20 20:43:38 修改

阅读量5.5k

点赞数 41

文章标签： pdf python 算法

于 2024-01-30 17:22:31 首次发布

本文链接：https://blog.youkuaiyun.com/WIZERS/article/details/135620552

版权

本文探讨了PPStructureV2，由百度PaddleOCR团队开发的表格识别工具，对比了其与pdfplumber的性能。虽然PPStructureV2在复杂表格识别上表现出色，但对小型表格识别有误，且速度较慢。pdfplumber速度快但存在识别误差。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

基于PPStructureV2识别PDF中表格

前言

PPStructureV2由百度的PaddleOCR团队开发，获得了当前文档/图片中表格识别的最佳性能(State-of-the-art, SOTA)。github项目地址，论文地址。

本文是从应用角度出发，所以不会讲太多原理上的东西。

一、需解决的问题

定位文档中表格所在区域，然后将表格保存为图片
识别表格的内容

我这边需要表格的图片作为输出，如果你仅需要识别表格内容，那么你的需求可以将定位表格和内容识别合并为一个需求，下面将会有方案。

二、实现步骤

1. 抛砖引玉——pdfplumber

pdfplumber是目前一般项目和实验中最常用的pdf识别表格的算法模块，现在先讲解它来对比PPStructureV2的强大。pdfplumber的优点是使用简单，识别速度快，接口成熟稳定，安装方式为：

pip install pdfplumber

下面给出一个简单的实现提取文件表格的代码，一个extract_table就可以获取页面中表格的内容，返回值是二维数组的list类型，可以很方便地使用dataframe转为csv

import pdfplumber

pdf_path = "files/1.pdf"

with pdfplumber.open(pdf_path) as pdf:
    for i in range(len(pdf.pages)):
        page = pdf.pages[i]
        tables = page.extract_table()

但是pdfplumber的缺点也比较多：

pdfplumber是按行读取的，所以不支持多行单元格识别
对于多表头等复杂表格支持度不够
存在错字、漏字现象

2.PP-StructureV2

与pdfplumber是一个python工具库不同，PP-StructureV2依托于paddlepaddle这个飞浆平台的框架（与pytorch，tensorflow一个级别），并且使用模型进行推理得到文档内容识别结果。所以PP-StructureV2的适用性更强，可支持多表头、合并单元格等特殊情况，但相应的也会带来学习成本高，文档处理时间偏长，部分表格识别不够准确的问题。
首先配置环境可以看我这篇文章，踩了很多坑~~~

https://blog.youkuaiyun.com/WIZERS/article/details/135936636?spm=1001.2014.3001.5501

下面是官网的版面分析+表格识别代码：

table_engine = PPStructure(show_log=True)

save_folder = './output'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
# 保存结果到输出目录的一个文件中，这个文件夹和输入同名，并包含页面识别的表格和html结果
save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)

其实这个库将所有的方法都封装在PPStructure()函数中，通过控制不同的参数实现不同的功能。
PPStructure的输入是RGB格式的图片，输出结果是一个列表，每个列表元素都是一个字典，以识别的表格为例：
[[{‘type’: ‘table’, ‘bbox’: […], ‘img’: array([[[252, 252, 2…ype=uint8), ‘res’: {…}, ‘img_idx’: 0}]]
所以如果要处理pdf文件，官网给出了接口，参考版面恢复，
这边给出我使用官方的pdf接口实现代码：

import subprocess
import time

# 定义参数字典
args = {
    '--image_dir': 'files.PDF',
    '--det_model_dir': '/root/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer', 
    '--rec_model_dir': '/root/.paddleocr/whl/rec/ch/ch_PP-OCRv4_rec_infer',
    '--rec_char_dict_path': '/workspace/PPStructure/PaddleOCR-release-2.6/ppocr/utils/ppocr_keys_v1.txt',
    '--table_model_dir': '/root/.paddleocr/whl/table/ch_ppstructure_mobile_v2.0_SLANet_infer',
    '--table_char_dict_path': '/workspace/PPStructure/PaddleOCR-release-2.6/ppocr/utils/dict/table_structure_dict_ch.txt',
    '--layout_model_dir': '/root/.paddleocr/whl/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer',
    '--layout_dict_path': '/workspace/PPStructure/PaddleOCR-release-2.6/ppocr/utils/dict/layout_dict/layout_cdla_dict.txt',
    '--vis_font_path': 'font/simfang.ttf',
    '--recovery': 'True',
    '--output': '/workspace/PPStructure/output'
}

def run_predict_system(args_dict):
    # 将参数字典中的键值对转换为命令行参数列表
    command = ['python3', '/workspace/PPStructure/PaddleOCR-release-2.6/ppstructure/predict_system.py']
    for key, value in args_dict.items():
        if not key.startswith('--'):
            raise ValueError(f"Invalid parameter '{key}' without '--' prefix")
        command.extend([key, str(value)])

    # 执行命令
    subprocess.run(command, check=True)

start_time = time.time()

# 调用函数并传入参数字典
run_predict_system(args)

# 输出运行时间，时-分-秒
print(time.strftime("%H:%M:%S", time.gmtime(time.time() - start_time)))

其中参数的含义如图：
在这里插入图片描述
模型下载链接
需要注意的是，用官网这个pdf接口进行处理需要拉取官网源码进行一些修改才能运行。
例如，我运行官方接口时遇到的一些问题：

显示没有安装 Polygon包

解决方案是安装时加上版本号：pip install Polygon3

AttributeError: module ‘paddle’ has no attribute ‘fluid’. Did you mean: ‘flip’?

因为paddle2.5后已弃用fluid，修改“PaddleOCR-release-2.6/tools/infer/utility.py”的get_infer_gpuid()函数为：

def get_infer_gpuid():
    sysstr = platform.system()
    if sysstr == "Windows":
        return 0

    if not paddle.device.is_compiled_with_rocm:
        cmd = "env | grep CUDA_VISIBLE_DEVICES"
    else:
        cmd = "env | grep HIP_VISIBLE_DEVICES"
    env_cuda = os.popen(cmd).readlines()
    if len(env_cuda) == 0:
        return 0
    else:
        gpu_id = env_cuda[0].strip().split("=")[1]
        return int(gpu_id[0])

AttributeError: ‘FreeTypeFont’ object has no attribute ‘getsize’

同样问题，因为方法已弃用而官方文档没有及时更新。解决方法安装9.5.0及以下版本即可：pip install Pillow==9.5.0

但是“paddleocr 2.7.0.3 requires Pillow>=10.0.0”这个需要注意

但是这个实现速度实在是太慢了，我识别160页的pdf用了整整30分钟，当然也可能是由于我当时跑代码用的是paddlepaddle的cpu版，但是这个速度也实在不能接受。

3.pdfplumber+ PPStructure

我观察到现有pdf文件中并不是每页都有表格的，而pdfplumber的识别速度要快多了，所以可以先用pdfplumber对每页文档识别是否有表格，如果有表格就剪切该页文档为图片（也可以只剪切表格区域），然后使用PPStructure进行识别内容。由于需求，我需要识别出分页表格，所以其中还加入了对于分页表格的识别，主要思想是如果表格是该页面最后一个表格就加入临时列表，如果该表格是页面第一个且临时列表不空就合并表格。至于判断表格是否是最后一个，可以使用官方的bbox排序函数sorted_layout_boxes，但这里为了简单就用与页边界的相对位置进行判断。

说明：假如pdf的页面bbox为(0, 0, 595.2, 842.04)，代表含义分别是：

左上角的 x 坐标
左上角的 y 坐标
右下角的 x 坐标
右下角的 y 坐标

以下是实现代码：
main.py

import os
from paddleocr import PPStructure, save_structure_res
import pdfplumber
import numpy as np
import time
from table_methods import concatenate_images, concatenate_excel


TYPE = 3  # 1 is table, 2 is det, 3 is det + table
dpi = 300 # PPStructure识别结果需要一定的图片分辨率
file_path = 'files.pdf'
save_folder = 'output'

def processer(TYPE, file_path):
    table_engine = {
        1: PPStructure(layout=False, show_log=True),
        2: PPStructure(table=False, ocr=False, show_log=True),
        3: PPStructure(show_log=True)
    }.get(TYPE)
    if table_engine is None:
        print("TYPE is wrong!")
        exit(0)
    
    with pdfplumber.open(file_path) as pdf:
        temp_table = []
        data = []
        num = 1
        num_pages = len(pdf.pages)
        for page in pdf.pages:
            page_number = page.page_number

            tables = page.find_tables()
            if not tables:
                continue

            img = page.to_image(resolution=dpi)
            img = np.array(img.original)
            result = table_engine(img)
            save_structure_res(result, save_folder, f"page_{page_number}")
            txts = [res.get('res').get('html') for res in result if res.get("type") == 'table']
            data.extend(txts)
			
			# 处理分页表格并保存
            table_index = table_index_in_result(result)
            # footer_top = get_footer_bbox(result)[1]
            for i in range(len(tables)):
                table_object = result[table_index[i]]
                # 根据分辨率进行缩放，因为pdfplumber默认分辨率为72
                table_bbox = [x / (dpi / 72) for x in table_object.get('bbox')]
                table_img = page.crop(table_bbox).to_image(resolution=dpi)
				
                if i == 0:
                    if temp_table:
                        (last_table_img, last_table_bbox) = temp_table.pop()
                        if table_bbox[1] - 75 < page.bbox[1]:
                            table_img = concatenate_images(last_table_img.original, table_img.original)
                            
                            last_table_bbox = [int(x * dpi / 75) for x in last_table_bbox]
                            last_table_path = os.path.join(save_folder, f"page_{page_number - 1}", f"{last_table_bbox}_0.xlsx")
                            now_table_path = os.path.join(save_folder, f"page_{page_number}", f"{table_object.get('bbox')}_0.xlsx")
                            concated_table_path = os.path.join(save_folder, f"page_{page_number}", f"{table_object.get('bbox')}_0.xlsx")
                            concatenate_excel(last_table_path, now_table_path, concated_table_path)
                            # 删除上一个表的excel文件
                        else:
                            last_table_path = os.path.join(save_folder, f"page_{page_number - 1}", f"img_{num}.png")
                            last_table_img.save(last_table_path)
                            num += 1
                            

                if i == (len(tables) - 1) and page_number != num_pages:
                    if table_bbox[3] + 82 > page.bbox[3]:
                        temp_table.append((table_img, table_bbox))
                        continue
                
                table_path = os.path.join(save_folder, f"page_{page_number}", f"img_{num}.png")
                table_img.save(table_path)
                num += 1

        with open("output/result.txt", 'w', encoding='utf-8')as f:
            for line in data:
                f.write(line + '\n')



def table_index_in_result(table_engine_result):
    res = []
    for index, value in enumerate(table_engine_result):
        if value.get("type") == 'table':
            res.append(index)
    return res

if __name__ == '__main__':
    start_time = time.time()
    processer(TYPE, file_path)
    print('used time: ', time.time() - start_time)

table_methods.py

from PIL import Image
import pandas as pd


def concatenate_images(image1, image2, direction='vertical'):
    """
    将两个图片拼接在一起。

    Parameters:
    - image1 (array): 第一张图片的源数据
    - image2 (array): 第二张图片的源数据
    - direction (str): 拼接方向，'horizontal'表示水平拼接，'vertical'表示垂直拼接，默认为水平拼接
    """
    # 打开两张图片
    # image1 = Image.open(image1)
    # image2 = Image.open(image2)

    # 获取图片大小
    width1, height1 = image1.size
    width2, height2 = image2.size

    # 水平拼接
    if direction == 'horizontal':
        new_width = width1 + width2
        new_height = max(height1, height2)
        new_image = Image.new('RGB', (new_width, new_height))
        new_image.paste(image1, (0, 0))
        new_image.paste(image2, (width1, 0))
    # 垂直拼接
    elif direction == 'vertical':
        new_width = max(width1, width2)
        new_height = height1 + height2
        new_image = Image.new('RGB', (new_width, new_height))
        new_image.paste(image1, (0, 0))
        new_image.paste(image2, (0, height1))
    else:
        raise ValueError("Invalid direction. Use 'horizontal' or 'vertical'.")

    # 保存拼接后的图片
    return new_image


def concatenate_excel(file1_path, file2_path, output_path):
    # 读取两个Excel文件
    df1 = pd.read_excel(file1_path)
    df2 = pd.read_excel(file2_path, header=None)

    header = df1.columns.tolist()
    df2.columns = header
    # 拼接两个DataFrame
    result_df = pd.concat([df1, df2], ignore_index=True)

    # 将结果保存到新的Excel文件
    result_df.to_excel(output_path, index=False)


def table_standardization(excel_path):
    df = pd.read_excel(excel_path, header=[0, 1])
    
    cols = df.columns.map(lambda x: ''.join('' if 'Unnamed' in i else i for i in x))
    df.columns = cols
    df.to_excel(excel_path, index=False)



if __name__ == '__main__':
    # image1_path = "output/img_7/img_1.png"
    # image2_path = "output/img_8/img_2.png"
    # output_path = "output/concat.png"

    # img = concatenate_images(image1_path, image2_path, output_path, direction='vertical')
    # img.save(output_path)

    # # 示例：拼接两个Excel文件
    # file1_path = 'output/page_8/[5, 0, 2076, 2770]_0.xlsx'
    # file2_path = 'output/page_8/[5, 2895, 2074, 3551]_0.xlsx'
    # output_path = 'output/output_file.xlsx'

    # concatenate_excel(file1_path, file2_path, output_path)
    file1_path = 'output/多表头示例/[1, 97, 999, 268]_0.xlsx'
    table_standardization(file1_path)

4.效果展示

这是某个表格的输出目录结构：

这是结果中对原图片剪切后：
请添加图片描述
这是识别后excel：
在这里插入图片描述

总结

经过我多次测试，发现PPStructure的效果并没有想象中那么好，它会忽略或误判小型表格（只有两三行的），可能会识别为“figure”或“text”，而且时间性能也很长。但是对于大型表格、多表头、合并单元格的识别确实相当惊艳，并且可以识别双列文档。
与之相比，pdfplumber的处理速度很快，不管对于大型表格还是小型表格都能比较准确的识别出来，并且可以获取其中内容，但是会出现错字、漏字情况（尤其是数组、英文字符），而且由于它是逐行OCR识别，所以无法识别合并单元格、无法处理双列文档
以上就是今天要讲的内容，本文仅仅简单介绍了pdfplumber和PPStructure的使用，本人也是刚刚接触这些，如果有错误和纰漏欢迎指正。