Lossless table extraction from PDF files (PDF to Excel), without OCR

liuyouzhang

Published 2024-12-19 16:36:08


Tags: pdf, excel

Article link: https://blog.youkuaiyun.com/liuyouzhang89/article/details/144586393


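A non-OCR approach, based on Java:

Aspose, version 21.11 (cracking methods can be found online, or see another article of mine)

Convert the PDF (tables included) to an Excel file, then use POI to fine-tune the resulting Excel file.

However, this approach cannot handle PDF tables that contain many horizontally and vertically merged cells, such as the complex PDF table shown in the figure below.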

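I found a method by an expert on GitHub that uses Python to recognize the table data: it detects the borders and the cells, then reconstructs the original table content, including the merged-cell information. As a side effect, the table's styling, especially column widths and row heights, cannot fully match the original table; the focus here is on the cells and the data they contain.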

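The recognition algorithm first restores the original cells, as they were before any merging. The recognized result is illustrated in the image below.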
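The step of restoring the unmerged grid can be sketched in plain Python. This is only a minimal sketch under stated assumptions, not the code from the linked issue: it assumes the table's ruling lines have already been reduced to lists of x and y coordinates, the helper names `cluster_coords` and `base_cells` are hypothetical, and the 5-unit tolerance simply mirrors the default of `almost_equals` in the listing further down.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def cluster_coords(values: List[float], tolerance: float = 5.0) -> List[float]:
    """Keep one coordinate per group of values closer than `tolerance`
    to the previously kept coordinate (hypothetical helper)."""
    clustered: List[float] = []
    for value in sorted(values):
        if not clustered or value - clustered[-1] >= tolerance:
            clustered.append(value)
    return clustered


def base_cells(xs: List[float], ys: List[float]) -> List[List[BBox]]:
    """Build the unmerged grid: one bounding box per (row, column) slot."""
    xs, ys = cluster_coords(xs), cluster_coords(ys)
    return [
        [(xs[c], ys[r], xs[c + 1], ys[r + 1]) for c in range(len(xs) - 1)]
        for r in range(len(ys) - 1)
    ]


# Slightly noisy line coordinates (10 vs 11, 50 vs 51) collapse to a 2 x 2 grid.
grid = base_cells(xs=[10, 11, 60, 110], ys=[20, 50, 51, 80])
print(len(grid), len(grid[0]))  # 2 2
print(grid[0][0])               # (10, 20, 60, 50)
```

Merged cells then correspond to groups of adjacent grid slots that have no border between them.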

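Finally, the merged cells are constructed. The example here outputs an image; producing Excel or another tabular data format requires further processing, and no concrete implementation of that is given here. Even so, being able to fully recover the cells and data of the original complex table is already very impressive.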
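As one possible starting point for the missing Excel step, a recovered cell can be mapped to an A1-style range, which is the form spreadsheet libraries generally accept when merging cells. This is a sketch under assumptions, not part of the referenced code: each recovered cell is assumed to be a bounding box `(x0, top, x1, bottom)` whose edges lie exactly on the sorted grid-line coordinates `xs` and `ys`, and the names `column_letter` and `merge_range` are hypothetical.

```python
from bisect import bisect_left
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def column_letter(index: int) -> str:
    """Convert a 0-based column index to Excel letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1
    while index:
        index, rem = divmod(index - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def merge_range(cell: BBox, xs: List[float], ys: List[float]) -> str:
    """Return the A1-style range that `cell` covers on the grid lines xs/ys.

    Assumes the cell edges coincide with entries of the sorted lists xs and ys.
    """
    x0, top, x1, bottom = cell
    first_col, last_col = bisect_left(xs, x0), bisect_left(xs, x1) - 1
    first_row, last_row = bisect_left(ys, top), bisect_left(ys, bottom) - 1
    start = f"{column_letter(first_col)}{first_row + 1}"
    end = f"{column_letter(last_col)}{last_row + 1}"
    return start if start == end else f"{start}:{end}"


xs, ys = [10, 60, 110], [20, 50, 80]
print(merge_range((10, 20, 110, 50), xs, ys))  # A1:B1 (top row merged across two columns)
print(merge_range((10, 50, 60, 80), xs, ys))   # A2 (ordinary, unmerged cell)
```

In practice the coordinates would first need snapping to the grid lines with a tolerance, since detected borders rarely line up exactly.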

Reference for the merged-cell recognition algorithm, implemented in Python, with image output:

Handling merged cells (possible solution) · Issue #84 · jsvine/pdfplumber · GitHub

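Illustration of the recognition algorithm:

The characters below are used to mark the borders of the current cell, its four corners, and information about the neighboring cells (in which direction to extend in order to reach the next cell). Defining this data structure is how the table's structure is described and stored.

Illustration of the different data defined above: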
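To make the corner-marking idea concrete, here is a minimal sketch of how the direction flags at each border crossing could be derived from line segments. It is an illustration under assumptions rather than the author's implementation: segments are assumed to be axis-aligned with exact coordinates, y is assumed to grow downward, the name `corner_flags` is hypothetical, and the flag order `(up, down, right, left)` is chosen to match the `direction_table` keys in the listing below.

```python
from typing import Dict, List, Tuple

HLine = Tuple[float, float, float]  # (x0, x1, y): horizontal segment
VLine = Tuple[float, float, float]  # (x, top, bottom): vertical segment
Flags = Tuple[bool, ...]            # (up, down, right, left)


def corner_flags(h_lines: List[HLine], v_lines: List[VLine]) -> Dict[Tuple[float, float], Flags]:
    """For every crossing of a horizontal and a vertical segment, record
    in which directions a border continues from that point."""
    flags: Dict[Tuple[float, float], Flags] = {}
    for x0, x1, y in h_lines:
        for x, top, bottom in v_lines:
            if x0 <= x <= x1 and top <= y <= bottom:
                old = flags.get((x, y), (False, False, False, False))
                new = (top < y, y < bottom, x < x1, x0 < x)
                # Combine with flags from other segments meeting at this point.
                flags[(x, y)] = tuple(a or b for a, b in zip(old, new))
    return flags


# Two-row table whose top row is one merged cell: the middle vertical
# border exists only in the bottom row (y from 20 to 40).
h_lines = [(0, 100, 0), (0, 100, 20), (0, 100, 40)]
v_lines = [(0, 0, 40), (100, 0, 40), (50, 20, 40)]
flags = corner_flags(h_lines, v_lines)

print(flags[(0, 0)])      # (False, True, True, False): top-left corner
print(flags[(50, 20)])    # (False, True, True, True): no border going up
print((50, 0) in flags)   # False: nothing splits the merged top cell
```

The point at `(50, 20)` has no upward border, which is exactly the signal that the cell above it spans both columns.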

Main code:

https://github.com/shuratn/py_pdf_stm/blob/master/TableExtractor.py

import math
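from operator import itemgetter
from typing import List  # used by the type hints below; imported explicitly in case the star import does not provide it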

import pdfplumber
from PIL import ImageDraw, ImageFont, Image
from pdfplumber.table import TableFinder

from DataSheetParsers.DataSheet import *


def almost_equals(num1, num2, precision=5.0):
    return abs(num1 - num2) < precision


class Point:
    r = 4
    hr = r / 2
    tail = 5

    def __init__(self, *xy):
        if len(xy) == 1:
            xy = xy[0]
        self.x, self.y = xy
        self.x = math.ceil(self.x)
        self.y = math.ceil(self.y)
        self.down = False
        self.up = False
        self.left = False
        self.right = False

    @property
    def symbol(self):
        # Keys are (up, down, right, left) flags, matching the lookup at the end of this property.
        direction_table = {
            (False, False, False, False): '◦',

            (True, False, False, False): '↑',
            (False, True, False, False): '↓',
            (True, True, False, False): '↕',

            (True, True, True, False): '⊢',
            (True, True, False, True): '⊣',

            (False, False, True, False): '→',
            (False, False, False, True): '←',
            (False, False, True, True): '↔',

            (True, False, True, True): '⊥',
            (False, True, True, True): '⊤',

            (True, True, True, True): '╋',

            (True, False, True, False): '┗',
            (True, False, False, True): '┛',

            (False, True, True, False): '┏',
            (False, True, False, True): '┓',

        }
        return direction_table[(self.up, self.down, self.right, self.left)]

    def __repr__(self):
        return "Point<X:{} Y:{}>".format(self.x, self.y)

    def distance(self, other: 'Point'):
        return math.sqrt(((self.x - other.x) ** 2) + ((self.y - other.y) ** 2))

    @property
    def as_tuple(self):
        return self.x, self.y

    def draw(self, canvas: ImageDraw.ImageDraw, color='red'):
        canvas.ellipse((self.x - self.hr, self.y - self.hr, self.x + self.hr, self.y + self.hr), fill=color)
        if self.down:
            canvas.line(((self.x, self.y), (self.x, self.y + self.tail)), 'blue')
        if self.up:
            canvas.line(((self.x, self.y), (self.x, self.y - self.tail)), 'blue')
        if self.left:
            canvas.line(((self.x, self.y), (self.x - self.tail, self.y)), 'blue')
        if self.right:
            canvas.line(((self.x, self.y), (self.x + self.tail, self.y)), 'blue')

    def points_to_right(self, other_points: List['Point']):
        sorted_other_points = sorted(other_points, key=lambda other: other.x)
        filtered_other_points = filter(lambda o: almost_equals(o.y, self.y) and o != self and o.x > self.x,
                                       sorted_other_points)
        return list(filtered_other_points)

    def points_below(self, other_points: List['Point']):
        sorted_other_points = sorted(other_points, key=lambda other: other.y)
        filtered_other_points = filter(lambda o: almost_equals(o.x, self.x) and o != self and o.y > self.y,
                                       sorted_other_points)
        return list(filtered_other_points)

    def on_same_line(self, other: 'Point'):
        if self == other:
            return False
        if almost_equals(self.x, other.x) or almost_equals(self.y, other.y):
            return True
        return False

    def is_above(self, other: 'Point'):
        return self.y < other.y

    def is_to_right(self, other: 'Point'):
        return self.x > other.x

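    def is_below(self, other: 'Point'):
        # The listing is cut off here; this body is assumed by symmetry with is_above.
        return self.y > other.y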
