[1309] MinerU, Magic-PDF, Magic-Doc

周小董


Article link: https://blog.youkuaiyun.com/xc_zhou/article/details/142595524


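About MinerU

MinerU is a one-stop, open-source, high-quality data extraction tool. It has two main components:

Magic-PDF: PDF document extraction
Magic-Doc: web page and e-book extraction
GitHub: https://github.com/opendatalab/MinerU/blob/master/README_zh-CN.md

Online demos: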
https://opendatalab.com/OpenSourceTools/Extractor/PDF
https://www.modelscope.cn/studios/OpenDataLab/MinerU

Magic-PDF

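Introduction
Magic-PDF is a tool that converts PDF files to markdown. It can convert local documents or files stored on S3-compatible object storage.

Main features:

Supports input from multiple front-end models
Removes headers, footers, footnotes, page numbers, and similar elements
Outputs text in human reading order
Preserves the structure and formatting of the original document, including headings, paragraphs, and lists
Extracts images and tables and displays them in the markdown
Converts formulas to LaTeX
Automatically detects and converts garbled PDFs
Supports CPU and GPU environments
Supports Windows, Linux, and macOS

Project overview

Flowchart

Submodule repository

PDF-Extract-Kit: https://github.com/opendatalab/PDF-Extract-Kit
A toolkit for high-quality PDF content extraction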

Getting started with Magic-PDF

Quick start on CPU

1. Install magic-pdf

conda create -n MinerU python=3.10
conda activate MinerU
pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i https://pypi.tuna.tsinghua.edu.cn/simple

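We have received several reports of wrong package versions being installed because of mirror and dependency conflicts, so be sure to verify the version after installation with the following command: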

magic-pdf --version

If the version is lower than 0.6.x, please open an issue to report it.
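The same check can be scripted. The following is a minimal sketch, assuming the package is registered under the distribution name magic-pdf (as in the pip command above); the helper names parse_version, is_at_least, and installed_magic_pdf_version are hypothetical and not part of magic-pdf.

```python
from importlib import metadata


def parse_version(text):
    """Turn '0.6.1' into (0, 6, 1); stop at the first non-numeric part."""
    parts = []
    for piece in text.strip().split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, minimum="0.6"):
    """Compare two dotted version strings numerically."""
    return parse_version(installed) >= parse_version(minimum)


def installed_magic_pdf_version():
    """Return the installed version string, or None if magic-pdf is absent."""
    try:
        return metadata.version("magic-pdf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_magic_pdf_version()
    if version is None:
        print("magic-pdf is not installed in this environment")
    elif is_at_least(version):
        print(f"magic-pdf {version} meets the 0.6 minimum")
    else:
        print(f"magic-pdf {version} is older than 0.6; reinstall or open an issue")
```

Comparing integer tuples instead of raw strings avoids ordering mistakes such as "0.10" sorting before "0.6".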

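The full-featured package depends on detectron2, which must be compiled during installation. To build it yourself, see facebookresearch/detectron2#5114,
or use our precompiled whl package directly (Python 3.10 only):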

pip install detectron2 --extra-index-url https://myhloli.github.io/wheels/

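2. Download the model weights

For details, see "How to download model files".
After downloading, move the models directory to an SSD with plenty of free space.

Downloading models from ModelScope

ModelScope supports downloading the models with Git LFS or with its SDK; either one works.

1) Download with Git LFS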

git lfs install
git lfs clone https://www.modelscope.cn/opendatalab/PDF-Extract-Kit.git

2) Download with the SDK

# First install modelscope (shell)
pip install modelscope

# Then download the models with the modelscope SDK (Python)
from modelscope import snapshot_download
model_dir = snapshot_download('opendatalab/PDF-Extract-Kit')
print(f"Model files downloaded to: {model_dir}/models")

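[❗️Required❗️] Additional steps (be sure to complete these after the models are downloaded)

1. Check that the model directory is complete
The model folder is structured as follows and contains the configuration and weight files for each component: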

./
├── Layout  # layout detection model
│   ├── config.json
│   └── model_final.pth
├── MFD  # formula detection model
│   └── weights.pt
├── MFR  # formula recognition model
│   └── UniMERNet
│       ├── config.json
│       ├── preprocessor_config.json
│       ├── pytorch_model.bin
│       ├── README.md
│       ├── tokenizer_config.json
│       └── tokenizer.json
├── TabRec  # table recognition models
│   ├── StructEqTable
│   │   ├── config.json
│   │   ├── generation_config.json
│   │   ├── model.safetensors
│   │   ├── preprocessor_config.json
│   │   ├── special_tokens_map.json
│   │   ├── spiece.model
│   │   ├── tokenizer.json
│   │   └── tokenizer_config.json
│   └── TableMaster
│       ├── ch_PP-OCRv3_det_infer
│       │   ├── inference.pdiparams
│       │   ├── inference.pdiparams.info
│       │   └── inference.pdmodel
│       ├── ch_PP-OCRv3_rec_infer
│       │   ├── inference.pdiparams
│       │   ├── inference.pdiparams.info
│       │   └── inference.pdmodel
│       ├── table_structure_tablemaster_infer
│       │   ├── inference.pdiparams
│       │   ├── inference.pdiparams.info
│       │   └── inference.pdmodel
│       ├── ppocr_keys_v1.txt
│       └── table_master_structure_dict.txt
└── README.md

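2. Check that the model files are complete
Check that the sizes of the model files in the directory match those listed on the web page; if possible, verify the downloads with SHA-256 checksums.

3. Move the models to an SSD
Move the 'models' directory to a location with plenty of disk space, preferably on a solid-state drive (SSD).
Then edit ~/magic-pdf.json so that the model directory points to the final location; otherwise the models will fail to load.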

3. Copy and edit the configuration file

The magic-pdf.template.json file is in the repository root.

cp magic-pdf.template.json ~/magic-pdf.json

In magic-pdf.json, set "models-dir" to the directory containing the model weights.

{
  "models-dir": "/tmp/models"
}

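❗️Be sure to set the [absolute path] of the model weights directory correctly; otherwise the program cannot find the model files and will not run.
On Windows, this path must include the drive letter, and every "\" in it must be replaced with "/"; otherwise the backslashes are treated as escape characters and the JSON file becomes syntactically invalid.

For example, if the models are in the models directory at the root of drive D, the value of "models-dir" should be "D:/models".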

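Accelerating inference with CUDA or MPS

If you have an Nvidia GPU or a Mac with Apple Silicon, you can use CUDA or MPS for acceleration.

CUDA
Install the PyTorch build that matches your CUDA version.
The following command is for CUDA 11.8; for more information see https://pytorch.org/get-started/locally/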

pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118

You also need to change the value of "device-mode" in the magic-pdf.json configuration file:

{
  "device-mode":"cuda"
}

MPS
On macOS (M-series chips), you can use MPS to accelerate inference.
You need to change the value of "device-mode" in the magic-pdf.json configuration file:

{
  "device-mode":"mps"
}
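To decide which value to use on a given machine, you can ask PyTorch what it detects. The following is a minimal sketch; pick_device_mode is a hypothetical helper, and the fallback value "cpu" is an assumption, since this guide only shows "cuda" and "mps" explicitly.

```python
def pick_device_mode():
    """Suggest a "device-mode" value based on what PyTorch can see."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to accelerate with.
        return "cpu"  # assumed fallback value, not shown in the guide
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps is missing on older PyTorch builds.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(f'suggested "device-mode": {pick_device_mode()}')
```

If this prints "cpu" on a machine with an Nvidia GPU, the installed PyTorch build probably does not match the CUDA version.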

Usage

Command line

magic-pdf --help
Usage: magic-pdf [OPTIONS]

Options:
  -v, --version                display the version and exit
  -p, --path PATH              local pdf filepath or directory  [required]
  -o, --output-dir TEXT        output local directory
  -m, --method [ocr|txt|auto]  the method for parsing pdf.
                               ocr: using ocr technique to extract information from pdf,
                               txt: suitable for the text-based pdf only and outperform ocr,
                               auto: automatically choose the best method for parsing pdf
                                  from ocr and txt.
                               without method specified, auto will be used by default.
  --help                       Show this message and exit.


## show version
magic-pdf -v

## command line example
magic-pdf -p {some_pdf} -o {some_output_dir} -m auto

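Here {some_pdf} can be a single PDF file or a directory containing several PDF files.
After the command finishes, the results are saved under {some_output_dir}. The output files are listed below:

├── some_pdf.md                          # markdown file
├── images                               # directory for extracted images
├── some_pdf_layout.pdf                  # layout visualization
├── some_pdf_middle.json                 # MinerU intermediate processing result
├── some_pdf_model.json                  # model inference result
├── some_pdf_origin.pdf                  # original PDF file
└── some_pdf_spans.pdf                   # visualization of the finest-grained bbox positions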
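The command line can also be driven from a script. The following is a minimal sketch that calls the magic-pdf executable through subprocess, using only the -p, -o, and -m options listed in the help output above; build_command and run_magic_pdf are hypothetical wrapper names.

```python
import subprocess

VALID_METHODS = ("ocr", "txt", "auto")


def build_command(pdf_path, output_dir, method="auto"):
    """Build the argument list for one magic-pdf run."""
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}, got {method!r}")
    return ["magic-pdf", "-p", str(pdf_path), "-o", str(output_dir), "-m", method]


def run_magic_pdf(pdf_path, output_dir, method="auto"):
    """Run magic-pdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_command(pdf_path, output_dir, method), check=True)


if __name__ == "__main__":
    # Placeholder paths; replace them with a real PDF and output directory.
    run_magic_pdf("some.pdf", "output", method="auto")
```

Passing the arguments as a list, rather than one shell string, keeps paths with spaces intact.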

API

Processing a file on local disk

import os
import sys
import json

from loguru import logger

from magic_pdf.pipe.UNIPipe import UNIPipe
from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter

import magic_pdf.model as model_config
model_config.__use_inside_model__ = True

try:
    current_script_dir = os.path.dirname(os.path.abspath(__file__))
    demo_name = "demo1"
    pdf_path = os.path.join(current_script_dir, f"{demo_name}.pdf")
    model_path = os.path.join(current_script_dir, f"{demo_name}.json")
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    # To reuse precomputed model output, load it from model_path instead:
    # model_json = json.loads(open(model_path, "r", encoding="utf-8").read())
    model_json = []  # an empty list makes the pipeline use the built-in models
    jso_useful_key = {"_pdf_type": "", "model_list": model_json}
    local_image_dir = os.path.join(current_script_dir, 'images')
    image_dir = str(os.path.basename(local_image_dir))
    image_writer = DiskReaderWriter(local_image_dir)
    pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
    pipe.pipe_classify()
    # If no valid model data was passed in, parse with the built-in models
    if len(model_json) == 0:
        if model_config.__use_inside_model__:
            pipe.pipe_analyze()
        else:
            logger.error("need model list input")
            sys.exit(1)
    pipe.pipe_parse()
    md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
    # The markdown is written to the current working directory
    with open(f"{demo_name}.md", "w", encoding="utf-8") as f:
        f.write(md_content)
except Exception as e:
    logger.exception(e)

Magic-Doc

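GitHub: https://github.com/opendatalab/magic-doc

Introduction

Magic-Doc is a lightweight, open-source tool that converts documents in several formats (PPT/PPTX/DOC/DOCX/PDF) to markdown. It can convert local documents or files stored on AWS S3.

Main features

Web page extraction
- Accurate cross-modal parsing of text, images, tables, and formulas
E-book and literature extraction
- Supports epub, mobi, and other formats, handling both text and images
Language identification
- Accurate identification of 176 languages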

Installation

Prerequisite: Python 3.10
Install dependencies

Linux/macOS

apt-get/yum/brew install libreoffice

Windows

Install LibreOffice
Add "install_dir\LibreOffice\program" to the PATH environment variable
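To confirm that LibreOffice is reachable after installation, you can look for its executable on PATH. The following is a minimal sketch; it assumes the executable is named soffice or libreoffice, which is how LibreOffice is commonly installed, and find_libreoffice is a hypothetical helper name.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the full path of the first LibreOffice executable on PATH, or None."""
    for name in candidates:
        found = shutil.which(name)
        if found:
            return found
    return None


if __name__ == "__main__":
    path = find_libreoffice()
    if path:
        print("LibreOffice found at", path)
    else:
        print("LibreOffice not found on PATH; check the installation steps above")
```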

Install Magic-Doc

pip install fairy-doc[cpu] # install the CPU version
or
pip install fairy-doc[gpu] # install the GPU version

Usage examples

# for local file
from magic_doc.docconv import DocConverter, S3Config
converter = DocConverter(s3_config=None)
markdown_content, time_cost = converter.convert("some_doc.pptx", conv_timeout=300)

# for remote file located in aws s3
from magic_doc.docconv import DocConverter, S3Config

s3_config = S3Config(ak='${ak}', sk='${sk}', endpoint='${endpoint}')
converter = DocConverter(s3_config=s3_config)
markdown_content, time_cost = converter.convert("s3://some_bucket/some_doc.pptx", conv_timeout=300)