Using MinerU with Docker

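1 Introduction

MinerU is a tool that converts PDFs into machine-readable formats such as Markdown and JSON, from which the content can easily be extracted into any other format. The output quality is good, but conversion is somewhat slow.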

# Official website
https://opendatalab.github.io/MinerU/zh/

# GitHub repository
https://github.com/opendatalab/mineru

2 Building the image

The image is built from vllm-openai:v0.10.1.1 using a Dockerfile.

Pull the vllm-openai:v0.10.1.1 image:

docker pull vllm/vllm-openai:v0.10.1.1

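Dockerfile contents (can be downloaded from the official website). Note that the active FROM line points at the DaoCloud mirror; to build on the official image pulled in the previous step, comment that line out and enable the official vllm/vllm-openai:v0.10.1.1 line instead.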

# Use DaoCloud mirrored vllm image for China region for gpu with Ampere architecture and above (Compute Capability>=8.0)
# Compute Capability version query (https://developer.nvidia.com/cuda-gpus)
FROM docker.m.daocloud.io/vllm/vllm-openai:v0.10.1.1

# Use the official vllm image
# FROM vllm/vllm-openai:v0.10.1.1

# Use DaoCloud mirrored vllm image for China region for gpu with Turing architecture and below (Compute Capability<8.0)
# FROM docker.m.daocloud.io/vllm/vllm-openai:v0.10.2

# Use the official vllm image
# FROM vllm/vllm-openai:v0.10.2

# Install libgl for opencv support & Noto fonts for Chinese characters
RUN apt-get update && \
    apt-get install -y \
        fonts-noto-core \
        fonts-noto-cjk \
        fontconfig \
        libgl1 && \
    fc-cache -fv && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Install mineru latest
RUN python3 -m pip install -U 'mineru[core]' -i https://mirrors.aliyun.com/pypi/simple --break-system-packages && \
    python3 -m pip cache purge

# Download models and update the configuration file
RUN /bin/bash -c "mineru-models-download -s modelscope -m all"

# Set the entry point to activate the virtual environment and run the command line tool
ENTRYPOINT ["/bin/bash", "-c", "export MINERU_MODEL_SOURCE=local && exec \"$@\"", "--"]
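
The comments in the Dockerfile tie the base image to the GPU's Compute Capability. As a rough aid for choosing the FROM line, the sketch below maps a Compute Capability value to the image tag named in those comments. The helper names `pick_base_image` and `query_compute_capability` are hypothetical, and the `nvidia-smi --query-gpu=compute_cap` query assumes a reasonably recent NVIDIA driver; on older drivers, look the value up on the NVIDIA page linked in the Dockerfile.

```python
import subprocess


def pick_base_image(compute_capability: float, use_daocloud_mirror: bool = True) -> str:
    """Return the base image suggested by the Dockerfile comments.

    Hypothetical helper: >= 8.0 (Ampere and newer) maps to v0.10.1.1,
    anything lower (Turing and older) maps to v0.10.2.
    """
    tag = "v0.10.1.1" if compute_capability >= 8.0 else "v0.10.2"
    prefix = "docker.m.daocloud.io/" if use_daocloud_mirror else ""
    return f"{prefix}vllm/vllm-openai:{tag}"


def query_compute_capability() -> float:
    """Ask nvidia-smi for the first GPU's compute capability (assumes a recent driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.strip().splitlines()[0])


# Example (requires an NVIDIA driver on the host):
# print("FROM", pick_base_image(query_compute_capability()))
```

Keeping the mapping in a pure function makes it easy to check without a GPU; only `query_compute_capability` touches the hardware.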

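Build the image (the 2.6.4 tag is only a label; the Dockerfile installs whichever mineru release is latest at build time):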

docker build -t mineru:2.6.4 -f Dockerfile .

Create and start the container:

docker run -itd \
--name mineru \
--gpus all \
--shm-size 32g \
-p 30000:30000 \
-p 7860:7860 \
-p 8000:8000 \
--ipc=host \
mineru:2.6.4 \
mineru-vllm-server --port 30000
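
The server may need some time to load the model after the container starts, so a client that connects immediately can fail. A minimal sketch of a readiness check follows, using only the standard library. The function name `wait_until_ready` is hypothetical, and the `/health` path is an assumption: it is the health endpoint of vLLM's OpenAI-compatible server, and this sketch assumes mineru-vllm-server exposes it unchanged. Pass a different `path` if that turns out not to hold.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, path: str = "/health",
                     timeout_s: float = 600.0, interval_s: float = 3.0) -> bool:
    """Poll base_url + path until it answers HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + path
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Not ready yet: refused, reset, timed out, or non-2xx status.
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


# Example, matching the port published above:
# wait_until_ready("http://192.168.0.104:30000")
```

An HTTP check is used rather than a bare TCP connect because a port published with `-p` can accept connections on the host before the process inside the container is listening.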

3 Calling from Python

The Python example on the official website can be used as a starting point.

Python code

import os
from pathlib import Path

from mineru.backend.vlm.vlm_middle_json_mkcontent import union_make
from mineru.backend.vlm.vlm_analyze import doc_analyze

from mineru.utils.enum_class import MakeMode
from mineru.cli.common import read_fn, convert_pdf_bytes_to_bytes_by_pypdfium2, prepare_env
from mineru.data.data_reader_writer import FileBasedDataWriter


def test(pdf_path: Path, pdf_file_name: str, output_dir: Path, server_url: str):
    # Parsing method and backend (http-client talks to the running server)
    parse_method = "vlm"
    backend = "http-client"

    # Read the file as bytes
    pdf_bytes = read_fn(pdf_path)
    pdf_bytes = convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes)

    # Prepare the output directories and writers
    local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name, parse_method)
    image_writer = FileBasedDataWriter(local_image_dir)
    md_writer = FileBasedDataWriter(local_md_dir)

    # Parse the document
    middle_json, infer_result = doc_analyze(pdf_bytes, image_writer=image_writer, backend=backend, server_url=server_url)

    # Print the raw results
    print(middle_json)
    print(infer_result)

    # Extract the parsed page information
    pdf_info = middle_json["pdf_info"]

    # Build the Markdown content
    f_make_md_mode = MakeMode.MM_MD
    image_dir = str(os.path.basename(local_image_dir))
    md_content_str = union_make(pdf_info, f_make_md_mode, image_dir)

    # Print the Markdown text
    print(md_content_str)

    # Save the Markdown file
    md_writer.write_string(
        f"{pdf_file_name}.md",
        md_content_str,
    )


if __name__ == '__main__':
    pdf_input_temp = "E:/test/input/test1.pdf"
    test(pdf_path=Path(pdf_input_temp), pdf_file_name="test1", output_dir=Path("E:/test/output"), server_url="http://192.168.0.104:30000")
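
The script above handles a single file. To process a whole folder, one option is to enumerate the PDFs first and call `test()` once per file. The sketch below shows only the enumeration step; `iter_pdf_jobs` is a hypothetical helper, not part of MinerU, and it uses each file's stem as `pdf_file_name`.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_pdf_jobs(input_dir: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (pdf_path, pdf_file_name) pairs for every PDF directly inside input_dir."""
    for pdf_path in sorted(input_dir.iterdir()):
        # Match the extension case-insensitively so "B.PDF" is picked up too.
        if pdf_path.is_file() and pdf_path.suffix.lower() == ".pdf":
            yield pdf_path, pdf_path.stem


# Usage with the test() function defined above (needs mineru and a running server):
# for pdf_path, name in iter_pdf_jobs(Path("E:/test/input")):
#     test(pdf_path=pdf_path, pdf_file_name=name,
#          output_dir=Path("E:/test/output"),
#          server_url="http://192.168.0.104:30000")
```

Since conversion is slow, running the files one after another keeps the load on the server predictable; subdirectories are deliberately not scanned.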

4 Results

Server output
[Screenshot not preserved]

Python code output

[Screenshot not preserved]
