Uniflow Project Usage Guide

Article link: https://blog.youkuaiyun.com/gitblog_01149/article/details/142808496

uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering: LLM-based text extraction from unstructured data such as PDFs, Word documents, and HTML pages. Transform and cluster the text into your desired format. Less information loss, more interpretation, and faster R&D!

Project repository: https://gitcode.com/gh_mirrors/un/uniflow-llm-based-pdf-extraction-text-cleaning-data-clustering

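1. Project Overview

Uniflow provides a unified, large language model (LLM) based interface for extracting text from unstructured data (such as PDF, Word, and HTML files), cleaning it, and clustering it. It supports several common LLM backends, including OpenAI's GPT models, Google's Gemini models, AWS Bedrock models, and open-source models from Hugging Face.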

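Uniflow addresses two key problems:

- Extracting clean text from PDF and Word files with complex layouts.
- Converting the extracted data into a format suitable for LLM training, including formats that support feedback-based learning techniques.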
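To make the second point more concrete, here is a minimal sketch of what a feedback-style training record could look like once text has been extracted. It assumes a preference-style layout in which each question carries a preferred and a rejected answer; the `PreferenceRecord` class, its field names, and the JSON Lines layout are hypothetical illustrations, not Uniflow's actual output schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class PreferenceRecord:
    """Hypothetical record pairing a question with a preferred and a rejected answer."""
    context: str
    question: str
    chosen: str
    rejected: str


def to_jsonl(records):
    """Serialize records as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


records = [
    PreferenceRecord(
        context="The quick brown fox jumps over the lazy brown dog.",
        question="What does the fox jump over?",
        chosen="The lazy brown dog.",
        rejected="A fence.",
    )
]
jsonl = to_jsonl(records)
print(jsonl)
```

Keeping one JSON object per line makes such a file easy to stream and to append to, which is why this layout is a common choice for training data.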

2. Quick Start

Installation

Create a Conda environment:
```
conda create -n uniflow python=3.10 -y
conda activate uniflow
```

Install PyTorch:

If you use a GPU (this command installs a nightly pre-release build for CUDA 12.1):
```
pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
```

If you use a CPU:
```
pip3 install torch
```
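After installing, it can help to confirm which PyTorch build is active. The following is a small sketch for that check; the helper name `describe_torch` is made up for illustration, and it only relies on `torch.__version__` and `torch.cuda.is_available()`.

```python
import importlib.util


def describe_torch():
    """Return a one-line summary of the local PyTorch install, if any."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the check also runs without PyTorch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    return f"torch {torch.__version__} (device: {device})"


print(describe_torch())
```

If the GPU build was installed but the reported device is `cpu`, the CUDA driver or the selected wheel is the first thing to check.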

Install Uniflow:
```
pip3 install uniflow
```
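Set your OpenAI API key (optional): create a .env file in the project root directory and add the following line, replacing YOUR_API_KEY with your own key:
```
OPENAI_API_KEY=YOUR_API_KEY
```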
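Whether Uniflow reads the .env file on its own is not stated here, so the sketch below shows one way to load it manually and confirm the key is visible to Python. The helper `load_env_file` is hypothetical and uses only the standard library; a package such as python-dotenv is a common alternative.

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env-style file into os.environ.

    Existing environment variables are not overwritten. Returns the
    dict of values parsed from the file.
    """
    parsed = {}
    if not os.path.exists(path):
        return parsed
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.strip()
            # Skip blank lines, comments, and lines without an assignment.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            parsed[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in parsed.items():
        os.environ.setdefault(key, value)
    return parsed


values = load_env_file(".env")
# Report only whether the key is present; never print the key itself.
print("OPENAI_API_KEY set:", "OPENAI_API_KEY" in os.environ)
```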
Install additional dependencies (optional):
- If you use Hugging Face models:
```
pip3 install transformers accelerate bitsandbytes scipy
```
- If you use LMQG models:
```
pip3 install lmqg spacy
```
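Because these dependencies are optional, it is easy to lose track of which ones are present in a given environment. The sketch below checks for the packages named in the install commands using the standard library's `importlib.util.find_spec`; the grouping dictionary and the function name are illustrative choices, not part of Uniflow.

```python
import importlib.util

# Optional packages named in the install commands, grouped by use case.
OPTIONAL_PACKAGES = {
    "huggingface": ["transformers", "accelerate", "bitsandbytes", "scipy"],
    "lmqg": ["lmqg", "spacy"],
}


def missing_packages(group):
    """Return the packages in a group that cannot be found for import."""
    return [
        name
        for name in OPTIONAL_PACKAGES[group]
        if importlib.util.find_spec(name) is None
    ]


for group in OPTIONAL_PACKAGES:
    missing = missing_packages(group)
    status = "ready" if not missing else "missing: " + ", ".join(missing)
    print(f"{group}: {status}")
```

`find_spec` only locates a package without importing it, so the check stays fast even for heavy libraries.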

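Quick Start Example

The following simple example shows how to use Uniflow to generate questions and answers from a piece of text, such as text extracted from a PDF file: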

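```
# Import paths follow the Uniflow README and may differ between versions.
from uniflow.flow.client import TransformClient
from uniflow.flow.config import TransformOpenAIConfig
from uniflow.op.prompt import Context

# Wrap each piece of input text in a Context object.
data = [
    Context(context="The quick brown fox jumps over the lazy brown dog.")
]

# Build a client with the default OpenAI transform configuration
# (assumes the defaults are sufficient and OPENAI_API_KEY is set).
client = TransformClient(TransformOpenAIConfig())

# Run Uniflow on the data and print the generated output.
output = client.run(data)
print(output)
```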