Converting a collection of .txt and .ann file pairs into a JSON/JSONL dataset with Python

Article link: https://blog.youkuaiyun.com/2201_75499442/article/details/139340060

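I. Background

The dataset we crawled consists of abstracts from a range of medical publications, stored as .txt files, and our annotations are stored as .ann files. Neither format can be fed directly into model training. When we surveyed open-source and closed-source models, we also found that their training and test sets are generally distributed as json/jsonl. We therefore need code that walks through our collection of .txt/.ann pairs and converts the existing dataset into a form better suited to large-model training, namely a json/jsonl dataset. The target data format is shown in the figure below:

The figure above shows the dataset layout used for fine-tuning the open-source Llama 3 model: JSON records with the fields "instruction", "input", and "output".

In addition, the SFT tuning method offered on the Baidu Qianfan platform requires data in Prompt+Response format, which can be uploaded directly as a jsonl file.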
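To make the two target formats concrete, the minimal sketch below builds one record of each kind. The field names follow the description above; the sample text and the annotation line are hypothetical placeholders (the annotation assumes a brat-style standoff line), not data from our dataset. The exact schema a platform accepts should be checked against its own documentation.

```python
import json

# Hypothetical placeholder contents; real values come from a .txt/.ann pair.
sample_txt = "Diabetes is a chronic disease."
sample_ann = "T1\tDisease 0 8\tDiabetes"  # assumed brat-style annotation line

# Record shape for the instruction/input/output JSON dataset
# (one element of a JSON array).
llama_record = {"instruction": sample_txt, "input": "", "output": sample_ann}

# Record shape for Prompt+Response data (one line of a JSONL file).
sft_record = {"prompt": sample_txt, "response": sample_ann}

# A .json file holds a single array; a .jsonl file holds one object per line.
json_text = json.dumps([llama_record], ensure_ascii=False, indent=4)
jsonl_line = json.dumps(sft_record, ensure_ascii=False)

print(json_text)
print(jsonl_line)
```

The practical difference is that a JSON file must be parsed as a whole, while a JSONL file can be read and written one record at a time.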

II. Implementation

Here is the data2json code:

import os
import json


def convert_files_to_json(folder_path, output_file):
    data = []

    # List every file in the folder
    files = os.listdir(folder_path)

    # Collect the .txt files and the .ann files so they can be paired up
    txt_files = [f for f in files if f.endswith('.txt')]
    ann_files = [f for f in files if f.endswith('.ann')]

    for txt_file in txt_files:
        base_name = os.path.splitext(txt_file)[0]
        ann_file = base_name + '.ann'

        # Only process a .txt file that has a matching .ann file
        if ann_file in ann_files:
            # Read the contents of the .txt file
            with open(os.path.join(folder_path, txt_file), 'r', encoding='utf-8') as file:
                instruction = file.read()

            # Read the contents of the .ann file
            with open(os.path.join(folder_path, ann_file), 'r', encoding='utf-8') as file:
                output = file.read()

            # Build one JSON record
            json_data = {
                "instruction": instruction,
                "input": "",
                "output": output
            }
            data.append(json_data)

    # Write all records to the output JSON file
    with open(output_file, 'w', encoding='utf-8') as outfile:
        json.dump(data, outfile, ensure_ascii=False, indent=4)


# Usage example
folder_path = r'C:\Users\11746\Desktop\PubMed\dataset'  # replace with your folder path
output_file = 'output.json'  # name of the output JSON file
convert_files_to_json(folder_path, output_file)

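The data2jsonl code differs from the code above in only two places: the block that builds each record and the block that writes the data out. The text read from the .txt file becomes the prompt, and the text read from the .ann file becomes the response. The modified blocks are:

            # Build one JSON record (instruction and output hold the .txt and .ann contents read above)
            json_data = {
                "prompt": instruction,
                "response": output
            }
            data.append(json_data)

    # Write all records to the output jsonl file, one JSON object per line
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for entry in data:
            json.dump(entry, outfile, ensure_ascii=False)
            outfile.write('\n')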

Finally, remember to change the extension of the output file name to .jsonl.
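Put together, the JSONL variant could look like the sketch below. The function name convert_files_to_jsonl is our own label for illustration, and unlike the script above it writes each record as soon as it is read instead of collecting everything in a list first. The demo at the end runs on a throwaway folder with made-up file contents.

```python
import json
import os
import tempfile


def convert_files_to_jsonl(folder_path, output_file):
    """Write one {"prompt", "response"} object per line for each .txt/.ann pair."""
    files = sorted(os.listdir(folder_path))  # sorted for a reproducible order
    ann_files = {f for f in files if f.endswith('.ann')}
    count = 0
    with open(output_file, 'w', encoding='utf-8') as outfile:
        for txt_file in files:
            if not txt_file.endswith('.txt'):
                continue
            ann_file = os.path.splitext(txt_file)[0] + '.ann'
            if ann_file not in ann_files:
                continue  # skip a .txt file that has no annotation file
            with open(os.path.join(folder_path, txt_file), 'r', encoding='utf-8') as f:
                prompt = f.read()
            with open(os.path.join(folder_path, ann_file), 'r', encoding='utf-8') as f:
                response = f.read()
            record = {"prompt": prompt, "response": response}
            outfile.write(json.dumps(record, ensure_ascii=False))
            outfile.write('\n')
            count += 1
    return count


# Demo: two paired files plus one .txt file without an annotation.
with tempfile.TemporaryDirectory() as tmp:
    samples = [
        ("a.txt", "abstract A"),
        ("a.ann", "T1\tTerm 0 8\tabstract"),
        ("b.txt", "no annotation for this one"),
    ]
    for name, text in samples:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(text)
    out_path = os.path.join(tmp, "output.jsonl")
    written = convert_files_to_jsonl(tmp, out_path)
    with open(out_path, 'r', encoding='utf-8') as f:
        lines = f.read().splitlines()

print(written, lines)
```

Writing line by line keeps memory use flat for large folders, which is one of the usual reasons to prefer JSONL over a single JSON array.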

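III. Results

The resulting output matches the reference format shown at the beginning of the article: the contents of our existing .txt and .ann files are placed into the corresponding fields of the json file, as shown in the figures below.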

Figure: output.json

Figure: output.jsonl
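The output files can also be checked programmatically by loading them back and inspecting the keys of each record. The sketch below is one way to do that; load_json and load_jsonl are hypothetical helper names, and the demo writes small stand-in files with made-up content so that it runs on its own.

```python
import json
import os
import tempfile


def load_json(path):
    """Load a JSON file that holds a list of records."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


def load_jsonl(path):
    """Load a JSONL file: one JSON object per non-empty line."""
    with open(path, 'r', encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in files with made-up content; point the paths at real outputs instead.
with tempfile.TemporaryDirectory() as tmp:
    json_path = os.path.join(tmp, 'output.json')
    jsonl_path = os.path.join(tmp, 'output.jsonl')
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump([{"instruction": "text", "input": "", "output": "ann"}],
                  f, ensure_ascii=False, indent=4)
    with open(jsonl_path, 'w', encoding='utf-8') as f:
        f.write(json.dumps({"prompt": "text", "response": "ann"}, ensure_ascii=False) + '\n')

    json_records = load_json(json_path)
    jsonl_records = load_jsonl(jsonl_path)

# Every record should carry exactly the expected keys.
json_ok = all(set(r) == {"instruction", "input", "output"} for r in json_records)
jsonl_ok = all(set(r) == {"prompt", "response"} for r in jsonl_records)
print(len(json_records), json_ok, len(jsonl_records), jsonl_ok)
```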

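IV. Code Walkthrough

Taking the data2json code as an example, this Python code merges every .txt file in a folder with its corresponding .ann file, converts each pair into a JSON record, and saves all records to a single JSON file. It works as follows: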

1. Import the required modules:

os: handles files and directory paths.

json: reads and writes JSON data.

import os
import json

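2. Define the function `convert_files_to_json`:

It takes two parameters: folder_path, the path of the folder containing the .txt and .ann files, and output_file, the path of the output JSON file.

3. Initialize the empty list `data`: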

data = []

4. List all files in the folder:

    # List every file in the folder
    files = os.listdir(folder_path)

5. Filter the `.txt` files and the `.ann` files:

    # Collect the .txt files and the .ann files so they can be paired up
    txt_files = [f for f in files if f.endswith('.txt')]
    ann_files = [f for f in files if f.endswith('.ann')]

6. Process each `.txt` file:

base_name = os.path.splitext(txt_file)[0]: gets the file's base name (without the extension).

ann_file = base_name + '.ann': builds the name of the corresponding .ann file.

    for txt_file in txt_files:
        base_name = os.path.splitext(txt_file)[0]
        ann_file = base_name + '.ann'
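As a quick illustration of why os.path.splitext suits this step, the sketch below compares it with a naive split on the first dot. The file names are hypothetical examples, not files from our dataset.

```python
import os

# Hypothetical file names; the second one contains an extra dot.
names = ["12345.txt", "abstract.v2.txt"]

for name in names:
    base_name = os.path.splitext(name)[0]  # strips only the final extension
    naive_base = name.split('.')[0]        # cuts at the first dot instead
    print(name, '->', base_name + '.ann', '| naive:', naive_base + '.ann')

# The .ann names that the pairing logic would look for.
paired = [os.path.splitext(n)[0] + '.ann' for n in names]
```

For a name with more than one dot, only the splitext version keeps the full base name, so the matching .ann file is still found.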

7. Check whether the corresponding `.ann` file exists:

if ann_file in ann_files:
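This check silently skips any .txt file that has no matching .ann file. If it is useful to know which files were skipped, something like the following sketch could be run beforehand. find_unpaired is a hypothetical helper, not part of the original script, and the file names in the demo are made up.

```python
import os


def find_unpaired(files):
    """Return (.txt files lacking an .ann, .ann files lacking a .txt), both sorted."""
    txt_bases = {os.path.splitext(f)[0] for f in files if f.endswith('.txt')}
    ann_bases = {os.path.splitext(f)[0] for f in files if f.endswith('.ann')}
    missing_ann = sorted(b + '.txt' for b in txt_bases - ann_bases)
    missing_txt = sorted(b + '.ann' for b in ann_bases - txt_bases)
    return missing_ann, missing_txt


# Made-up directory listing: 2.txt has no annotation, 3.ann has no text.
demo_files = ['1.txt', '1.ann', '2.txt', '3.ann', 'notes.md']
missing_ann, missing_txt = find_unpaired(demo_files)
print(missing_ann)
print(missing_txt)
```

On a real folder the call would be find_unpaired(os.listdir(folder_path)). Using sets also makes each membership test constant-time, which matters only for very large folders.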

8. Read the file contents:

instruction: the contents of the .txt file.

output: the contents of the .ann file.
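This step corresponds to the two with open(...) blocks in the full listing. As a sketch, they could be folded into a single helper. read_pair and its strip option are our own additions for illustration, and the demo files are made-up stand-ins. The original script keeps the text exactly as read, so strip defaults to False here.

```python
import os
import tempfile


def read_pair(folder_path, base_name, strip=False):
    """Read <base_name>.txt and <base_name>.ann from folder_path as UTF-8 text."""
    contents = []
    for ext in ('.txt', '.ann'):
        path = os.path.join(folder_path, base_name + ext)
        with open(path, 'r', encoding='utf-8') as file:
            text = file.read()
        # Optionally drop leading/trailing whitespace such as a final newline.
        contents.append(text.strip() if strip else text)
    instruction, output = contents
    return instruction, output


# Demo on a throwaway folder with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'demo.txt'), 'w', encoding='utf-8') as f:
        f.write('Some abstract text.\n')
    with open(os.path.join(tmp, 'demo.ann'), 'w', encoding='utf-8') as f:
        f.write('T1\tTerm 5 13\tabstract\n')
    raw = read_pair(tmp, 'demo')
    stripped = read_pair(tmp, 'demo', strip=True)

print(raw)
print(stripped)
```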

9. Build the JSON record:

Build a dictionary with the fields instruction, input, and output, and append it to the data list. The input field is set to an empty string here.

            # Build one JSON record
            json_data = {
                "instruction": instruction,
                "input": "",
                "output": output
            }
            data.append(json_data)

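10. Write the data to the output JSON file:

with open(output_file, 'w', encoding='utf-8') as outfile: opens the output file for writing, creating it if it does not exist and overwriting it if it does.

json.dump(data, outfile, ensure_ascii=False, indent=4): writes the data list to the file as JSON. ensure_ascii=False writes non-ASCII characters as they are instead of as \uXXXX escape sequences, and indent=4 pretty-prints the output with four-space indentation.

    # Write all records to the output JSON file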
    with open(output_file, 'w', encoding='utf-8') as outfile:
        json.dump(data, outfile, ensure_ascii=False, indent=4)
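The effect of the two keyword arguments is easy to see with json.dumps, which takes the same options as json.dump but returns a string. The record below is a made-up example containing Chinese text.

```python
import json

# Made-up record with non-ASCII content.
record = {"instruction": "糖尿病", "input": "", "output": "T1"}

escaped = json.dumps(record)                                   # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)              # characters written as-is
pretty = json.dumps([record], ensure_ascii=False, indent=4)    # multi-line, indented

print(escaped)
print(readable)
print(pretty)

# Both encodings decode to the same data; only the file's readability differs.
same_data = json.loads(escaped) == json.loads(readable)
```

Because ensure_ascii=False puts the raw characters into the file, opening the file with encoding='utf-8', as the script does, is what keeps them intact on disk.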