I. Python script for converting the data to JSON
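For the details and the JSON format, see the CSDN blog post by our group member Mao Delin (毛德霖):
Defining the Output Format for Medical Literature Information Extraction: JSON and Tabular Modes (CSDN blog)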
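We made some changes to his script, however. The original script produces the extraction result for a single article, to be used as training data for GPTs, and does not support the batch dataset we have already built. We therefore changed the file input and output so that the script can process pairs of an article .txt file and its annotation .ann file. The specific changes are in the file-reading function in the code below.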
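As a rough illustration of what processing .txt/.ann pairs in batch involves, here is a minimal sketch. It is not the file-reading function from the script below. It assumes that every article and its annotation file sit in the same directory and share a base name (for example `a.txt` and `a.ann`), and the helper names `find_annotation_pairs` and `read_pair` are hypothetical.

```python
from pathlib import Path


def find_annotation_pairs(data_dir):
    """Return (txt_path, ann_path) pairs that share a base name, sorted by file name.

    Assumption: both files of a pair live directly in data_dir.
    """
    data_dir = Path(data_dir)
    pairs = []
    for txt_path in sorted(data_dir.glob("*.txt")):
        ann_path = txt_path.with_suffix(".ann")
        if ann_path.is_file():  # skip articles that have no annotation file
            pairs.append((txt_path, ann_path))
    return pairs


def read_pair(txt_path, ann_path, encoding="utf-8"):
    """Read one article and its annotations and return them as two strings."""
    article = Path(txt_path).read_text(encoding=encoding)
    ann = Path(ann_path).read_text(encoding=encoding)
    return article, ann
```

Each `(article, ann)` result could then be passed to `generate_training_data_json(article, ann)`, one pair at a time.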
import json
import os


def generate_training_data_json(article, ann):
    # Extract entities from ann file (lines that start with "T")
    entities = {}
    for line in ann.strip().split("\n"):
        if line.startswith("T"):
            parts = line.split("\t")
            entity_id = parts[0]
            entity_info = parts[1]
            entity_text = parts[2]
            # entity_info is "label start end"; for a discontinuous span
            # ("label 0 5;10 15") keep the first start and the last end
            label, start, *_, end = entity_info.split(" ")
            entities[entity_id] = {
                "label": label.lower(),
                "start": int(start),
                "end": int(end),
                "text": entity_text,
            }

    # Extract events from ann file (lines that start with "E")
    events = {}
    for line in ann.strip().split("\n"):
        if line.startswith("E"):
            parts = line.split("\t")
            event_id = parts[0]
            event_info = parts[1]
            # The first "label:id" item is the main entity, the rest are related entities
            main_entity, *related_entities = event_info.split(" ")
            main_entity_id = main_entity.split(":")[1]
            event_data = {"main": entities[main_entity_id], "related": []}
            for related_entity in related_entities:
                if ":" in related_entity:  # Check if the related_entity contains a colon
                    label, entity_id = related_entity.split(":")
                    event_data["related"].append(
                        {"label": label.lower(), "entity": entities[entity_id]}
                    )
            events[event_id] = event_data

    # Generate fixed data
    fixed_data = {
        "total-participants": "",
        "intervention-participants": "",
        "control-participants": "",
        "age": [],
        "intervention-age": "",
        "control-age": "",
"eligibility