Automatically generating Rasa NLU files with doccano

Article link: https://blog.youkuaiyun.com/comscience/article/details/129957034

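This article shows how to annotate data with the open-source tool Doccano and then use a script to convert the annotations and merge them into the nlu.yml file used by Rasa NLU for natural language understanding. The workflow covers importing, annotating, and exporting the data, then converting the JSONL export to nlu.yml.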



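Doccano is an open-source data annotation tool. Once it is installed and running locally or on a server, you can open it in a web browser, import data, and annotate it. We use it to annotate the nlu.yml data for knowledge-graph question answering. Getting from annotated data to a complete nlu.yml file takes several conversion steps, which this article summarizes.
Going from annotation in Doccano to an automatically generated nlu.yml takes the following steps: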

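Split the data to be annotated by intent, and create a separate doccano dataset for each intent;
After annotation, export the data as JSONL, name the files entity_intent.jsonl, relation_intent.jsonl, and attr_intent.jsonl, and place them in the raw_data directory;
Place the raw_data directory next to the script and run the script, which produces a YAML file (test_nlu.yml in the script below) containing the original nlu.yml data with the new examples appended.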

Data annotation

Annotating data with doccano works much like any conventional annotation platform: import the data, define the labels, annotate the data, and export the result.
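To make the export format concrete, the sketch below parses a couple of JSONL lines and extracts the annotated spans. It assumes the sequence-labeling export shape that the conversion script relies on: a `text` field plus a `label` list of `[start, end, label]` triples with exclusive end offsets. The sample sentences and the helper names `read_records` and `labeled_spans` are hypothetical, and the sketch uses the standard `json` module instead of `jsonlines` only to stay dependency-free.

```python
import json

# Hypothetical sample of a doccano export: one JSON object per line.
SAMPLE_JSONL = """\
{"text": "打开客厅的空调", "label": [[2, 4, "object_label"], [5, 7, "object_device"]]}
{"text": "空调是什么", "label": [[0, 2, "object_device"]]}
"""


def read_records(jsonl_text):
    """Parse JSONL text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def labeled_spans(record):
    """Return (label, substring) pairs; end offsets are treated as exclusive."""
    text = record["text"]
    spans = []
    for start, end, label in record["label"]:
        # Reject offsets that fall outside the text before slicing.
        if not 0 <= start < end <= len(text):
            raise ValueError(f"bad span {start}:{end} for text {text!r}")
        spans.append((label, text[start:end]))
    return spans


records = read_records(SAMPLE_JSONL)
for record in records:
    print(labeled_spans(record))
```

Checking the offsets at this stage can catch annotation or export problems before they reach the NLU training data.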

Script processing

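Read the nlu.yml file and convert it to JSON with convert_training_data;
Read the annotated JSONL data, reformat it, and append it to nlu.json;
Convert nlu.json back to YAML (the script writes test_nlu.yml), which completes the conversion.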
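The intermediate JSON step can be pictured with a small in-memory example. The sketch below uses the `rasa_nlu_data` and `common_examples` keys that the script reads; the sample sentence and the `append_example` helper are hypothetical. The duplicate check is an addition of this sketch, not part of the original script, and is included because re-running an append-only script would otherwise add the same examples again.

```python
import json

# Skeleton of the intermediate file; the key names are the ones the script reads.
nlu_data = {"rasa_nlu_data": {"common_examples": []}}

# Hypothetical example in the shape the script appends.
new_example = {
    "text": "打开客厅的空调",
    "intent": "entity_intent",
    "entities": [
        {"start": 5, "end": 7, "value": "空调", "entity": "object_device"},
    ],
}


def append_example(data, example):
    """Append one example unless the same text/intent pair already exists."""
    examples = data["rasa_nlu_data"]["common_examples"]
    seen = {(e["text"], e["intent"]) for e in examples}
    if (example["text"], example["intent"]) not in seen:
        examples.append(example)
    return len(examples)


append_example(nlu_data, new_example)
append_example(nlu_data, new_example)  # second call finds a duplicate and does nothing

# ensure_ascii=False keeps Chinese text readable in the output.
serialized = json.dumps(nlu_data, ensure_ascii=False, indent=2)
print(serialized)
```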

# doccano script
import os
import jsonlines
import json
from rasa.nlu.convert import convert_training_data

# Labels whose entities map to multiple slots and therefore carry a role
has_role = ["object_label", "object_label2", "object_device", "object_device2"]


def nlu_json(input_file, output_file, final_format):
    convert_training_data(data_file=input_file, out_file=output_file, output_format=final_format, language="zh")


def save_json(para_mapper_dict, path):
    # Write the dict to a JSON file as UTF-8
    with open(path, 'w', encoding='utf-8') as json_file:
        json.dump(para_mapper_dict, json_file, ensure_ascii=False)


def add_data(read_paths):
    # Path to the original NLU file
    input_file = '../data/nlu.yml'
    # Intermediate data is stored as JSON
    output_file = 'nlu.json'
    # Convert the NLU file to JSON so it is easy to load
    nlu_json(input_file, output_file, "json")

    # Load the converted data and close the file before it is rewritten
    with open(output_file, 'r', encoding='utf-8') as f:
        tot_data = json.load(f)
    examples = tot_data["rasa_nlu_data"]["common_examples"]
    # Print the number of examples before appending
    print(len(examples))
    # Read the annotated data for each intent
    for read_path in read_paths:
        # The intent name is the file name without its extension
        intent_name = os.path.splitext(os.path.basename(read_path))[0]
        with jsonlines.open(read_path, "r") as rfd:
            # Convert each record to the Rasa JSON format and append it
            for data in rfd:
                text = data["text"]
                tmp_dict = {"text": text, "intent": intent_name, "entities": []}
                for start, end, role in data["label"]:
                    # The entity and the role are not the same thing: the label
                    # is used as the role, and the entity is the label without
                    # its trailing digit ("object_label2" -> "object_label")
                    entity = role
                    if entity and entity[-1].isdecimal():
                        entity = entity[:-1]

                    tmp_entity = {
                        "start": start,
                        "end": end,
                        # Slice the annotated span out of the text
                        "value": text[start:end],
                        "entity": entity,
                    }
                    if entity in has_role:
                        tmp_entity["role"] = role
                    tmp_dict["entities"].append(tmp_entity)
                # Append the converted example
                examples.append(tmp_dict)

    # Print the number of examples after appending
    print(len(examples))
    save_json(tot_data, output_file)
    # Convert the merged JSON back to YAML
    nlu_json(output_file, "test_nlu.yml", "nlu")


if __name__ == '__main__':
    # Pass in the annotated JSONL files exported from doccano, one file per intent
    intent_annodata_path = ["./raw_data/entity_intent.jsonl",
                            "./raw_data/relation_intent.jsonl",
                            "./raw_data/attr_intent.jsonl"]
    add_data(intent_annodata_path)
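The label-to-entity mapping is the part of the script most worth testing on its own, since it does not need Rasa or any files. The sketch below isolates it under a few assumptions: the function names `split_label` and `to_rasa_example` are hypothetical, `HAS_ROLE` lists only the base entity names (after the digit suffix is stripped, only those can match), and a regular expression strips the trailing digits, which touches only the suffix, whereas `str.replace` would remove every occurrence of the digit in the label.

```python
import re

# Base entity names that carry a role (assumed subset of the script's list).
HAS_ROLE = {"object_label", "object_device"}


def split_label(label):
    """Split a doccano label into (entity, role).

    A trailing run of digits is treated as a role suffix, so
    "object_label2" maps to entity "object_label" with role "object_label2".
    Labels outside HAS_ROLE get no role.
    """
    entity = re.sub(r"\d+$", "", label)
    role = label if entity in HAS_ROLE else None
    return entity, role


def to_rasa_example(record, intent):
    """Convert one doccano record into a Rasa-style example dict."""
    text = record["text"]
    entities = []
    for start, end, label in record["label"]:
        entity, role = split_label(label)
        item = {"start": start, "end": end, "value": text[start:end], "entity": entity}
        if role is not None:
            item["role"] = role
        entities.append(item)
    return {"text": text, "intent": intent, "entities": entities}


# Hypothetical record with two entities of the same type in different roles.
record = {
    "text": "把空调换成风扇",
    "label": [[1, 3, "object_device"], [5, 7, "object_device2"]],
}
example = to_rasa_example(record, "relation_intent")
print(example)
```

Keeping the conversion in a pure function like this makes it straightforward to check a few records by hand before merging a full export into the training data.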