NLP (20): Knowledge Graphs + Entity Extraction

single-life

Last modified 2024-05-29 21:06:42

Category: Notes. Tags: natural language processing, knowledge graph, artificial intelligence

First published 2024-05-28 20:06:55

Article link: https://blog.youkuaiyun.com/njh1147394013/article/details/139276482

Preface

These are just my study notes; questions and discussion are welcome.

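LLM-based question answering for a vertical domain:

Characteristics: the corpus is domain-specific rather than general; precision must be high, while recall can be lower (unanswered questions are handed off to a human agent); the system must be extensible and controllable (the answer to a specific question can be changed on demand); and there must be well-defined evaluation metrics.

Implementation:
Traditional approach:

Knowledge base + text matching (encode the question as a vector, find the most similar question in the knowledge base, and return its answer)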
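
The knowledge base + text matching idea can be sketched in a few lines. This is a minimal illustration, not a production retriever: the `FAQ` entries, the bag-of-words `embed` function (a stand-in for a real sentence encoder), and the `threshold` value are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical knowledge base of (question, answer) pairs.
FAQ = [
    ("how do I reset my password", "Use the 'Forgot password' link on the login page."),
    ("how do I cancel my order", "Open 'My orders' and choose 'Cancel'."),
]


def embed(text):
    """Toy embedding: bag-of-words counts (a stand-in for a real sentence encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def answer(query, threshold=0.6):
    """Return the answer of the most similar stored question, or None to hand off to a human."""
    query_vec = embed(query)
    score, best_answer = max((cosine(query_vec, embed(q)), a) for q, a in FAQ)
    return best_answer if score >= threshold else None
```

Returning `None` below the threshold reflects the requirement stated earlier: it is better to hand a question to a human than to answer it wrongly.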

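LLM-based approaches:

1. Direct generation:

Have the LLM generate the answer directly, and fine-tune it with SFT
Drawbacks: fine-tuning is expensive; the model becomes less general; the data is hard to adjust afterwards

2. RAG (recommended):

Passage retrieval + reading comprehension (put the retrieved information into the prompt, in the hope of getting a better answer)
What gets retrieved is your own vertical-domain content, which is then given to the LLM as context
Drawbacks: demanding on the LLM; limited by the retrieval algorithm (if the passage containing the correct answer is dropped, the LLM cannot recover it); the generated output is hard to control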
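
As a rough sketch of the retrieval-then-prompt flow, the snippet below ranks passages by word overlap and assembles a prompt. The overlap scorer, the prompt wording, and the function names are assumptions for illustration; a real system would use a proper retriever and then send the prompt to an LLM, which is left out here.

```python
def overlap_score(query, passage):
    """Number of distinct words shared by the query and the passage."""
    return len(set(query.lower().split()) & set(passage.lower().split()))


def retrieve(query, passages, top_k=2):
    """Return the top_k passages with the highest word overlap."""
    ranked = sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages, top_k=2):
    """Put the retrieved passages in front of the question as context for the LLM."""
    retrieved = retrieve(query, passages, top_k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieved))
    return (
        "Answer the question using only the passages below. "
        "If they do not contain the answer, say you do not know.\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Anything that `retrieve` leaves out never reaches the prompt, which is exactly the limitation listed in the drawbacks.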

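3. Knowledge hierarchy (graph) based:

Organize the knowledge into a tree, then have the LLM pick the most likely knowledge point from the options at each level, descending one level at a time
Drawbacks: needs a large amount of labeled data, which must be annotated by hand, so annotation is expensive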
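
The level-by-level descent might look like the sketch below. The tree contents are invented, and `choose_by_keyword` is only a stand-in for the LLM call that would pick an option at each level.

```python
# Hypothetical knowledge tree: inner nodes map option names to subtrees, leaves hold answers.
KNOWLEDGE_TREE = {
    "billing": {
        "refund": "Refunds are processed within 7 days.",
        "invoice": "Invoices can be downloaded from the account page.",
    },
    "account": {
        "password": "Use the 'Forgot password' link.",
    },
}


def choose_by_keyword(question, options):
    """Stand-in for the LLM: pick the first option that is mentioned in the question."""
    for option in options:
        if option in question.lower():
            return option
    return None


def descend(question, tree, choose=choose_by_keyword):
    """Walk down the tree one level at a time; return (path taken, answer or None)."""
    node, path = tree, []
    while isinstance(node, dict):
        choice = choose(question, list(node))
        if choice is None:
            return path, None  # no option fits, so stop and hand off
        path.append(choice)
        node = node[choice]
    return path, node
```

Because the model only ever selects among a few listed options, the final answer stays within the curated tree, which is what makes this approach controllable.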

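Knowledge graphs:

A knowledge graph stores and represents knowledge as a graph, and is typically kept in a graph database.
Example: Yao Ming - height - 226 cm (a triple)
How you build the graph depends on the task: if you want the relations between entities, build a graph of entity relations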
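
To make the triple format concrete, here is a tiny in-memory store. It is only an illustration of the (head, relation, tail) structure; the `TripleStore` class is hypothetical and has nothing to do with how a graph database stores data internally.

```python
from collections import defaultdict


class TripleStore:
    """Minimal (head, relation, tail) store indexed by head and relation."""

    def __init__(self):
        self._index = defaultdict(lambda: defaultdict(set))

    def add(self, head, relation, tail):
        self._index[head][relation].add(tail)

    def query(self, head, relation):
        """Return all tails for (head, relation), sorted; empty list if unknown."""
        return sorted(self._index.get(head, {}).get(relation, set()))


store = TripleStore()
store.add("Yao Ming", "height", "226 cm")
store.add("Yao Ming", "played_for", "Houston Rockets")
```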

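The components of a knowledge graph pipeline are as follows:

Entity extraction: an NER task that extracts entities and their attributes
Relation extraction:

Closed domain:
- Feed the text plus the entities into a model that predicts the relation (essentially still a classification task)
- Entity extraction and relation extraction can be trained jointly, with the loss being the sum of the two losses
Open domain:
Based on sequence labeling (as in NER)

Attribute extraction: same as relation extraction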
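
Sequence labeling produces one tag per token, so a decoding step is needed to turn tags into entities. The sketch below assumes the common BIO tagging scheme (the notes do not say which scheme is used) and assumes the tags come from some already trained model.

```python
def decode_bio(tokens, tags):
    """Turn BIO tags into a list of (entity text, entity type) pairs."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()  # a new entity starts, so close any open one
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            current_tokens.append(token)  # continue the open entity
        else:
            flush()  # "O" or an inconsistent "I-" tag ends the open entity
            current_tokens, current_type = [], None
    flush()
    return entities
```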

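Knowledge fusion:

Entity alignment: decide whether entities from different sources are the same by comparing the similarity of their attributes
Entity disambiguation: disambiguate entities using context and semantic relations
Attribute alignment: compute the similarity of attributes and attribute values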
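
One simple way to compare attributes for entity alignment is Jaccard similarity over (attribute, value) pairs. This is a sketch under that assumption; the notes do not specify a similarity measure, and the `threshold` is arbitrary.

```python
def attribute_similarity(record_a, record_b):
    """Jaccard similarity over the (attribute, value) pairs of two entity records."""
    pairs_a, pairs_b = set(record_a.items()), set(record_b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 0.0


def is_same_entity(record_a, record_b, threshold=0.5):
    """Treat two records as the same entity when enough attributes agree."""
    return attribute_similarity(record_a, record_b) >= threshold


source_a = {"name": "Yao Ming", "height": "226 cm", "birthplace": "Shanghai"}
source_b = {"name": "Yao Ming", "height": "226 cm", "team": "Houston Rockets"}
source_c = {"name": "Yao Ming", "occupation": "singer"}
```

Exact matching of values is the weakest part of this sketch: "226 cm" and "2.26 m" would count as different, which is where attribute alignment comes in.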

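Knowledge reasoning: use a model to infer the relation between two entities
Knowledge representation: entities, relations, and attributes are all converted to vectors, and each can be referred to by an id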
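
To illustrate ids, vectors, and relation inference together, the sketch below uses a translation-based scoring idea in the style of TransE (head + relation ≈ tail). The notes do not name a model, so TransE is a choice made for this example, and the vectors are hand-picked toy values rather than learned embeddings.

```python
import math

# Every entity and relation gets an integer id, which indexes its vector.
ENTITY_IDS = {"Yao Ming": 0, "Shanghai": 1, "Houston Rockets": 2}
RELATION_IDS = {"born_in": 0, "played_for": 1}

# Hand-picked toy vectors; a real system would learn these from the graph.
ENTITY_VECS = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
RELATION_VECS = [[1.0, 0.0], [0.0, 1.0]]


def transe_distance(head, relation, tail):
    """Distance between (head + relation) and tail; smaller means more plausible."""
    h = ENTITY_VECS[ENTITY_IDS[head]]
    r = RELATION_VECS[RELATION_IDS[relation]]
    t = ENTITY_VECS[ENTITY_IDS[tail]]
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def predict_relation(head, tail):
    """Infer the relation whose vector best translates head onto tail."""
    return min(RELATION_IDS, key=lambda rel: transe_distance(head, rel, tail))
```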

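Using a graph database:

Neo4j
Use NL2SQL techniques to turn the input text into a database query (SQL in general; Cypher in the case of Neo4j)

Method 1: templates + text matching. Match the input text to a question template, then map that template to its query (depends on templates, easy to extend, allows complex queries)
Method 2: semantic parsing. Train several models to produce the query (hard to extend)
Method 3: have an LLM write the query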
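
Method 1 can be sketched in miniature as follows. The two templates, the `$name` query parameter, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions for illustration, and the entity is taken as already extracted.

```python
from difflib import SequenceMatcher

# Hypothetical templates: a question pattern and the Cypher query it maps to.
TEMPLATES = [
    ("how tall is %ENT%", "MATCH (n {name: $name}) RETURN n.height"),
    ("where was %ENT% born", "MATCH (n {name: $name}) RETURN n.birthplace"),
]


def to_query(question, entity):
    """Pick the template whose filled-in question is most similar to the input."""

    def similarity(pattern):
        filled = pattern.replace("%ENT%", entity).lower()
        return SequenceMatcher(None, question.lower(), filled).ratio()

    pattern, cypher = max(TEMPLATES, key=lambda pair: similarity(pair[0]))
    # The entity is passed as a query parameter rather than pasted into the query text.
    return cypher, {"name": entity}
```

With py2neo, the returned pair could presumably be executed as `graph.run(cypher, params)`. Passing the entity as a parameter avoids having to escape quotes in entity names.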

Code

Building knowledge graph question answering on Neo4j:
This uses Method 1, which relies on question templates.

import re
import json
import pandas
import itertools
from py2neo import Graph

"""
Knowledge graph question answering built on neo4j.
Requires user-defined question templates.
"""


class GraphQA:
    def __init__(self):
        # Start the neo4j server first, e.g. with `neo4j console`
        self.graph = Graph("http://localhost:7474", auth=("neo4j", "password"))
        schema_path = "kg_schema.json"
        templet_path = "question_templet.xlsx"
        self.load(schema_path, templet_path)
        print("Knowledge graph QA system loaded!\n===============")

    # Public question-answering interface
    def query(self, sentence):
        print("============")
        print(sentence)
        # Extract the entities, relations, labels and attributes mentioned in the sentence
        info = self.parse_sentence(sentence)
        print("info:", info)
        # Expand the templates with the extracted info and rank them by similarity to the sentence
        templet_cypher_score = self.cypher_match(sentence, info)
        for templet, cypher, score, answer in templet_cypher_score:
            graph_search_result = self.graph.run(cypher).data()
            # The best-scoring template may have no answer in the graph; in that case run the
            # next query, and stop at the first one that returns a result
            if graph_search_result:
                answer = self.parse_result(graph_search_result, answer)
                return answer
        return None

    # Load the graph schema and the question templates
    def load(self, schema_path, templet_path):
        self.load_kg_schema(schema_path)
        self.load_question_templet(templet_path)
        return

    # Load the question templates from the Excel sheet
    def load_question_templet(self, templet_path):
        dataframe = pandas.read_excel(templet_path)
        self.question_templet = []
        for index in range(len(dataframe)):
            question = dataframe["question"][index]
            cypher = dataframe["cypher"][index]
            cypher_check = dataframe["check"][index]
            answer = dataframe["answer"][index]
            self.question_templet.append([question, cypher, json.loads(cypher_check), answer])
        return

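    # Return the entities, relations, labels and attributes mentioned in the input
    def parse_sentence(self, sentence):
        # First extract the entities, relations, labels and attributes
        entitys = self.get_mention_entitys(sentence)
        relations = self.get_mention_relations(sentence)
        labels = self.get_mention_labels(sentence)
        attributes = self.get_mention_attributes(sentence)
        # Key them by placeholder so they can be matched against the templates later
        return {
            "%ENT%": entitys,
            "%REL%": relations,
            "%LAB%": labels,
            "%ATT%": attributes,
        }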

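    # Find every term of `vocab` that occurs in the sentence. Terms are escaped so that regex
    # metacharacters match literally, longer terms are tried first so a short term cannot
    # shadow a longer one, and an empty vocabulary yields no matches.
    @staticmethod
    def find_mentions(vocab, sentence):
        if not vocab:
            return []
        pattern = "|".join(re.escape(term) for term in sorted(vocab, key=len, reverse=True))
        return re.findall(pattern, sentence)

    # Entities mentioned in the question; a vocabulary lookup is used here, an NER model would also work
    def get_mention_entitys(self, sentence):
        return self.find_mentions(self.entity_set, sentence)

    # Relations mentioned in the question; a text classification model would also work
    def get_mention_relations(self, sentence):
        return self.find_mentions(self.relation_set, sentence)

    # Attributes mentioned in the question
    def get_mention_attributes(self, sentence):
        return self.find_mentions(self.attribute_set, sentence)

    # Labels mentioned in the question
    def get_mention_labels(self, sentence):
        return self.find_mentions(self.label_set, sentence)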

    # Load the graph schema (relation, entity, label and attribute vocabularies)
    def load_kg_schema(self, path):
        with open(path, encoding="utf8") as f:
            schema = json.load(f)
        self.relation_set = set(schema["relations"])
        self.entity_set = set(schema["entitys"])
        self.label_set = set(schema["labels"])
        self.attribute_set = set(schema["attributes"])
        return

    # Match the sentence against the template questions
    def cypher_match(self, sentence, info):
        # Expand the templates into concrete candidate questions using the extracted info
        templet_cypher_pair = self.expand_question_and_cypher(info)
        result = []
        for templet, cypher, answer in templet_cypher_pair:
            # Similarity between the input sentence and the candidate question
            score = self.sentence_similarity_function(sentence, templet)
            # print(sentence, templet, score)
            result.append([templet, cypher, score, answer])
        # Sort by similarity, best match first
        result = sorted(result, reverse=True, key=lambda x: x[2])
        return result

    # Expand the templates into concrete candidate questions using the extracted info
    def expand_question_and_cypher(self, info):
        templet_cypher_pair = []
        # Go through every loaded template
        for templet, cypher, cypher_check, answer in self.question_templet:
            # Skip templates whose slot requirements the extracted info cannot satisfy
            cypher_check_result = self.match_cypher_check(cypher_check, info)
            if cypher_check_result:
                templet_cypher_pair += self.expand_templet(templet, cypher, cypher_check, info, answer)
        return templet_cypher_pair

    # Pre-check that cuts down the number of comparisons: each slot the template
    # requires must have at least that many extracted values
    def match_cypher_check(self, cypher_check, info):
        for key, required_count in cypher_check.items():
            if len(info.get(key, [])) < required_count:
                return False
        return True

    # For a single template, expand it with the extracted entities, relations and attributes into a list
    # info: {"%ENT%": ["周杰伦", "方文山"], "%REL%": ["作曲"]}  (two person entities and the relation "composed")
    def expand_templet(self, templet, cypher, cypher_check, info, answer):
        # Get every combination of slot values
        combinations = self.get_combinations(cypher_check, info)
        templet_cypher_pair = []
        for combination in combinations:
            # Replace the entity, attribute and relation placeholders in the template
            replaced_templet = self.replace_token_in_string(templet, combination)
            replaced_cypher = self.replace_token_in_string(cypher, combination)
            replaced_answer = self.replace_token_in_string(answer, combination)
            templet_cypher_pair.append([replaced_templet, replaced_cypher, replaced_answer])
        return templet_cypher_pair

    # When more values were found than the template needs, enumerate the possible combinations
    # info: {"%ENT%": ["周杰伦", "方文山"], "%REL%": ["作曲"]}
    def get_combinations(self, cypher_check, info):
        slot_values = []
        for key, required_count in cypher_check.items():
            # All ways of choosing required_count values for this slot
            slot_values.append(itertools.combinations(info[key], required_count))
        # Cartesian product across the slots
        value_combinations = itertools.product(*slot_values)
        combinations = []
        for value_combination in value_combinations:
            combinations.append(self.decode_value_combination(value_combination, cypher_check))
        return combinations

    # Assign the extracted values to their placeholder keys
    def decode_value_combination(self, value_combination, cypher_check):
        res = {}
        for index, (key, required_count) in enumerate(cypher_check.items
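
The listing breaks off at this point, and several helpers it calls (`decode_value_combination`, `replace_token_in_string`, `sentence_similarity_function`) are not shown in full. The standalone functions below are one guess at what they might look like, not the author's code. In particular, the numbered placeholder scheme (`%ENT0%`, `%ENT1%`) for slots that need more than one value and the character-level Jaccard similarity are assumptions.

```python
def decode_value_combination(value_combination, cypher_check):
    """Assign chosen values to placeholder keys.

    Assumed scheme: a slot needing one value keeps its key ("%REL%");
    a slot needing several gets numbered keys ("%ENT0%", "%ENT1%", ...).
    """
    res = {}
    for index, (key, required_count) in enumerate(cypher_check.items()):
        values = value_combination[index]
        if required_count == 1:
            res[key] = values[0]
        else:
            for i in range(required_count):
                res[key[:-1] + str(i) + "%"] = values[i]
    return res


def replace_token_in_string(string, combination):
    """Substitute every placeholder in the string with its assigned value."""
    for key, value in combination.items():
        string = string.replace(key, value)
    return string


def jaccard_similarity(sentence_a, sentence_b):
    """Character-level Jaccard similarity, one possible sentence_similarity_function."""
    chars_a, chars_b = set(sentence_a), set(sentence_b)
    union = chars_a | chars_b
    return len(chars_a & chars_b) / len(union) if union else 0.0
```

Character-level overlap needs no word segmentation, which is convenient for Chinese questions, but any sentence similarity measure could be used in its place.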
