Building a Knowledge Graph of Jin Yong's Family from Text

Latest recommended article published on 2025-12-07 19:40:17

Original


Licensed under CC 4.0 BY-SA

Article tags:

#KnowledgeGraph #ArtificialIntelligence

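This article builds a family knowledge graph from a textual description of Jin Yong's family. The input text is:

"金庸的姐夫是钱学森，钱学森的妻子是蒋英，金庸的表哥是徐志摩，金庸的堂姐是琼瑶，金庸有一个哥哥，名叫查良钊。金庸有个儿子是查传侠"

(In English: "Jin Yong's sister's husband is Qian Xuesen, Qian Xuesen's wife is Jiang Ying, Jin Yong's older male cousin is Xu Zhimo, Jin Yong's older female cousin is Qiong Yao, Jin Yong has an older brother named Zha Liangzhao. Jin Yong has a son, Zha Chuanxia.")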

Ontology model: kinship relations between people, mapping each Chinese kinship term to a relation type (

"表哥": "brother_low",
"姐夫": "sister_husband",
"妻子": "wife",
"堂姐":"cousin",
"父亲": "father",
"母亲": "mother",
"哥哥": "brother",
"姐姐": "sister",
"弟弟": "brother",
"妹妹": "sister",
"儿子": "son",
"女儿": "daughter")
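To make the role of the ontology concrete, here is a minimal rule-based sketch of how the mapping above could turn the sample text into (head, relation, tail) triples. The function name `extract_triples` and the two regular-expression patterns are assumptions made for illustration; they cover only the sentence shapes that appear in the sample text and are not taken from the original article.

```python
import re

# Kinship ontology copied from the article: Chinese term -> relation type.
RELATION_TYPES = {
    "表哥": "brother_low",
    "姐夫": "sister_husband",
    "妻子": "wife",
    "堂姐": "cousin",
    "父亲": "father",
    "母亲": "mother",
    "哥哥": "brother",
    "姐姐": "sister",
    "弟弟": "brother",
    "妹妹": "sister",
    "儿子": "son",
    "女儿": "daughter",
}

_REL = "|".join(map(re.escape, RELATION_TYPES))

# Hypothetical pattern for "A的R是B", e.g. "金庸的姐夫是钱学森".
_PATTERN_IS = re.compile(rf"([^，。的]+)的({_REL})是([^，。]+)")
# Hypothetical pattern for "A有一个R，名叫B" or "A有个R是B".
_PATTERN_HAS = re.compile(rf"([^，。的有]+)有一?个({_REL})(?:，名叫|是)([^，。]+)")


def extract_triples(text):
    """Return (head, relation_type, tail) triples in order of appearance."""
    matches = list(_PATTERN_IS.finditer(text)) + list(_PATTERN_HAS.finditer(text))
    matches.sort(key=lambda m: m.start())
    return [(m.group(1), RELATION_TYPES[m.group(2)], m.group(3)) for m in matches]


if __name__ == "__main__":
    sample = (
        "金庸的姐夫是钱学森，钱学森的妻子是蒋英，金庸的表哥是徐志摩，"
        "金庸的堂姐是琼瑶，金庸有一个哥哥，名叫查良钊。金庸有个儿子是查传侠"
    )
    for triple in extract_triples(sample):
        print(triple)
```

Hand-written patterns like these are brittle; in a fuller pipeline the person names would come from the NER model rather than from the regular expression, with the ontology used only to normalize the relation word between two recognized names.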

Step 1: Install the required libraries

Make sure the required libraries are installed:

pip install torch transformers py2neo

Step 2: Load the pretrained model

Load a Hugging Face BERT model for named entity recognition:

from transformers import BertTokenizer, BertForTokenClassification
import torch
from py2neo import Graph, Node, Relationship

# Load the tokenizer and the NER model
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForTokenClassification.from_pretrained('ckiplab/bert-base-chinese-ner')

Step 3: Preprocess the text

Prepare the text about Jin Yong's family and tokenize it:

# Text preprocessing
text = "金庸的姐夫是钱学森，钱学森的妻子是蒋英，金庸的表哥是徐志摩，金庸的堂姐是琼瑶，金庸有一个哥哥，名叫查良钊。金庸有个儿子是查传侠"
inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True)

Step 4: Entity recognition

Use the model to predict the entities in the text:

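# Entity recognition (inference only, so gradient tracking is disabled)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2).squeeze(0)

# Decode the predictions
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
entities = []
current_entity = []
current_label = None

# Walk over each token and its predicted label
for token, prediction in zip(tokens, predictions):
    label = model.config.id2label[prediction.item()]

    if label.startswith("B-"):  # start of an entity
        if current_entity:  # store any entity still in progress first
            entities.append((current_entity, current_label))
        current_entity = [token]  # start a new entity
        current_label = label[2:]  # strip "B-" and keep the entity type
    elif label.startswith("I-") or label.startswith("E-"):  # inside or end of an entity
        current_entity.append(token)  # add the token to the current entity
        if label.startswith("E-"):  # end of the entity: store it
            entities.append((current_entity, current_label))
            # The source breaks off in the middle of the line above; the reset
            # below is an assumed completion so the entity is not stored twice.
            current_entity = []
            current_label = None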
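The decoding loop in this step can also be written as a standalone function that works on plain lists of tokens and labels, which makes it easy to check without downloading the model. The following is a minimal sketch, assuming a B/I/E/S tagging scheme with `O` for non-entity tokens; the function name `decode_entities` and the handling of single-token `S-` labels are assumptions that do not appear in the original code.

```python
def decode_entities(tokens, labels):
    """Group (token, label) pairs tagged B-/I-/E-/S-/O into (text, type) spans."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity in progress (if any) and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append(("".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        if token in ("[CLS]", "[SEP]", "[PAD]"):  # skip BERT special tokens
            flush()
            continue
        prefix, _, entity_type = label.partition("-")
        if prefix == "S":  # single-token entity
            flush()
            entities.append((token, entity_type))
        elif prefix == "B":  # start of a multi-token entity
            flush()
            current_tokens, current_type = [token], entity_type
        elif prefix in ("I", "E") and current_tokens and entity_type == current_type:
            current_tokens.append(token)
            if prefix == "E":  # explicit end marker closes the entity
                flush()
        else:  # "O" or an inconsistent tag ends any entity in progress
            flush()
    flush()  # keep an entity that runs to the end of the sequence
    return entities


if __name__ == "__main__":
    demo_tokens = ["[CLS]", "金", "庸", "的", "姐", "夫", "是", "钱", "学", "森", "[SEP]"]
    demo_labels = ["O", "B-PERSON", "E-PERSON", "O", "O", "O", "O",
                   "B-PERSON", "I-PERSON", "E-PERSON", "O"]
    print(decode_entities(demo_tokens, demo_labels))
```

Joining the tokens with an empty string suits Chinese text, where the `bert-base-chinese` tokenizer generally emits one token per character; the labels in the demo are hand-written for illustration, not actual model output.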

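The article imports py2neo, which suggests the extracted relations are eventually written to Neo4j, but that part is not visible in the source. As a hedged sketch of what that step might look like, the helper below turns one (head, relation, tail) triple into a parameterized Cypher `MERGE` statement. The function name `triple_to_cypher`, the `Person` node label, and the `name` property are all assumptions chosen for illustration.

```python
import re

# Relationship types cannot be passed as Cypher parameters, so the type is
# interpolated into the statement only after this conservative check.
_SAFE_TYPE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def triple_to_cypher(head, relation, tail):
    """Return (statement, parameters) that merges two Person nodes and one edge."""
    if not _SAFE_TYPE.match(relation):
        raise ValueError(f"unsafe relationship type: {relation!r}")
    statement = (
        "MERGE (a:Person {name: $head}) "
        "MERGE (b:Person {name: $tail}) "
        f"MERGE (a)-[:{relation}]->(b)"
    )
    return statement, {"head": head, "tail": tail}


if __name__ == "__main__":
    statement, parameters = triple_to_cypher("钱学森", "wife", "蒋英")
    print(statement)
    print(parameters)
```

With a running Neo4j instance and a connected py2neo `Graph` object, the pair could be executed as `graph.run(statement, parameters)`. Using `MERGE` rather than `CREATE` keeps repeated runs from duplicating nodes or edges.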