A Relation-Oriented Clustering Method for Open Relation Extraction

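This paper proposes RoCORE, a relation-oriented clustering method that uses labeled data of predefined relations to optimize a nonlinear mapping. The mapping converts high-dimensional entity-pair representations into relation-oriented representations, so that unlabeled novel relations can be clustered and identified. Experiments show that the method significantly reduces the error rate on two real-world datasets.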

1. Abstract

Clustering-based unsupervised relation discovery has gradually become one of the important methods for open relation extraction (OpenRE).

However, high-dimensional vectors can encode complex linguistic information, so the derived clusters cannot explicitly align with the relational semantic classes.

In this work, we propose a relation-oriented clustering model and use it to identify novel relations in unlabeled data.

Specifically, to enable the model to learn to cluster relational data, our method leverages the readily available labeled data of pre-defined relations to learn a relation-oriented representation.

We minimize the distance between instances with the same relation by gathering them towards their corresponding relation centroids to form a cluster structure, so that the learned representation is cluster-friendly.
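The centroid-gathering idea can be illustrated with a small sketch. This is not the paper's implementation: the function names `relation_centroids` and `centroid_loss`, the toy 2-D embeddings, and the relation labels are all hypothetical, and the loss is assumed to be a plain mean squared Euclidean distance to the per-relation mean.

```python
from collections import defaultdict


def relation_centroids(embeddings, labels):
    """Return the mean embedding for each relation label."""
    sums = {}
    counts = defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(vec)
        sums[lab] = [s + v for s, v in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def centroid_loss(embeddings, labels, centroids):
    """Mean squared Euclidean distance from each instance to its relation centroid."""
    total = 0.0
    for vec, lab in zip(embeddings, labels):
        total += sum((v - c) ** 2 for v, c in zip(vec, centroids[lab]))
    return total / len(embeddings)


# Toy labeled data: two instances per (hypothetical) relation.
embeddings = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 12.0]]
labels = ["founded_by", "founded_by", "born_in", "born_in"]

centroids = relation_centroids(embeddings, labels)
loss = centroid_loss(embeddings, labels, centroids)
print(centroids)
print(loss)
```

Driving this loss down pulls instances of the same relation together, which is what makes the resulting space easier to cluster. In a real model the embeddings would come from a trainable encoder and the loss would be minimized by gradient descent.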

To reduce the clustering bias towards predefined classes, we optimize the model by minimizing a joint objective on both labeled and unlabeled data.
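The abstract does not spell out the terms of the joint objective, so the following is only a sketch of the general shape such an objective could take. It assumes a labeled term (distance to relation centroids) plus a weighted unlabeled term (k-means style distance to the nearest cluster center). The function names and the `unlabeled_weight` parameter are hypothetical and are not taken from the paper.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def labeled_term(embeddings, labels, centroids):
    """Mean squared distance of each labeled instance to its relation centroid."""
    return sum(sq_dist(e, centroids[l]) for e, l in zip(embeddings, labels)) / len(embeddings)


def unlabeled_term(embeddings, cluster_centers):
    """Mean squared distance of each unlabeled instance to its nearest cluster center
    (an assumed, k-means style stand-in for the unlabeled part of the objective)."""
    return sum(min(sq_dist(e, c) for c in cluster_centers) for e in embeddings) / len(embeddings)


def joint_objective(lab_emb, lab_y, centroids, unlab_emb, cluster_centers, unlabeled_weight=1.0):
    """Weighted sum of the labeled and unlabeled terms."""
    return (labeled_term(lab_emb, lab_y, centroids)
            + unlabeled_weight * unlabeled_term(unlab_emb, cluster_centers))


# Toy data: one predefined relation and two candidate novel clusters.
lab_emb = [[0.0, 0.0], [2.0, 0.0]]
lab_y = ["r1", "r1"]
centroids = {"r1": [1.0, 0.0]}
unlab_emb = [[5.0, 5.0], [5.0, 7.0], [9.0, 9.0]]
cluster_centers = [[5.0, 6.0], [9.0, 9.0]]

value = joint_objective(lab_emb, lab_y, centroids, unlab_emb, cluster_centers, unlabeled_weight=0.5)
print(value)
```

Because both terms are computed on the same representation space, the unlabeled data keeps the learned mapping from fitting only the predefined relations.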

Experimental results show that, compared with current SOTA methods, our method reduces the error rate by 29.2% and 15.7% on the two datasets, respectively.

2. Introduction

Relation extraction (RE), a crucial basic task in the field of information extraction, is of the utmost practical interest to various fields including web search (Xiong et al., 2017), knowledge base completion (Bordes et al., 2013), and question answering (Yu et al., 2017).

However, conventional RE paradigms such as supervision and distant supervision are generally designed for pre-defined rel

### Example: keyword clustering in Python

Keyword clustering can be implemented in Python in several ways. The approach shown here uses TF-IDF vectorization followed by K-Means clustering.

#### Data preparation and preprocessing

First prepare the set of documents to cluster and apply some basic cleaning:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
]


# Text cleaning function (simplified): lowercase and normalize whitespace
def clean_text(text):
    text = text.lower()
    words = text.split()
    return ' '.join(words)


cleaned_documents = [clean_text(doc) for doc in documents]
```

#### Feature extraction

Use `TfidfVectorizer` to turn the text into a numeric feature matrix that the model can be trained on.

```python
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(cleaned_documents)
```

#### Applying K-Means clustering

The clustering itself is done with the `KMeans` class from scikit-learn. The number of clusters is assumed here to be 2; if it is not known in advance, a suitable k has to be determined first.

```python
num_clusters = 2
# n_init is set explicitly so the behavior is the same across scikit-learn versions
km = KMeans(n_clusters=num_clusters, n_init=10, random_state=42).fit(X)
labels = km.labels_
print(labels)
```

#### Visualizing the result

To make the clustering easier to inspect, reduce the features to two dimensions with a technique such as PCA and draw a scatter plot of the samples.

```python
pca = PCA(n_components=2)
reduced_features = pca.fit_transform(X.toarray())

plt.scatter(reduced_features[:, 0], reduced_features[:, 1],
            c=labels, cmap='viridis', marker='o')

# Label each point with the first 15 characters of its document
for i, txt in enumerate(documents):
    plt.annotate(txt[:15] + '...', (reduced_features[i, 0], reduced_features[i, 1]))

plt.title("Document Clustering Visualization")
plt.show()
```

The steps above show how to build a simple keyword clustering system in Python. More sophisticated options also exist, such as LDA topic modeling or word embeddings, which may further improve results in this setting.
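The K-Means step in this example takes the number of clusters as known. When it is not, one common heuristic is to try several values of k and keep the one with the highest silhouette score. The sketch below is self-contained and reuses the same toy documents; the helper name `pick_k` and the candidate range are assumptions made for illustration, and with only five documents the scores are not very meaningful.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
]

X = TfidfVectorizer(stop_words="english").fit_transform(documents)


def pick_k(X, candidates):
    """Fit K-Means for each candidate k and return the k with the best silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# silhouette_score needs between 2 and n_samples - 1 clusters, so try k = 2..4 here.
best_k, scores = pick_k(X, range(2, 5))
print(best_k, scores)
```

The silhouette score is only one option; the elbow method on `KMeans.inertia_` is another frequently used heuristic.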