Knowledge Graph Question Answering | (2) An Overview of Entity Linking

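This post is mainly organized from Chapter 3, "Entity Linking", of the book 《智能问答》 (Intelligent Question Answering) by Nan Duan of Microsoft Research Asia.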

1. Introduction

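Entity linking is the task of associating entity mentions that appear in natural language text with their corresponding entities in a knowledge graph. It underlies many natural language processing tasks, such as knowledge graph construction, information/event extraction, semantic search, and question answering. Depending on the length of the input text, the task can be further divided into long-text entity linking and short-text entity linking: the input of long-text entity linking is usually a document (for example, a Wikipedia article), while the input of short-text entity linking is usually a sentence (for example, a natural language question). Because traditional entity linking research has mostly focused on long text, this post systematically introduces the typical methods for long-text entity linking and, for each class of methods, explains how it can be used in a question answering system to perform short-text entity linking. Entity linking methods that cannot be applied directly to question answering are described only briefly.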

This chapter divides a typical entity linking system into the following three submodules:

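  1. The candidate entity generation module detects the set of entity mentions $M$ in the input text (covering all entities mentioned in the text) and finds, in the given knowledge graph, the candidate entity set $E_m$ that each entity mention $m \in M$ may correspond to. Common candidate entity generation methods include dictionary matching, surface form expansion, and statistical models;
  2. The candidate entity ranking module (entity disambiguation) scores and ranks the candidates in the candidate entity set $E_m$ of each entity mention $m$, and outputs the highest-scoring candidate as the linking result for $m$. Common candidate entity ranking methods include supervised and unsupervised learning methods;
  3. The unlinkable mention prediction module predicts which entity mentions in the input text cannot be linked to the knowledge graph. This usually results from the incompleteness of the knowledge graph itself: the entity mentioned in the text is not yet covered by the existing knowledge graph, so no corresponding entity can be found in it.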
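
The three modules can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions: the toy candidate table, the prior scores, and the `nil_threshold` cutoff for declaring a mention unlinkable are all hypothetical and do not come from the book.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy data: mention -> candidate entities with made-up scores.
CANDIDATES: Dict[str, List[Tuple[str, float]]] = {
    "obama": [("Barack Obama", 0.9), ("Michelle Obama", 0.1)],
}


def generate_candidates(mention: str) -> List[Tuple[str, float]]:
    """Module 1: look up the candidate entity set E_m for a mention m."""
    return CANDIDATES.get(mention.lower(), [])


def rank_candidates(cands: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Module 2: order candidates by score, best first."""
    return sorted(cands, key=lambda c: c[1], reverse=True)


def link(mention: str, nil_threshold: float = 0.5) -> Optional[str]:
    """Module 3: return None (unlinkable) when there is no candidate
    or the best score falls below the assumed threshold."""
    ranked = rank_candidates(generate_candidates(mention))
    if not ranked or ranked[0][1] < nil_threshold:
        return None
    return ranked[0][0]
```

Calling `link("Obama")` on this toy table returns the top-ranked entity, while a mention that is missing from the table yields `None`.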

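The figure below shows an entity linking example in the knowledge graph question answering setting. For the natural language question "where was President Obama born", the entity linking system takes the question word Obama as an entity mention and links it to the knowledge graph entity Barack Obama.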
[Figure: entity linking example for the question "where was President Obama born"]

2. Candidate Entity Generation

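The goal of candidate entity generation is to detect the entity mention set $M$ in the input text and to find, in the given knowledge graph, the candidate entity set $E_m$ for each entity mention $m \in M$. The two most common approaches are introduced next: dictionary matching and statistical learning.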

2.1 Dictionary Matching Methods

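Dictionary matching methods use a pre-extracted dictionary to generate candidate entities. The dictionary consists of $N$ <entity mention, knowledge graph entity> pairs $\{m_i, e_i\}_{i=1}^N$, where $m_i$ is an entity mention and $e_i$ is a knowledge graph entity that $m_i$ corresponds to. Given a piece of natural language text, a dictionary matching method uses the mention set $\{m_i\}_{i=1}^N$ from the dictionary to detect the entity mentions that appear in the text, and then follows the pairing to find all candidate knowledge graph entities for each mention. Note that the same mention may correspond to several different candidate entities, because different knowledge graph entities can share the same name. For example, the mention Michael Jordan can be linked either to the basketball player Michael Jordan or to the machine learning researcher Michael Jordan, so the entity linking system has to disambiguate among the candidates according to the content of the input text. Methods for scoring and ranking candidates are described later. We first turn to the most important part of dictionary matching: extracting the matching dictionary, that is, extracting <entity mention, knowledge graph entity> pairs to build it.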
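
To make the lookup concrete, here is a minimal sketch of exact dictionary matching. It assumes whitespace tokenization, lowercased mentions, and a greedy longest-match scan; the function names, the `max_len` parameter, and the entity labels in the example are illustrative choices, not something prescribed by the book.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_dictionary(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group <mention, entity> pairs so one mention maps to all its entities."""
    dictionary: Dict[str, Set[str]] = defaultdict(set)
    for mention, entity in pairs:
        dictionary[mention.lower()].add(entity)
    return dict(dictionary)


def find_mentions(text: str, dictionary: Dict[str, Set[str]],
                  max_len: int = 4) -> List[Tuple[str, Set[str]]]:
    """Scan the text left to right, preferring the longest dictionary match.

    Tokens are split on whitespace only, so punctuation attached to a
    word will prevent a match in this simplified version.
    """
    tokens = text.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            if span in dictionary:
                found.append((span, dictionary[span]))
                i += n
                break
        else:
            i += 1  # no dictionary mention starts at this token
    return found


# Hypothetical pairs for illustration.
PAIRS = [
    ("Michael Jordan", "Michael Jordan (basketball player)"),
    ("Michael Jordan", "Michael Jordan (machine learning researcher)"),
    ("Obama", "Barack Obama"),
]
DICTIONARY = build_dictionary(PAIRS)
```

With these pairs, the mention "michael jordan" comes back with two candidates, which is exactly the ambiguity that the ranking step has to resolve.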

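The most common way to extract a matching dictionary is to exploit the inherent connections among entity titles, redirect pages, disambiguation pages, bold phrases, and hyperlinks on Wikipedia to extract <entity mention, knowledge graph entity> pairs. The table below gives the specific dictionary construction method for each type of data. Because Wikipedia aligns well with many other knowledge graphs, including Freebase, a dictionary obtained this way can also be used for entity linking against those knowledge graphs.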

[Table: dictionary construction methods based on different types of Wikipedia data]

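Knowledge graphs are another source of <entity mention, knowledge graph entity> pairs. In Freebase, for example, the alias set of most entities can be found through the predicate common.topic.alias. Each alias forms an <m,e> pair with the entity it belongs to.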
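
A minimal sketch of this extraction, assuming the knowledge graph is available as (subject, predicate, object) triples. The entity identifiers and the second predicate below are made up for illustration; only the predicate name common.topic.alias comes from the text.

```python
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def alias_pairs(triples: Iterable[Triple],
                alias_predicate: str = "common.topic.alias") -> List[Tuple[str, str]]:
    """Turn every alias triple (entity, predicate, alias) into an <m, e> pair."""
    return [(obj, subj) for subj, pred, obj in triples if pred == alias_predicate]


# Hypothetical triples; real Freebase identifiers look different.
TRIPLES = [
    ("entity:ibm", "common.topic.alias", "IBM"),
    ("entity:ibm", "common.topic.alias", "Big Blue"),
    ("entity:ibm", "example.founded_in", "1911"),
]
```

The resulting pairs can be merged into the same mention dictionary as the pairs extracted from Wikipedia.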

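Acronyms appear frequently in natural language text; for example, IBM stands for International Business Machines. Linking acronyms to entities matters for many natural language processing tasks. Common methods for extracting acronyms and their corresponding knowledge graph entities fall into two groups: heuristic rule methods and supervised learning methods. Heuristic rule methods use predefined templates to extract acronyms and the full names of their corresponding knowledge graph entities from large corpora. Common templates include "abbreviation (full name)" and "full name (abbreviation)". Take the first template: a matching text fragment consists of an abbreviation part and a full-name part, where the abbreviation is made up of capital letters and the first letters of the words in the full name, concatenated, are exactly equal to the abbreviation. The extracted <abbreviation, full name> then forms an <entity mention, knowledge graph entity> pair. For example, the fragment "…HP(Hewlett-Packard)…" in Wikipedia text matches the first template, so the acronym HP can be paired with Hewlett-Packard. Heuristic rule methods cannot handle cases in which the initial letters are reordered: for example, CCP differs from CPC, the initial-letter sequence of Communist Party of China, so the templates above miss this case. Supervised learning methods handle this kind of problem well. For example, Zhang et al. extract positive and negative <abbreviation, full name> samples from text and train an SVM classifier with features such as string similarity and part-of-speech tags to decide whether a given abbreviation and a candidate full name correspond [1].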
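
The first template, "abbreviation (full name)", can be sketched with a regular expression plus an initial-letter check. This is an illustrative simplification: it handles ASCII parentheses only, splits the full name on spaces and hyphens, and counts every word, including function words such as "of". The pattern and function names are assumptions.

```python
import re
from typing import List, Tuple

# An all-caps abbreviation immediately followed by a parenthesized full name.
TEMPLATE = re.compile(r"\b([A-Z]{2,})\s*\(([^()]+)\)")


def initials(full_name: str) -> str:
    """Concatenate the upper-cased first letter of every word."""
    words = [w for w in re.split(r"[\s\-]+", full_name.strip()) if w]
    return "".join(w[0].upper() for w in words)


def extract_acronyms(text: str) -> List[Tuple[str, str]]:
    """Return <abbreviation, full name> pairs whose initials match exactly."""
    pairs = []
    for match in TEMPLATE.finditer(text):
        abbr, full = match.group(1), match.group(2).strip()
        if initials(full) == abbr:
            pairs.append((abbr, full))
    return pairs


SAMPLE = ("HP(Hewlett-Packard) and IBM (International Business Machines) "
          "match the template; CCP (Communist Party of China) does not.")
```

On `SAMPLE`, the HP and IBM fragments pass the initial-letter check, while the CCP fragment is rejected, which mirrors the reordering limitation described above.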

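In addition, some work builds dictionaries from search engine query logs. For example, Chakrabarti et al. [2] and Cheng et al. [3] identify from search engine logs the set of web page URLs that represent a knowledge graph entity well, extract the entity's possible aliases from the user queries that led to clicks on those URLs, and use a series of statistical features to select high-quality aliases, each forming an <m,e> pair with the entity. Taneva et al. extract the full names of acronyms (for example, CMU and Carnegie Mellon University) from sets of co-clicked queries [4], so that each acronym and its full entity name form an <m,e> pair. Another option is to send an entity mention extracted from the input text, together with its context, as a query to a general-purpose search engine [5-8] or to Wikipedia's site search [9]; if the returned results include a Wikipedia page, the knowledge graph entity corresponding to that page is paired with the mention as an <m,e> pair.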

Once the matching dictionary is built, there are two main ways to use it to recognize entity mentions in the input text:

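  • The first is exact match: every entity mention in the text must appear verbatim in the matching dictionary. Exact match is easy to implement, but it requires the dictionary's mention set to have high coverage. As soon as a mention varies in form, matching fails.
  • The second is fuzzy match, which allows a mention in the text to differ somewhat on the surface from the corresponding mention in the dictionary. Common fuzzy matching mechanisms include: 1. if a mention in the text is fully contained in a dictionary mention, or fully contains one, the two mentions match; 2. if a mention in the text and a dictionary mention share a certain degree of word overlap, the two mentions match; 3. if a mention in the text and a dictionary mention are highly similar according to a string similarity measure (for example character Dice score, skip bigram Dice score, Hamming distance, or edit distance), the two mentions match [10-14].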
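
The three fuzzy matching mechanisms can be sketched as follows. The concrete choices here are assumptions made for illustration: word overlap is measured with Jaccard similarity, the character Dice score is computed over sets of character bigrams, and the thresholds 0.5 and 0.7 are arbitrary. An edit distance function is included as an alternative similarity measure.

```python
from typing import Set


def contains(a: str, b: str) -> bool:
    """Mechanism 1: one mention fully contains the other."""
    a, b = a.lower(), b.lower()
    return a in b or b in a


def word_overlap(a: str, b: str) -> float:
    """Mechanism 2: Jaccard overlap between the two word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def _bigrams(s: str) -> Set[str]:
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def char_bigram_dice(a: str, b: str) -> float:
    """Mechanism 3: Dice coefficient over character bigram sets."""
    ba, bb = _bigrams(a), _bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_match(a: str, b: str, overlap_threshold: float = 0.5,
                dice_threshold: float = 0.7) -> bool:
    """Accept the pair if any of the three mechanisms fires."""
    return (contains(a, b)
            or word_overlap(a, b) >= overlap_threshold
            or char_bigram_dice(a, b) >= dice_threshold)
```

For instance, the misspelling "Micheal Jordan" fails exact match against "Michael Jordan" but passes the bigram Dice check here. In practice the thresholds would be tuned on held-out data, since loose thresholds increase the number of spurious candidates.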

2.2 Statistical Learning Methods

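Dictionary matching methods detect entity mentions in the input text using a pre-extracted mention set. If a mention never appeared in the corpus from which the dictionary was extracted, these methods cannot handle it. To address this problem, this subsection introduces statistical learning methods for entity mention detection. These methods learn a statistical model from features extracted on annotated data and can detect previously unseen entity mentions, so they generalize better. In natural language processing this task has a dedicated name: named entity recognition (NER), also called entity extraction.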

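Early NER relied mainly on dictionaries and rules. As machine learning developed, models based on conditional random fields (CRF) [15] came to dominate NER. This approach treats NER as a sequence labeling task: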
[Image: sequence labeling formulation of NER]
$x = (x_1, \dots, x_n)$ denotes the word sequence of the input text, and $y = (y_1, \dots, y_n)$ denotes the corresponding label sequence.
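
Once a sequence labeler has produced one label per word, the labels have to be turned back into entity mentions. The sketch below assumes a simple B/I/O tagging scheme (B begins a mention, I continues it, O is outside any mention). The scheme and the hand-written tags are assumptions for illustration; in a real system the tags would come from a trained model such as a CRF.

```python
from typing import List


def bio_to_mentions(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect the token spans labeled B followed by zero or more I tags."""
    mentions: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            # An O tag (or a stray I with no open mention) closes any open span.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Hand-written tags standing in for model output.
TOKENS = "where was President Obama born".split()
TAGS = ["O", "O", "O", "B", "O"]
```

Here `bio_to_mentions(TOKENS, TAGS)` recovers the single mention "Obama", which can then be passed on to candidate lookup.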
