GENIA Project: The GENIA Corpus

Latest recommended article published 2024-04-07 04:15:00

weixin_34090643

License: CC 4.0 BY-SA

Tags: databases

Original article: https://yq.aliyun.com/articles/373010

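The GENIA corpus is a natural language processing resource for the biomedical domain. Its POS annotation follows the Penn Treebank standard with some adjustments. The corpus is distributed in two formats, PTB-like and "Merged" gpml; the latter handles tokens that would otherwise be split by <term> tags. The annotation was first produced automatically by the JunK tokenizer and then corrected by hand. The annotation guidelines and technical reports are available on the resource download page.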

GENIA corpus

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.

The corpus contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH terms "human", "blood cells", and "transcription factors". The corpus has been annotated with various levels of linguistic and semantic information.

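PubMed is a free search engine for papers and abstracts in the biomedical sciences, backed by the MEDLINE database. Its core subject is medicine, but it also covers related fields such as nursing and other health disciplines, and it offers fairly comprehensive coverage of related biomedical areas such as biochemistry and cell biology. The search engine is provided by the United States National Library of Medicine as part of the Entrez information retrieval system. PubMed does not include the full text of journal articles, but it may link to full-text providers (paid or free).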
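To make the selection step concrete, here is a minimal sketch of how a three-term MeSH query like the one described above could be expressed against PubMed. It only builds the query string and a request URL and sends nothing over the network. The helper names (`build_mesh_query`, `build_esearch_url`) and the `retmax` default are illustrative assumptions. The term spellings are taken from the text, and the exact query the GENIA authors ran is not given here, so this should not be read as a reproduction of their search.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint for Entrez databases such as PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The three MeSH terms named in the corpus description (spelling as in the text).
GENIA_TERMS = ["human", "blood cells", "transcription factors"]


def build_mesh_query(terms):
    """Join terms into a PubMed query that requires every term as a MeSH heading."""
    return " AND ".join(f'"{term}"[MeSH Terms]' for term in terms)


def build_esearch_url(terms, retmax=20):
    """Build (but do not send) an ESearch request URL for the given MeSH terms.

    retmax caps the number of returned PubMed IDs; 20 is an arbitrary
    illustrative default, not a value from the GENIA project.
    """
    params = {"db": "pubmed", "term": build_mesh_query(terms), "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_mesh_query(GENIA_TERMS))
    print(build_esearch_url(GENIA_TERMS))
```

A query run today would return far more than 1,999 records, since MEDLINE has grown considerably since the corpus was assembled.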

The primary categories of annotation in the GENIA corpus and the corresponding subcorpora are