Introduction to the OHSUMED Dataset

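The OHSUMED dataset is drawn from medical articles indexed in MEDLINE between 1987 and 1991. It contains 348,566 documents and is used for document filtering tasks. Each document has 8 fields, including the OHSUMED sequence number, the MEDLINE identifier, the source, and the MeSH index terms. The collection also provides 106 queries from physicians, each consisting of patient information and an information need, together with query-document pairs judged as relevant, partially relevant, or not relevant for evaluating retrieval effectiveness.
1. Introduction to the OHSUMED dataset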

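The experiments here use the OHSUMED test collection, which was also used in the filtering track of the 9th Text REtrieval Conference (TREC-9). OHSUMED was built by William Hersh and his colleagues. Its documents come from the medical information database MEDLINE and consist of titles and/or abstracts from 270 medical journals over the five years from 1987 to 1991, for a total of 348,566 documents. Each OHSUMED document consists of 8 fields, with the following meanings: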

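    .I    OHSUMED sequence number of the article, from 1 to 348566
    .U    MEDLINE identifier
    .S    Source of the article
    .M    MeSH index terms
    .T    Title of the article
    .P    Publication type
    .W    Abstract of the article
    .A    Authors of the article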

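The OHSUMED authors also built 106 queries for the collection. The queries come from query strings that physicians submitted while seeing patients, and each has two parts: a brief description of the patient's condition and a description of the information needed. An OHSUMED query consists of the following 3 fields: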

    .I    Sequence number of the query, from 1 to 106
    .B    Patient information
    .W    Information request

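Based on the document and query collections above, OHSUMED annotates 16,140 query-document pairs. Each pair is judged as definitely relevant, partially relevant, or not relevant. The final annotations contain 2,557 definitely relevant, 2,932 partially relevant, and 12,498 not relevant query-document pairs (a document may be assigned more than one level; in the experiments in this section, the highest level is taken as its final label).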
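The rule of keeping the highest relevance level for a pair that was judged more than once can be sketched as follows. This is a minimal illustration rather than the authors' own code: the function name `resolve_labels` and the input layout (tuples of query id, document id, label) are assumptions, and the single-letter labels `d`, `p`, `n` follow the codes used in the `judged` file description.

```python
# Relevance codes ordered from lowest to highest level.
# d = definitely relevant, p = partially/possibly relevant, n = not relevant.
LEVELS = {"n": 0, "p": 1, "d": 2}


def resolve_labels(judgments):
    """Collapse repeated judgments to one label per (query, document) pair.

    `judgments` is an iterable of (query_id, doc_id, label) tuples
    (a hypothetical in-memory layout, not the on-disk file format).
    The highest level seen for each pair wins.
    """
    best = {}
    for query_id, doc_id, label in judgments:
        key = (query_id, doc_id)
        if key not in best or LEVELS[label] > LEVELS[best[key]]:
            best[key] = label
    return best


if __name__ == "__main__":
    # Made-up judgments purely for illustration.
    demo = [(1, 10, "n"), (1, 10, "p"), (1, 11, "d"), (1, 11, "n")]
    print(resolve_labels(demo))
```

With the made-up input above, pair (1, 10) resolves to `p` and pair (1, 11) resolves to `d`, because each pair keeps only its highest level.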

Here are the files, their uncompressed size, and a description of their content:

1)  ohsumed.87 (60,303,307) — Contains the MEDLINE documents for the year 1987.  The format for each of the MEDLINE document files follows the conventions of the SMART system, with each field defined as below (NLM designator in parentheses):
    .I    sequential identifier
    .U    MEDLINE identifier (UI)
    .M    Human-assigned MeSH terms (MH)
    .T    Title (TI)
    .P    Publication type (PT)
    .W    Abstract (AB)
    .A    Author (AU)
    .S    Source (SO)
(Note:  Some references have their abstracts truncated at 250 words, while some have no abstracts at all.)
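As a rough sketch of how the field layout above might be read, the following parser splits a stream of lines into one dictionary per record. The function name `parse_records` is hypothetical, and the layout details are assumptions not stated in the readme: a new record starts at each `.I` line, and a field's value may sit either on the tag line itself or on the lines that follow it. The sample record is made up.

```python
import io

# Field tags for document files, as listed in the readme.
DOC_TAGS = frozenset({".I", ".U", ".M", ".T", ".P", ".W", ".A", ".S"})


def parse_records(lines, tags=DOC_TAGS):
    """Yield one dict per record, keyed by field tag.

    Assumes each record begins with a `.I` line and that field text
    may appear on the tag line or on the following lines.
    """
    record, tag = None, None
    for raw in lines:
        line = raw.rstrip("\r\n")
        head, _, rest = line.partition(" ")
        if head in tags:
            if head == ".I":
                if record is not None:
                    yield record
                record = {}
            if record is None:
                continue  # skip any tag that appears before the first .I
            tag = head
            record[tag] = rest.strip()
        elif record is not None and tag is not None:
            text = line.strip()
            if text:
                # Continuation line: append to the current field.
                record[tag] = (record[tag] + " " + text).strip()
    if record is not None:
        yield record


if __name__ == "__main__":
    # Made-up record for illustration only.
    sample = ".I 1\n.U\n00000001\n.T\nA made-up title.\n"
    for doc in parse_records(io.StringIO(sample)):
        print(doc)
```

Passing `tags={".I", ".B", ".W"}` should let the same function read a query file, under the same layout assumptions. A body line that happens to start with one of the tags followed by a space would be misread as a new field, so real data may need a stricter check.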

2)  ohsumed.88 (78,585,929) — Contains the MEDLINE documents for the year 1988, formatted as above.

3)  ohsumed.89 (84,719,077) — Contains the MEDLINE documents for the year 1989, formatted as above.

4)  ohsumed.90 (86,754,890) — Contains the MEDLINE documents for the year 1990, formatted as above.

5)  ohsumed.91 (89,761,122) — Contains the MEDLINE documents for the year 1991, formatted as above.

6)  queries (11,591) — Contains the 106 queries in test set, with patient and topic information, in the format:
    .I    Sequential identifier
    .B    Patient information
    .W    Information request

7)  drel.ui (26,919) — Contains the query-document pairs rated as definitely relevant, with documents listed by MEDLINE UI, in the format:
   

8)  drel.i (21,709) — Contains the query-document pairs rated as definitely relevant, with documents listed by sequential number (from the .I field),  in the format:
   

9)  pdrel.ui (57,831) — Contains the query-doc pairs rated as definitely or possibly relevant, with documents listed by MEDLINE UI,  in the format:
   

10)  pdrel.i (46,664) — Contains the query-doc pairs rated as definitely or possibly relevant, with documents listed by sequential number (from the .I field),  in the format:
   

11)  judged (368,366) — Contains a list of all retrieved documents by any of the five original searchers or SMART, sorted first by query number and then document number, along with their relevance judgments.  The relevance judgments are either d (definitely relevant), p (possibly relevant), or n (not relevant).  The relevance1 judgment is the original relevance judgment done on the documents retrieved by the original searchers.  The relevance 2 judgment is the second relevance judgment done to assess interobserver reliability of the relevance1 judgments.  The relevance3 judgment is the relevance judgment done on documents retrieved by SMART but not the original searchers, or another relevance judgment on an originally retrieved document to assess interobserver reliability.
   
    [ ][ ]

12)  ui (3,137,094) — Contains the MEDLINE UI’s for all 348,566 documents in test database, listed one per line.

13)  readme — This file.
http://ir.ohsu.edu/ohsumed/ohsumed.html
