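Dataset source
SQuAD 2.0, released by Stanford University in 2018, adds more than 50,000 unanswerable questions to the original SQuAD.
The paper, Know What You Don't Know: Unanswerable Questions for SQuAD (https://arxiv.org/abs/1806.03822), was published at ACL 2018.
The paper links to the dataset download page, https://rajpurkar.github.io/SQuAD-explorer/, which provides the training and development sets. The test set is not released: a model has to be submitted through the official site, where the platform scores it on the test set. The site also reports human performance and the test-set results of a number of models.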
Human: EM 86.831, F1 89.452
Models: the highest EM and F1 scores belong to the SA-Net on Albert ensemble submitted on 2020-04-06, with EM 90.724 and F1 93.011.
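To make the two leaderboard metrics concrete, here is a minimal sketch of how EM (exact match) and token-level F1 can be computed for a single prediction. It follows the general recipe of the official evaluation script (lowercase, strip punctuation, drop English articles, compare tokens), but it is a simplified illustration, not the script itself; the function names are my own.

```python
import collections
import re
import string


def normalize_answer(text):
    """Lowercase, strip punctuation, drop English articles, collapse whitespace."""
    text = text.lower()
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, truth):
    """1 if the normalized strings are identical, else 0."""
    return int(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction, truth):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()
    # Unanswerable questions use the empty string as the reference:
    # the score is 1 only if the prediction is empty as well.
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(true_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The late 1990s", "late 1990s"))
print(f1_score("in the late 1990s", "late 1990s"))
```

The leaderboard numbers are these per-question scores averaged over the whole test set and reported as percentages; when a question has several reference answers, the best score among them is taken.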
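Training set statistics
The dataset is distributed as JSON files: the training set is about 40 MB and the development set about 4 MB.
The articles in the training set come from Wikipedia and cover topics such as people, electronic products, cities, and religion. It contains 442 articles, 19035 context paragraphs, and 130319 questions (86821 answerable, 43498 unanswerable).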
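The counts above can be reproduced by walking the nested JSON structure. The sketch below assumes the layout described later in this post (`data` → `paragraphs` → `qas`, with an `is_impossible` flag on each question); `squad_stats` and the tiny `sample` dict are made up for illustration, so the block runs without the real file.

```python
def squad_stats(dataset):
    """Count articles, context paragraphs and questions in a SQuAD-style dict."""
    stats = {"articles": 0, "paragraphs": 0, "questions": 0,
             "answerable": 0, "unanswerable": 0}
    for article in dataset["data"]:
        stats["articles"] += 1
        for paragraph in article["paragraphs"]:
            stats["paragraphs"] += 1
            for qa in paragraph["qas"]:
                stats["questions"] += 1
                # SQuAD 1.x files have no is_impossible flag; treat as answerable.
                if qa.get("is_impossible", False):
                    stats["unanswerable"] += 1
                else:
                    stats["answerable"] += 1
    return stats


# Hand-made sample in the same shape as the real file (not real data).
sample = {
    "version": "v2.0",
    "data": [
        {"title": "Example",
         "paragraphs": [
             {"context": "First paragraph.",
              "qas": [{"id": "q1", "question": "Q1?", "is_impossible": False,
                       "answers": [{"text": "First", "answer_start": 0}]},
                      {"id": "q2", "question": "Q2?", "is_impossible": True,
                       "answers": []}]},
             {"context": "Second paragraph.",
              "qas": [{"id": "q3", "question": "Q3?", "is_impossible": False,
                       "answers": [{"text": "Second", "answer_start": 0}]}]},
         ]},
    ],
}

print(squad_stats(sample))
```

Passing the dict loaded from the real training file instead of `sample` should give the article, paragraph, and question counts listed above.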
Training set content example
The training set is used below to briefly illustrate the format and content of the dataset.
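import json
# data_file_path is the path to the downloaded training JSON file
with open(data_file_path, 'r', encoding='utf-8') as f:
    data = json.load(f)
print(data.keys())
# Result: dict_keys(['version', 'data'])
# 'version' is the dataset version (v2.0); 'data' holds the actual training records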
The data field is a list. Each element is a dict for one article, containing its text passages together with the questions and answers based on them.
data = data['data']
print(len(data))
# The training set contains 442 articles
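Field | Description |
---|---|
title | Article title, e.g. Frédéric_Chopin |
paragraphs | List of paragraphs from the Wikipedia article; each element holds a context and its qas |
qas | List of question-answer entries for the paragraph (inside each paragraphs element) |
context | Paragraph text (inside each paragraphs element) |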
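Since the fields in this table are nested three levels deep, it is often convenient to flatten them into one row per question. The following is a minimal sketch under that assumption; `iter_examples` and the sample record are hypothetical and only mimic the shape of the real data.

```python
def iter_examples(articles):
    """Yield one flat dict per question from the nested article list."""
    for article in articles:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                yield {
                    "title": article["title"],
                    "context": paragraph["context"],
                    "id": qa["id"],
                    "question": qa["question"],
                    # Unanswerable questions carry an empty answers list.
                    "answers": [a["text"] for a in qa.get("answers", [])],
                    "is_impossible": qa.get("is_impossible", False),
                }


# Hand-made record in the same shape as one element of data (not real data).
sample_articles = [
    {"title": "Example",
     "paragraphs": [
         {"context": "Paris is the capital of France.",
          "qas": [{"id": "q1", "question": "What is the capital of France?",
                   "answers": [{"text": "Paris", "answer_start": 0}],
                   "is_impossible": False},
                  {"id": "q2", "question": "Who founded Paris?",
                   "answers": [],
                   "is_impossible": True}]},
     ]},
]

for row in iter_examples(sample_articles):
    print(row["id"], row["is_impossible"], row["answers"])
```

With the real file, the list stored under the data key can be passed in place of `sample_articles`.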
The first question and context of the first record are shown below:
data[0]
# Result (partial)
{'title': 'Beyoncé',
 'paragraphs': [
  {'qas': [{'question': 'When did Beyonce start becoming popular?',
            'id': '56be85543aeaaa14008c9063',
            'answers': [{'text': 'in the late 1990s', 'answer_start': 269}],
            'is_impossible': False}],
   'context': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".'}
 ]
}
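The answer_start value is a character offset into context, so slicing the context from that offset for the length of the answer text should return the answer itself. Below is a small sketch of that sanity check; `answer_span_matches` is a helper name of my own, and the context is a shortened, hand-made sentence, not the full paragraph from the dataset.

```python
def answer_span_matches(context, answer):
    """Return True if the slice starting at answer_start equals the answer text."""
    start = answer["answer_start"]
    end = start + len(answer["text"])
    return context[start:end] == answer["text"]


# Shortened, hand-made context; the offset is computed rather than hard-coded.
context = "She rose to fame in the late 1990s as lead singer of Destiny's Child."
text = "in the late 1990s"
good = {"text": text, "answer_start": context.index(text)}
bad = {"text": text, "answer_start": 0}

print(answer_span_matches(context, good))  # True
print(answer_span_matches(context, bad))   # False
```

Running the same check over every answer in the training file is a quick way to catch offset problems introduced by preprocessing such as whitespace stripping or Unicode normalization.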