1 Introduction
Characteristics of earlier MRC techniques: hand-crafted rules or features.
Drawbacks:
- They do not generalize well.
- Performance may degrade on large-scale datasets covering myriad types of articles; such methods ignore long-range dependencies and fail to extract contextual information.
Different research topics within MRC and the corresponding number of publications:
A good introductory survey of MRC should:
- Give concrete definitions of the different MRC tasks
- Compare them in depth
- Introduce new trends and open issues
Survey methodology:
- Google Scholar, keywords: machine reading comprehension, machine comprehension, reading comprehension
- Papers from top venues: ACL, EMNLP, NAACL, ICLR, AAAI, IJCAI and CoNLL, 2015–2018
- http://arxiv.org/ for the latest pre-print articles
Structure of the survey:
- The four categories of MRC tasks: cloze tests, multiple choice, span extraction, and free answering, with a comparison of these tasks along different dimensions (Section 2)
- The general architecture of neural MRC systems: embeddings, feature extraction, context-question interaction and answer prediction (Section 3)
- Representative datasets and the evaluation metrics used for the different tasks (Section 4)
- New trends such as knowledge-based MRC, MRC with unanswerable questions, multi-passage MRC and conversational MRC (Section 5)
- Open issues and possible future research directions (Section 6)
2 Tasks
Definition of MRC:
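Roughly: given a context C and a question Q, the task is to learn a predictor f that produces the answer A:

$$A = f(C, Q)$$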
The authors divide MRC into four categories according to the answer form: cloze tests, multiple choice, span extraction and free answering.
2.1 Cloze Tests
- The answer A is a word or entity in the given context C.
- The question Q is generated by removing a word or entity from the context C, such that Q = C − A.
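A toy sketch of how such a cloze pair can be constructed (the whitespace tokenizer and blank token are illustrative simplifications):

```python
# Toy construction of a cloze-test example: remove one answer word/entity A
# from the context C so that the question Q = C - A (see the bullets above).
def make_cloze(context: str, answer: str, blank: str = "_____"):
    tokens = context.split()  # naive whitespace tokenizer, for illustration
    if answer not in tokens:
        raise ValueError("the answer must appear in the context")
    question = " ".join(blank if t == answer else t for t in tokens)
    return question, answer

question, answer = make_cloze("Harry went back to Hogwarts after the holidays",
                              "Hogwarts")
# question: "Harry went back to _____ after the holidays", answer: "Hogwarts"
```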
2.2 Multiple Choice
2.3 Span Extraction
Drawbacks of cloze tests and multiple choice:
- Words or entities are not sufficient as answers; some answers require complete sentences.
- Some questions have no candidate answers.
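Span extraction addresses these drawbacks by taking the answer to be a contiguous span of the given context, so a model only needs to predict start and end token indices; a toy sketch (the indices here are made up for illustration):

```python
# In span extraction, the answer is the token span of the context lying
# between a predicted start index and a predicted end index.
context_tokens = "The Eiffel Tower is located in Paris France".split()
start, end = 6, 6                     # indices a span-extraction model predicts
answer = " ".join(context_tokens[start:end + 1])   # -> "Paris"
```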
2.4 Free Answering
There are no limitations on the answer form, which makes free answering more suitable for real application scenarios.
2.5 Comparison of Different Tasks
The tasks are compared along five dimensions: construction, understanding, flexibility, evaluation and application.
Because of the flexibility of the answer form, it is somewhat hard to build datasets, and how to effectively evaluate performance on these tasks remains a challenge.
3 Deep-Learning-Based Methods
3.1 General Architecture
A typical neural MRC system contains four core modules: embeddings, feature extraction, context-question interaction and answer prediction.
Linguistic features such as part-of-speech, named entity, and question category are combined with word representations (one-hot or word2vec) to express the semantic and syntactic information of words.
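A schematic sketch of the four-module pipeline described above (PyTorch; the sizes, BiLSTM encoder, attention-based interaction and span-style prediction head are all illustrative choices, not the architecture of any specific model in the survey):

```python
import torch
import torch.nn as nn

class NeuralMRC(nn.Module):
    """Toy four-module pipeline: embeddings -> feature extraction ->
    context-question interaction -> answer prediction (span start/end)."""
    def __init__(self, vocab_size=30000, emb_dim=100, hid_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # embeddings
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True,
                               bidirectional=True)                # feature extraction
        self.interact = nn.MultiheadAttention(2 * hid_dim, num_heads=4,
                                              batch_first=True)   # context-question interaction
        self.start_head = nn.Linear(2 * hid_dim, 1)               # answer prediction
        self.end_head = nn.Linear(2 * hid_dim, 1)

    def forward(self, context_ids, question_ids):
        c, _ = self.encoder(self.embed(context_ids))   # [B, Lc, 2H]
        q, _ = self.encoder(self.embed(question_ids))  # [B, Lq, 2H]
        g, _ = self.interact(query=c, key=q, value=q)  # question-aware context
        return self.start_head(g).squeeze(-1), self.end_head(g).squeeze(-1)

logits_start, logits_end = NeuralMRC()(torch.randint(0, 30000, (2, 40)),
                                       torch.randint(0, 30000, (2, 10)))
```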
3.2 Typical Deep-Learning Methods
The components of a typical MRC system and the deep-learning methods involved:
3.2.1 Embeddings
In existing MRC models, word-representation methods can be divided into two categories: conventional word representation and pre-trained contextualized representation.
To encode sufficient semantic and linguistic information, multiple granularity (word-level/character-level embeddings, part-of-speech tags, named entities, word frequency, question category, etc.) is also added to MRC systems.
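A minimal sketch of such multi-granularity input construction, concatenating a word embedding with character-level, part-of-speech, named-entity and word-frequency features (all tensors and sizes below are illustrative placeholders):

```python
import torch

seq_len = 20                                  # tokens in one context
word_emb   = torch.randn(seq_len, 300)        # word-level embedding (word2vec/GloVe)
char_emb   = torch.randn(seq_len, 50)         # character-level embedding
pos_onehot = torch.eye(45)[torch.randint(0, 45, (seq_len,))]   # part-of-speech tag
ner_onehot = torch.eye(10)[torch.randint(0, 10, (seq_len,))]   # named-entity tag
term_freq  = torch.rand(seq_len, 1)           # word-frequency feature

# Final per-token representation fed to the feature-extraction module.
token_rep = torch.cat([word_emb, char_emb, pos_onehot, ner_onehot, term_freq],
                      dim=-1)                 # shape: [20, 406]
```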
- Conventional Word Representation
- Pre-trained Contextualized Word Representation
- CoVe
CoVe is produced by the LSTM encoder of a seq2seq machine translation (MT) model. The outputs of the MT encoder (regarded as CoVe) are concatenated with GloVe-pretrained word embeddings to represent the context and the question, which are then fed through the coattention and dynamic decoder implemented in a dynamic coattention network (DCN).
- ELMo
For example: an improved version of bidirectional attention flow (Bi-DAF) combined with ELMo. ELMo is easy to integrate into existing models, but it is limited by the LSTM's relatively weak feature-extraction ability.
- Generative Pre-Training (GPT)
A semi-supervised approach combining unsupervised pre-training and supervised fine-tuning. The transformer architecture used in GPT and GPT-2 is unidirectional (left-to-right), which cannot incorporate context from both directions. For MRC problems such as multiple choice, the context and the question are concatenated with each possible answer, and each such sequence is processed with the transformer network; finally, an output distribution over the possible answers is produced to predict the correct answer (see the multiple-choice sketch after this list).
- BERT
In particular, for MRC tasks, BERT is so competitive that using it with a simple answer prediction approach already shows promise (see the span-extraction sketch after this list).
Drawback: BERT's pre-training process is time- and resource-consuming, which makes pre-training nearly impossible without abundant computational resources.
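A toy sketch of the multiple-choice recipe described under GPT: each candidate answer is concatenated with the context and question, every resulting sequence is scored by a transformer plus a linear head, and a softmax over candidates gives the answer distribution (a small randomly initialized PyTorch encoder stands in for GPT's pre-trained unidirectional transformer; all sizes and token ids are illustrative):

```python
import torch
import torch.nn as nn

vocab, dim, n_candidates, seq_len = 1000, 64, 4, 50
embed = nn.Embedding(vocab, dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2)                              # stand-in for the pre-trained transformer
score_head = nn.Linear(dim, 1)                 # scores one candidate sequence

# One row of token ids per candidate: "[context ; question ; candidate_i]".
sequences = torch.randint(0, vocab, (n_candidates, seq_len))
hidden = encoder(embed(sequences))             # [n_candidates, seq_len, dim]
scores = score_head(hidden[:, -1, :]).squeeze(-1)   # score taken at final token
probs = torch.softmax(scores, dim=0)           # distribution over candidates
predicted = int(probs.argmax())                # index of the predicted answer
```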
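A sketch of "BERT with a simple answer prediction approach" for span-style MRC, using the Hugging Face transformers library; the specific SQuAD-finetuned checkpoint is an assumption of this sketch, not something named in the notes:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

question = "Where is the Eiffel Tower located?"
context = "The Eiffel Tower is a wrought-iron lattice tower located in Paris."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)                  # start/end logits over all tokens
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0, start:end + 1])  # e.g. "paris"
```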
- Multiple Granularity
Word-level embeddings pre-trained with word2vec or GloVe cannot encode sufficient syntactic and linguistic information (e.g., part-of-speech, affixes, grammar). To incorporate fine-grained information into word representations, the following methods are used to encode the context and the question at different levels of granularity:
- Character Embeddings
Seo et al. [75] add character-level embeddings to their Bi-DAF model for the MRC task. The concatenation of word-level and character-level embeddings is then fed to the next module as input. Ways to obtain and combine character-level representations (see the sketch below):
(1) CNN: character embeddings are fed into a convolutional neural network and pooled to produce a fixed-size character-level representation of each word.
(2) BiLSTM: character embeddings can be encoded with bidirectional LSTMs, and the outputs of the last hidden states are taken as the character-level representation.
(3) Fine-grained gating: word-level and character-level embeddings can be combined dynamically with a fine-grained gating mechanism rather than by simple concatenation.
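A minimal sketch of approach (2): encode one word's characters with a bidirectional LSTM, take the last forward/backward hidden states as its character-level representation, and concatenate them with the word-level embedding (all sizes are illustrative):

```python
import torch
import torch.nn as nn

char_vocab, char_dim, char_hid = 100, 16, 25
char_embed = nn.Embedding(char_vocab, char_dim)
char_lstm = nn.LSTM(char_dim, char_hid, batch_first=True, bidirectional=True)

word_emb = torch.randn(1, 300)                      # word-level embedding (e.g. GloVe)
char_ids = torch.randint(0, char_vocab, (1, 7))     # 7 characters of one word
_, (h_n, _) = char_lstm(char_embed(char_ids))       # h_n: [2, 1, char_hid]
char_emb = torch.cat([h_n[0], h_n[1]], dim=-1)      # last fwd/bwd states -> [1, 50]
token_rep = torch.cat([word_emb, char_emb], dim=-1) # [1, 350], fed to next module
```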