Reading Note: Gated Self-Matching Networks for Reading Comprehension and Question Answering

This note introduces R-Net, a reading-comprehension-style question-answering model that finds answers in a given passage through four components: encoding the question and passage, matching the question against the passage, a self-matching mechanism, and a pointer network that predicts the answer boundary. The model achieves state-of-the-art results on several benchmark tasks.

Abstract

The authors present gated self-matching networks for reading-comprehension-style question answering, which aims to answer questions from a given passage.

First, match the question and passage with gated attention-based recurrent networks to obtain the question-aware passage representation.
Then, utilize a self-matching attention mechanism to refine the representation by matching the passage against itself.
Finally, employ pointer networks to locate the positions of the answers in the passage.

Introduction

This model (R-Net) consists of four parts:
1. the recurrent network encoder (to build representations for the question and passage separately)
2. the gated matching layer (to match the question and passage)
3. the self-matching layer (to aggregate information from the whole passage)
4. the pointer network layer (to predict the answer boundary)

Three-fold key contributions:
1. propose a gated attention-based recurrent network, assigning different levels of importance to passage parts depending on their relevance to the question
2. introduce a self-matching mechanism, effectively aggregating evidence from the whole passage to infer the answer and dynamically refining passage representation with information from the whole passage
3. yield state-of-the-art results against strong baselines

Task Description

Given a passage $P$ and a question $Q$, predict an answer $A$ to $Q$ based on information in $P$.

Methods

[Figure: Gated Self-Matching Networks (R-NET) structure overview]

Question and Passage Encoder

Consider a question $Q = \{w_t^Q\}_{t=1}^m$ and a passage $P = \{w_t^P\}_{t=1}^n$. First, convert the words to word-level embeddings ($\{e_t^Q\}_{t=1}^m$ and $\{e_t^P\}_{t=1}^n$) and character-level embeddings ($\{c_t^Q\}_{t=1}^m$ and $\{c_t^P\}_{t=1}^n$). The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to be helpful for dealing with out-of-vocabulary tokens.

Then use a bi-directional RNN to produce new representations $\{u_t^Q\}_{t=1}^m$ and $\{u_t^P\}_{t=1}^n$:

$$u_t^Q = \mathrm{BiRNN}_Q\!\left(u_{t-1}^Q, [e_t^Q, c_t^Q]\right)$$
$$u_t^P = \mathrm{BiRNN}_P\!\left(u_{t-1}^P, [e_t^P, c_t^P]\right)$$

Here, the Gated Recurrent Unit (GRU) is used because it is computationally cheaper.
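As a concrete reference, here is a minimal PyTorch sketch of this encoder, assuming single-layer GRUs and hypothetical dimensions (the paper stacks three BiGRU layers for the question/passage encoder; the class and variable names are illustrative, not from the authors' code):

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Sketch: char-level BiGRU embedding + word embedding, then a BiGRU over the sequence."""
    def __init__(self, word_vocab, char_vocab, word_dim=300, char_dim=64, hidden=75):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)   # GloVe vectors, kept fixed in the paper
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # character-level BiGRU; its final hidden states form the char embedding of a token
        self.char_rnn = nn.GRU(char_dim, hidden, bidirectional=True, batch_first=True)
        # sequence encoder over the concatenation [e_t, c_t]
        self.enc_rnn = nn.GRU(word_dim + 2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, word_len)
        b, t, wl = chars.shape
        c = self.char_emb(chars).view(b * t, wl, -1)
        _, h_n = self.char_rnn(c)                              # h_n: (2, b*t, hidden)
        char_vec = torch.cat([h_n[0], h_n[1]], dim=-1).view(b, t, -1)
        x = torch.cat([self.word_emb(words), char_vec], dim=-1)
        u, _ = self.enc_rnn(x)                                 # u: (b, t, 2*hidden), i.e. {u_t}
        return u
```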

Gated Attention-based Recurrent Networks

Utilize a gated attention-based recurrent network (a variant of attention-based recurrent networks) to incorporate question information into passage representation.

Given $\{u_t^Q\}_{t=1}^m$ and $\{u_t^P\}_{t=1}^n$, generate the question-aware passage representation $\{v_t^P\}_{t=1}^n$ via soft alignment of words:

$$v_t^P = \mathrm{RNN}\!\left(v_{t-1}^P, [u_t^P, c_t]^{*}\right)$$

where $[u_t^P, c_t]^{*}$ is the result of applying an additional gate to the RNN input $[u_t^P, c_t]$:

$$g_t = \mathrm{sigmoid}\!\left(W_g [u_t^P, c_t]\right)$$
$$[u_t^P, c_t]^{*} = g_t \odot [u_t^P, c_t]$$

and $c_t = \mathrm{att}(u^Q, [u_t^P, v_{t-1}^P])$ is an attention-pooling vector of the whole question $u^Q$ which focuses on the relation between the question and the current passage word:

$$s_j^t = w^T \tanh\!\left(W_u^Q u_j^Q + W_u^P u_t^P + W_v^P v_{t-1}^P\right)$$
$$a_i^t = \exp(s_i^t) \Big/ \sum_{j=1}^{m} \exp(s_j^t)$$
$$c_t = \sum_{i=1}^{m} a_i^t u_i^Q$$

where the vector $w$ and all matrices $W$ contain weights to be learned.
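A minimal PyTorch sketch of one step of this gated attention-based recurrent layer, under the assumption of a GRU cell and the encoder dimensions above (all names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionCell(nn.Module):
    """One step of v_t^P = RNN(v_{t-1}^P, g_t * [u_t^P, c_t])."""
    def __init__(self, enc_dim, hidden):
        super().__init__()
        self.W_uQ = nn.Linear(enc_dim, hidden, bias=False)
        self.W_uP = nn.Linear(enc_dim, hidden, bias=False)
        self.W_vP = nn.Linear(hidden, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)
        self.gate = nn.Linear(2 * enc_dim, 2 * enc_dim, bias=False)
        self.rnn = nn.GRUCell(2 * enc_dim, hidden)

    def forward(self, uQ, uP_t, v_prev):
        # uQ: (b, m, enc_dim), uP_t: (b, enc_dim), v_prev: (b, hidden)
        s = self.w(torch.tanh(self.W_uQ(uQ)
                              + self.W_uP(uP_t).unsqueeze(1)
                              + self.W_vP(v_prev).unsqueeze(1))).squeeze(-1)  # (b, m)
        a = F.softmax(s, dim=-1)
        c_t = torch.bmm(a.unsqueeze(1), uQ).squeeze(1)   # attention-pooled question vector
        x = torch.cat([uP_t, c_t], dim=-1)
        g = torch.sigmoid(self.gate(x))                  # input gate g_t
        return self.rnn(g * x, v_prev)                   # new v_t^P
```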

Self-matching Attention

The self-matching attention addresses the problem that the question-aware passage representation has only limited knowledge of context. It dynamically
1. collects evidence from the whole passage for each word
2. encodes the evidence relevant to the current passage word and its matching question information into the passage representation $h_t^P$:

$$h_t^P = \mathrm{RNN}\!\left(h_{t-1}^P, [v_t^P, c_t]^{*}\right)$$

where $[v_t^P, c_t]^{*}$ is the gated RNN input (the same gating as above applied to $[v_t^P, c_t]$), and $c_t = \mathrm{att}(v^P, v_t^P)$ is an attention-pooling vector of the whole passage which focuses on the relation between the current passage word and the rest of the passage:

$$s_j^t = w^T \tanh\!\left(W_v^P v_j^P + W_v^{\tilde P} v_t^P\right)$$
$$a_i^t = \exp(s_i^t) \Big/ \sum_{j=1}^{n} \exp(s_j^t)$$
$$c_t = \sum_{i=1}^{n} a_i^t v_i^P$$

where the vector $w$ and all matrices $W$ contain weights to be learned.

After the self-matching layer, the authors additionally apply a bi-directional GRU to deeply integrate the matching results before feeding them into the answer pointer layer. This helps to further propagate the information aggregated by self-matching across the passage.
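A sketch of the self-matching layer in the same spirit, computing the passage-against-passage attention for all positions in one batched step and then aggregating with the bi-directional GRU mentioned above (names and dimensions are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfMatching(nn.Module):
    """Each passage word attends over the whole passage, then a BiGRU aggregates the result."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.W_vP = nn.Linear(dim, hidden, bias=False)        # projects v_j^P (keys)
        self.W_vP_tilde = nn.Linear(dim, hidden, bias=False)  # projects v_t^P (current word)
        self.w = nn.Linear(hidden, 1, bias=False)
        self.gate = nn.Linear(2 * dim, 2 * dim, bias=False)
        self.rnn = nn.GRU(2 * dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, vP):
        # vP: (b, n, dim); note the (b, n, n, hidden) score tensor is O(n^2) in memory
        s = self.w(torch.tanh(self.W_vP(vP).unsqueeze(1)          # keys:    (b, 1, n, h)
                              + self.W_vP_tilde(vP).unsqueeze(2)  # queries: (b, n, 1, h)
                              )).squeeze(-1)                      # scores:  (b, n, n)
        a = F.softmax(s, dim=-1)
        c = torch.bmm(a, vP)                                      # (b, n, dim), one c_t per word
        x = torch.cat([vP, c], dim=-1)
        x = torch.sigmoid(self.gate(x)) * x                       # gated input
        h, _ = self.rnn(x)                                        # (b, n, 2*hidden), i.e. {h_t^P}
        return h
```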

Output Layer

Use attention-pooling over the question representation to generate the initial hidden vector for the pointer network, which predicts the start and end positions of the answer.

Given a passage representation $\{h_t^P\}_{t=1}^n$, the attention mechanism is utilized as a pointer to select the start position $p^1$ and the end position $p^2$:

$$s_j^t = w^T \tanh\!\left(W_h^P h_j^P + W_h^a h_{t-1}^a\right)$$
$$a_i^t = \exp(s_i^t) \Big/ \sum_{j=1}^{n} \exp(s_j^t)$$
$$p^t = \arg\max\left(a_1^t, a_2^t, \ldots, a_n^t\right)$$

where $h_{t-1}^a$ represents the last hidden state of the pointer network (the answer recurrent network). The next hidden state $h_t^a$ is computed from the attention-pooling vector $c_t$ based on the current predicted probabilities $a^t$:

$$c_t = \sum_{i=1}^{n} a_i^t h_i^P$$
$$h_t^a = \mathrm{RNN}\!\left(h_{t-1}^a, c_t\right)$$

The authors utilize the question vector $r^Q$ as the initial state of the pointer network, where $r^Q = \mathrm{att}(u^Q, V_r^Q)$ is an attention-pooling vector of the question based on the parameter $V_r^Q$:

$$s_j = w^T \tanh\!\left(W_u^Q u_j^Q + W_v^Q V_r^Q\right)$$
$$a_i = \exp(s_i) \Big/ \sum_{j=1}^{m} \exp(s_j)$$
$$r^Q = \sum_{i=1}^{m} a_i u_i^Q$$
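A sketch of this output layer, assuming the pointer network runs for exactly two steps (start, then end) and is initialized with the question-pooled vector $r^Q$ (names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerNet(nn.Module):
    """Predicts start/end distributions over passage positions with a 2-step GRU pointer."""
    def __init__(self, p_dim, q_dim, hidden):
        super().__init__()
        self.W_uQ = nn.Linear(q_dim, hidden, bias=False)
        self.W_vQ = nn.Linear(hidden, hidden, bias=False)
        self.VrQ = nn.Parameter(torch.randn(1, hidden))   # learned parameter V_r^Q
        self.W_hP = nn.Linear(p_dim, hidden, bias=False)
        self.W_ha = nn.Linear(q_dim, hidden, bias=False)
        self.w_q = nn.Linear(hidden, 1, bias=False)
        self.w_p = nn.Linear(hidden, 1, bias=False)
        self.rnn = nn.GRUCell(p_dim, q_dim)

    def forward(self, hP, uQ):
        # hP: (b, n, p_dim), uQ: (b, m, q_dim)
        # initial state r^Q: attention-pool the question against V_r^Q
        s = self.w_q(torch.tanh(self.W_uQ(uQ) + self.W_vQ(self.VrQ))).squeeze(-1)  # (b, m)
        rQ = torch.bmm(F.softmax(s, dim=-1).unsqueeze(1), uQ).squeeze(1)           # (b, q_dim)

        h_a, logits = rQ, []
        for _ in range(2):                                    # step 0: start, step 1: end
            s = self.w_p(torch.tanh(self.W_hP(hP) + self.W_ha(h_a).unsqueeze(1))).squeeze(-1)
            logits.append(s)                                  # (b, n) scores over positions
            a = F.softmax(s, dim=-1)
            c = torch.bmm(a.unsqueeze(1), hP).squeeze(1)      # attention-pooled passage vector
            h_a = self.rnn(c, h_a)                            # update pointer state
        return logits[0], logits[1]                           # start and end logits
```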

Objective Function

To train the network, minimize the objective function:

$$J = -\left( \sum_{i=1}^{n} \mathbf{1}\{p^1 = i\}\log a_i^1 + \sum_{i=1}^{n} \mathbf{1}\{p^2 = i\}\log a_i^2 \right)$$
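In code this is simply the sum of two cross-entropy losses over the predicted start and end distributions; a minimal sketch using the hypothetical logits returned by the pointer sketch above:

```python
import torch.nn.functional as F

def span_loss(start_logits, end_logits, p1, p2):
    """Negative log-likelihood of the gold start (p1) and end (p2) positions."""
    return F.cross_entropy(start_logits, p1) + F.cross_entropy(end_logits, p2)
```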

Implementation Details

  1. Use the tokenizer from Stanford CoreNLP to preprocess each passage and question
  2. Use the Gated Recurrent Unit
  3. Use GloVe embeddings for questions and passages and fix embeddings
  4. Use zero vectors to represent all out-of-vocabulary words
  5. Use 1 layer of bi-directional GRU to compute character-level embeddings and 3 layers of bi-directional GRU to encode questions and passages
  6. Use bi-directional gated attention-based recurrent network
  7. Set hidden vector length to 75 for all layers
  8. Set hidden size to 75 for attention scores
  9. Set dropout rate to 0.2
  10. Use AdaDelta (an initial learning rate of 1, a decay rate ρ of 0.95, and a constant ε of 1e-6); see the optimizer sketch below the list
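For reference, the optimizer setting in item 10 maps directly onto PyTorch's built-in AdaDelta (a sketch; `model` stands for whatever module collects the layers above):

```python
import torch

optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.95, eps=1e-6)
```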

Reposted from: https://blog.youkuaiyun.com/seanliu96/article/details/79381810