Reading Note: Gated Self-Matching Networks for Reading Comprehension and Question Answering

This note introduces R-Net, a reading-comprehension-style question-answering model that finds answers in a given passage through four components: encoding the question and passage, matching the question against the passage, a self-matching mechanism, and a pointer network that predicts the answer boundary. The model achieves state-of-the-art results on several benchmark tasks.

Abstract

The authors present gated self-matching networks for reading-comprehension-style question answering, which aims to answer questions from a given passage.

First, match the question and passage with gated attention-based recurrent networks to obtain the question-aware passage representation.
Then, utilize a self-matching attention mechanism to refine the representation by matching the passage against itself.
Finally, employ pointer networks to locate the positions of the answers in the passage.

Introduction

This model (R-Net) consists of four parts:
1. the recurrent network encoder (to build representations for the question and passage separately)
2. the gated matching layer (to match the question and passage)
3. the self-matching layer (to aggregate information from the whole passage)
4. the pointer network layer (to predict the answer boundary)

Three-fold key contributions:
1. propose a gated attention-based recurrent network, assigning different levels of importance to passage parts depending on their relevance to the question
2. introduce a self-matching mechanism, effectively aggregating evidence from the whole passage to infer the answer and dynamically refining passage representation with information from the whole passage
3. yield state-of-the-art results against strong baselines

Task Description

Given a passage $P$ and a question $Q$, predict an answer $A$ to $Q$ based on information in $P$.

Methods

[Figure: Gated Self-Matching Networks (R-NET) structure overview]

Question and Passage Encoder

Consider a question $Q = \{w_t^Q\}_{t=1}^m$ and a passage $P = \{w_t^P\}_{t=1}^n$. First, convert the words to word-level embeddings ($\{e_t^Q\}_{t=1}^m$ and $\{e_t^P\}_{t=1}^n$) and character-level embeddings ($\{c_t^Q\}_{t=1}^m$ and $\{c_t^P\}_{t=1}^n$). The character-level embeddings are generated by taking the final hidden states of a bi-directional recurrent neural network applied to the embeddings of the characters in the token. Such character-level embeddings have been shown to be helpful for dealing with out-of-vocabulary tokens.

Then use a bi-directional RNN to produce new representations $\{u_t^Q\}_{t=1}^m$ and $\{u_t^P\}_{t=1}^n$:

$$u_t^Q = \mathrm{BiRNN}_Q\!\left(u_{t-1}^Q, [e_t^Q, c_t^Q]\right)$$
$$u_t^P = \mathrm{BiRNN}_P\!\left(u_{t-1}^P, [e_t^P, c_t^P]\right)$$

Here, the Gated Recurrent Unit (GRU) is used because it is computationally cheaper.
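As a concrete reference, here is a minimal PyTorch sketch of this encoder, assuming single-layer GRUs and hypothetical dimensions (the paper stacks three BiGRU layers for the question/passage encoder; the class and variable names are illustrative, not from the authors' code):

```python
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    """Sketch: char-level BiGRU embedding + word embedding, then a BiGRU over the sequence."""
    def __init__(self, word_vocab, char_vocab, word_dim=300, char_dim=64, hidden=75):
        super().__init__()
        self.word_emb = nn.Embedding(word_vocab, word_dim)   # GloVe vectors, kept fixed in the paper
        self.char_emb = nn.Embedding(char_vocab, char_dim)
        # character-level BiGRU; its final hidden states form the char embedding of a token
        self.char_rnn = nn.GRU(char_dim, hidden, bidirectional=True, batch_first=True)
        # sequence encoder over the concatenation [e_t, c_t]
        self.enc_rnn = nn.GRU(word_dim + 2 * hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, words, chars):
        # words: (batch, seq_len); chars: (batch, seq_len, word_len)
        b, t, wl = chars.shape
        c = self.char_emb(chars).view(b * t, wl, -1)
        _, h_n = self.char_rnn(c)                              # h_n: (2, b*t, hidden)
        char_vec = torch.cat([h_n[0], h_n[1]], dim=-1).view(b, t, -1)
        x = torch.cat([self.word_emb(words), char_vec], dim=-1)
        u, _ = self.enc_rnn(x)                                 # u: (b, t, 2*hidden), i.e. {u_t}
        return u
```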

Gated Attention-based Recurrent Networks

Utilize a gated attention-based recurrent network (a variant of attention-based recurrent networks) to incorporate question information into passage representation.

Given $\{u_t^Q\}_{t=1}^m$ and $\{u_t^P\}_{t=1}^n$, generate the question-aware passage representation $\{v_t^P\}_{t=1}^n$ via soft alignment of words:

$$v_t^P = \mathrm{RNN}\!\left(v_{t-1}^P, [u_t^P, c_t]^{*}\right)$$

where $[u_t^P, c_t]^{*}$ is the result of applying an additional gate to the RNN input $[u_t^P, c_t]$:

$$g_t = \mathrm{sigmoid}\!\left(W_g [u_t^P, c_t]\right)$$
$$[u_t^P, c_t]^{*} = g_t \odot [u_t^P, c_t]$$

and $c_t = \mathrm{att}(u^Q, [u_t^P, v_{t-1}^P])$ is an attention-pooling vector of the whole question $u^Q$ which focuses on the relation between the question and the current passage word:

$$s_j^t = w^T \tanh\!\left(W_u^Q u_j^Q + W_u^P u_t^P + W_v^P v_{t-1}^P\right)$$
$$a_i^t = \exp(s_i^t) \Big/ \sum_{j=1}^{m} \exp(s_j^t)$$
$$c_t = \sum_{i=1}^{m} a_i^t u_i^Q$$

where the vector $w$ and all matrices $W$ contain weights to be learned.
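A minimal PyTorch sketch of one step of this gated attention-based recurrent layer, under the assumption of a GRU cell and the encoder dimensions above (all names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionCell(nn.Module):
    """One step of v_t^P = RNN(v_{t-1}^P, g_t * [u_t^P, c_t])."""
    def __init__(self, enc_dim, hidden):
        super().__init__()
        self.W_uQ = nn.Linear(enc_dim, hidden, bias=False)
        self.W_uP = nn.Linear(enc_dim, hidden, bias=False)
        self.W_vP = nn.Linear(hidden, hidden, bias=False)
        self.w = nn.Linear(hidden, 1, bias=False)
        self.gate = nn.Linear(2 * enc_dim, 2 * enc_dim, bias=False)
        self.rnn = nn.GRUCell(2 * enc_dim, hidden)

    def forward(self, uQ, uP_t, v_prev):
        # uQ: (b, m, enc_dim), uP_t: (b, enc_dim), v_prev: (b, hidden)
        s = self.w(torch.tanh(self.W_uQ(uQ)
                              + self.W_uP(uP_t).unsqueeze(1)
                              + self.W_vP(v_prev).unsqueeze(1))).squeeze(-1)  # (b, m)
        a = F.softmax(s, dim=-1)
        c_t = torch.bmm(a.unsqueeze(1), uQ).squeeze(1)   # attention-pooled question vector
        x = torch.cat([uP_t, c_t], dim=-1)
        g = torch.sigmoid(self.gate(x))                  # input gate g_t
        return self.rnn(g * x, v_prev)                   # new v_t^P
```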

Self-matching Attention

The self-matching attention addresses the problem that the question-aware passage representation has only limited knowledge of context. It dynamically
1. collects evidence from the whole passage for each word
2. encodes the evidence relevant to the current passage word and its matching question information into the passage representation $h_t^P$:

$$h_t^P = \mathrm{RNN}\!\left(h_{t-1}^P, [v_t^P, c_t]^{*}\right)$$

where $[v_t^P, c_t]^{*}$ is the gated RNN input (the same gating as above applied to $[v_t^P, c_t]$), and $c_t = \mathrm{att}(v^P, v_t^P)$ is an attention-pooling vector of the whole passage which focuses on the relation between the current passage word and the rest of the passage:

$$s_j^t = w^T \tanh\!\left(W_v^P v_j^P + W_v^{\tilde P} v_t^P\right)$$
$$a_i^t = \exp(s_i^t) \Big/ \sum_{j=1}^{n} \exp(s_j^t)$$
$$c_t = \sum_{i=1}^{n} a_i^t v_i^P$$

where the vector $w$ and all matrices $W$ contain weights to be learned.

After the self-matching layer, the authors additionally apply a bi-directional GRU to deeply integrate the matching results before feeding them into the answer pointer layer. This helps to further propagate the information aggregated by self-matching across the passage.
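A sketch of the self-matching layer in the same spirit, computing the passage-against-passage attention for all positions in one batched step and then aggregating with the bi-directional GRU mentioned above (names and dimensions are assumptions, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfMatching(nn.Module):
    """Each passage word attends over the whole passage, then a BiGRU aggregates the result."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.W_vP = nn.Linear(dim, hidden, bias=False)        # projects v_j^P (keys)
        self.W_vP_tilde = nn.Linear(dim, hidden, bias=False)  # projects v_t^P (current word)
        self.w = nn.Linear(hidden, 1, bias=False)
        self.gate = nn.Linear(2 * dim, 2 * dim, bias=False)
        self.rnn = nn.GRU(2 * dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, vP):
        # vP: (b, n, dim); note the (b, n, n, hidden) score tensor is O(n^2) in memory
        s = self.w(torch.tanh(self.W_vP(vP).unsqueeze(1)          # keys:    (b, 1, n, h)
                              + self.W_vP_tilde(vP).unsqueeze(2)  # queries: (b, n, 1, h)
                              )).squeeze(-1)                      # scores:  (b, n, n)
        a = F.softmax(s, dim=-1)
        c = torch.bmm(a, vP)                                      # (b, n, dim), one c_t per word
        x = torch.cat([vP, c], dim=-1)
        x = torch.sigmoid(self.gate(x)) * x                       # gated input
        h, _ = self.rnn(x)                                        # (b, n, 2*hidden), i.e. {h_t^P}
        return h
```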

Output Layer

Use attention-pooling over the question representation to generate the initial hidden vector for the pointer network, which predicts the start and end positions of the answer.

Given a passage representation $\{h_t^P\}_{t=1}^n$, the attention mechanism is utilized as a pointer to select the start position $p^1$ and the end position $p^2$:

$$s_j^t = w^T \tanh\!\left(W_h^P h_j^P + W_h^a h_{t-1}^a\right)$$
$$a_i^t = \exp(s_i^t) \Big/ \sum_{j=1}^{n} \exp(s_j^t)$$
$$p^t = \arg\max\left(a_1^t, a_2^t, \ldots, a_n^t\right)$$

where $h_{t-1}^a$ represents the last hidden state of the pointer network (the answer recurrent network). The next hidden state $h_t^a$ is computed from the attention-pooling vector $c_t$ based on the current predicted probabilities $a^t$:

$$c_t = \sum_{i=1}^{n} a_i^t h_i^P$$
$$h_t^a = \mathrm{RNN}\!\left(h_{t-1}^a, c_t\right)$$

The authors utilize the question vector $r^Q$ as the initial state of the pointer network, where $r^Q = \mathrm{att}(u^Q, V_r^Q)$ is an attention-pooling vector of the question based on the parameter $V_r^Q$:

$$s_j = w^T \tanh\!\left(W_u^Q u_j^Q + W_v^Q V_r^Q\right)$$
$$a_i = \exp(s_i) \Big/ \sum_{j=1}^{m} \exp(s_j)$$
$$r^Q = \sum_{i=1}^{m} a_i u_i^Q$$
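A sketch of this output layer, assuming the pointer network runs for exactly two steps (start, then end) and is initialized with the question-pooled vector $r^Q$ (names are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointerNet(nn.Module):
    """Predicts start/end distributions over passage positions with a 2-step GRU pointer."""
    def __init__(self, p_dim, q_dim, hidden):
        super().__init__()
        self.W_uQ = nn.Linear(q_dim, hidden, bias=False)
        self.W_vQ = nn.Linear(hidden, hidden, bias=False)
        self.VrQ = nn.Parameter(torch.randn(1, hidden))   # learned parameter V_r^Q
        self.W_hP = nn.Linear(p_dim, hidden, bias=False)
        self.W_ha = nn.Linear(q_dim, hidden, bias=False)
        self.w_q = nn.Linear(hidden, 1, bias=False)
        self.w_p = nn.Linear(hidden, 1, bias=False)
        self.rnn = nn.GRUCell(p_dim, q_dim)

    def forward(self, hP, uQ):
        # hP: (b, n, p_dim), uQ: (b, m, q_dim)
        # initial state r^Q: attention-pool the question against V_r^Q
        s = self.w_q(torch.tanh(self.W_uQ(uQ) + self.W_vQ(self.VrQ))).squeeze(-1)  # (b, m)
        rQ = torch.bmm(F.softmax(s, dim=-1).unsqueeze(1), uQ).squeeze(1)           # (b, q_dim)

        h_a, logits = rQ, []
        for _ in range(2):                                    # step 0: start, step 1: end
            s = self.w_p(torch.tanh(self.W_hP(hP) + self.W_ha(h_a).unsqueeze(1))).squeeze(-1)
            logits.append(s)                                  # (b, n) scores over positions
            a = F.softmax(s, dim=-1)
            c = torch.bmm(a.unsqueeze(1), hP).squeeze(1)      # attention-pooled passage vector
            h_a = self.rnn(c, h_a)                            # update pointer state
        return logits[0], logits[1]                           # start and end logits
```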

Objective Function

To train the network, minimize the objective function:

$$J = -\left( \sum_{i=1}^{n} \mathbf{1}\{p^1 = i\}\log a_i^1 + \sum_{i=1}^{n} \mathbf{1}\{p^2 = i\}\log a_i^2 \right)$$
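In code this is simply the sum of two cross-entropy losses over the predicted start and end distributions; a minimal sketch using the hypothetical logits returned by the pointer sketch above:

```python
import torch.nn.functional as F

def span_loss(start_logits, end_logits, p1, p2):
    """Negative log-likelihood of the gold start (p1) and end (p2) positions."""
    return F.cross_entropy(start_logits, p1) + F.cross_entropy(end_logits, p2)
```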

Implementation Details

  1. Use the tokenizer from Stanford CoreNLP to preprocess each passage and question
  2. Use the Gated Recurrent Unit
  3. Use GloVe embeddings for questions and passages and fix embeddings
  4. Use zero vectors to represent all out-of-vocabulary words
  5. Use 1 layer of bi-directional GRU to compute character-level embeddings and 3 layers of bi-directional GRU to encode questions and passages
  6. Use bi-directional gated attention-based recurrent network
  7. Set hidden vector length to 75 for all layers
  8. Set hidden size to 75 for attention scores
  9. Set dropout rate to 0.2
  10. Use AdaDelta (an initial learning rate of 1, a decay rate ρ of 0.95, and a constant ε of 1e-6); see the optimizer sketch below the list
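For reference, the optimizer setting in item 10 maps directly onto PyTorch's built-in AdaDelta (a sketch; `model` stands for whatever module collects the layers above):

```python
import torch

optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0, rho=0.95, eps=1e-6)
```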

Reposted from: https://blog.youkuaiyun.com/seanliu96/article/details/79381810