Document Intelligence: OCR + RocketQA + LayoutXLM ＜LayoutLMv2＞

Article link: https://blog.youkuaiyun.com/qq_42563807/article/details/142103241

For RocketQA, see the companion post:
Document Intelligence: OCR + RocketQA + LayoutXLM ＜RocketQA＞

This post covers LayoutLMv2 first. My notes on the paper are as follows:

First, two terms:

VrDU $\to$ visually-rich document understanding tasks
the text fields of interest: analogous to the region of interest (ROI) in image recognition. An AI assistant explained the term as follows:

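In document intelligence, a system that processes and analyzes documents needs to identify and extract the key information in them automatically. This information usually appears as text fields, which are referred to as the "text fields of interest".

These fields matter because they carry the core content of the document, which supports fast retrieval, classification, summarization, and similar uses.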

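Note that LayoutLM is a pre-trained model: it fuses image, text, and layout information across modalities in order to improve the model's understanding of documents.

Because it is pre-trained on large-scale document datasets, it can better capture the semantics and structure of documents.

Actually applying it to visual question answering depends on the specific downstream task;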

We select six publicly available benchmark datasets as the downstream tasks to evaluate the performance of the pre-trained LayoutLMv2 model, which are:

the FUNSD dataset (Jaume et al., 2019) for form understanding,
the CORD dataset (Park et al., 2019) and the SROIE dataset (Huang et al., 2019) for receipt understanding,
the Kleister-NDA dataset for long document understanding with a complex layout,
the RVL-CDIP dataset (Harley et al., 2015) for document image classification,
the DocVQA dataset (Mathew et al., 2021) for visual question answering on document images.

Distinct from conventional information extraction tasks, the VrDU task relies on not only textual information but also visual and layout information that is vital for visually-rich documents.

To accurately recognize the text fields of interest, it is inevitable to take advantage of the cross-modality nature of visually-rich documents, where the textual, visual, and layout information should be jointly modeled and learned end-to-end in a single framework.

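The authors note that recent work on VrDU tasks mainly follows two directions:

The first direction is usually built on shallow fusion between textual and visual/layout/style information. These approaches use pre-trained NLP and CV models separately and combine the information from the different modalities for supervised learning.

Although good performance has been achieved, the domain knowledge of one document type cannot easily be transferred to another, so these models usually have to be re-trained once the document type changes.

The second direction relies on deep fusion among the textual, visual, and layout information of a large number of unlabeled documents from different domains; the pre-trained models absorb cross-modal knowledge from different document types, so the local invariance among these layouts and styles is preserved.
In addition, when the model has to be transferred to another domain with different document formats, a few labeled samples are enough to fine-tune the generic model;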

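My own reading: with textual, visual, and layout information inside one model, the features of the three modalities can interact. For example, in a visual question answering task, an attention mechanism can align the keywords of the question with local regions of the image, which is how local invariance between text and image is achieved.

In the pre-training stage, LayoutLMv2 uses a Transformer to learn the cross-modal interaction between visual and textual information, so visual information is integrated into the model;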

In addition, inspired by the 1-D relative position representations, we propose the spatial-aware self-attention mechanism for LayoutLMv2, which involves a 2-D relative position representation for token pairs.

For the pre-training strategies, we use two new training objectives for LayoutLMv2 in addition to the masked visual-language modeling.

The first is the proposed text-image alignment strategy, which aligns the textlines and the corresponding image regions.

The second is the text-image matching strategy, where the model learns whether the document image and textual content are correlated.


Section 2 of the paper, which describes the model, starts with three components: Text Embedding, Visual Embedding, and Layout Embedding;

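1. Text Embedding:

Besides tokenizing the text, start and end markers are added ([CLS] at the beginning of the sequence and [SEP] at the end of each text segment); a 1-D positional embedding is used, together with a segment $s_i \in \{[A], [B]\}$.

Here, the segment embedding is used to distinguish different text segments.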
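In the paper's notation, the i-th text embedding is the sum of the token, 1-D position, and segment embeddings:
$$t_i = \mathrm{TokEmb}(w_i) + \mathrm{PosEmb1D}(i) + \mathrm{SegEmb}(s_i), \quad 0 \le i < L$$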

Note that the maximum sequence length is fixed to L:
Extra [PAD] tokens are appended to the end so that the final sequence’s length is exactly the maximum sequence length L.
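
The text-side input described above can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function name `build_text_inputs`, the attention mask, and the choice of assigning [PAD] tokens to segment [A] are my own assumptions.

```python
def build_text_inputs(seg_a, seg_b=None, max_len=16):
    """Build tokens, segment labels, 1-D positions and a padding mask.

    seg_a / seg_b are lists of already-tokenized words (hypothetical input).
    """
    # [CLS] opens the sequence, [SEP] closes each text segment.
    tokens = ["[CLS]"] + list(seg_a) + ["[SEP]"]
    segments = ["[A]"] * len(tokens)
    if seg_b:
        tokens += list(seg_b) + ["[SEP]"]
        segments += ["[B]"] * (len(seg_b) + 1)
    if len(tokens) > max_len:
        raise ValueError("sequence is longer than max_len; truncate it first")

    # Append [PAD] so the final length is exactly max_len (the paper's L).
    n_real = len(tokens)
    n_pad = max_len - n_real
    tokens += ["[PAD]"] * n_pad
    segments += ["[A]"] * n_pad            # assumption: pads reuse segment [A]
    mask = [1] * n_real + [0] * n_pad      # 1 = real token, 0 = padding
    positions = list(range(max_len))       # indices for the 1-D position embedding
    return tokens, segments, positions, mask


tokens, segments, positions, mask = build_text_inputs(
    ["total", "amount"], ["12", ".", "50"], max_len=10
)
print(tokens)
print(segments)
```

Each of the three lists would then index its own embedding table, and the three looked-up vectors are summed per position.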

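2. Visual Embedding:

The page image goes through a ResNeXt-FPN backbone; the output feature map is average-pooled to a fixed size W × H and then flattened, which gives the W × H visual tokens VisTokEmb(I);
A linear projection layer then brings each visual token embedding to the same dimensionality as the text embeddings;
Likewise, a 1-D positional embedding is added: the 1D positional embedding is shared with the text embedding layer.
Likewise, for the segment embedding, we attach all visual tokens to the visual segment [C].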

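In the paper's notation, the i-th visual embedding is:
$$v_i = \mathrm{Proj}\left(\mathrm{VisTokEmb}(I)_i\right) + \mathrm{PosEmb1D}(i) + \mathrm{SegEmb}([C]), \quad 0 \le i < WH$$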
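
To make the "pool, then flatten" step concrete, here is a toy sketch in plain Python. It is not the real backbone: a single number per grid cell stands in for a full channel vector, the helper names are hypothetical, and the row-major flattening order is my assumption.

```python
def avg_pool_grid(feature_map, out_h, out_w):
    """Average-pool a 2-D grid (list of rows) down to out_h x out_w.

    Assumes the input height and width are multiples of the output size.
    """
    in_h, in_w = len(feature_map), len(feature_map[0])
    kh, kw = in_h // out_h, in_w // out_w
    pooled = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            window = [
                feature_map[r * kh + i][c * kw + j]
                for i in range(kh)
                for j in range(kw)
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def flatten_to_visual_tokens(pooled):
    """Flatten the pooled grid row by row into a token sequence.

    Every visual token gets a 1-D position index and the visual segment [C].
    """
    flat = [value for row in pooled for value in row]
    return [
        {"feature": value, "position": i, "segment": "[C]"}
        for i, value in enumerate(flat)
    ]


# A 4 x 4 toy "feature map" pooled to W x H = 2 x 2, i.e. 4 visual tokens.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
pooled = avg_pool_grid(feature_map, 2, 2)
visual_tokens = flatten_to_visual_tokens(pooled)
print(pooled)              # [[2.5, 4.5], [10.5, 12.5]]
print(len(visual_tokens))  # 4
```

In the real model each `feature` would be a vector that is linearly projected to the text embedding size before the position and segment embeddings are added.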

3. Layout Embedding:

The layout embedding layer embeds the spatial layout information represented by axis-aligned token bounding boxes from the OCR results, in which box width and height together with corner coordinates are identified.
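In the paper's notation, the layout embedding of the i-th token concatenates the embeddings of the x-axis and y-axis features:
$$l_i = \mathrm{Concat}\left(\mathrm{PosEmb2D}_x(x_{\min}, x_{\max}, \mathit{width}),\ \mathrm{PosEmb2D}_y(y_{\min}, y_{\max}, \mathit{height})\right), \quad 0 \le i < WH + L$$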
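
A small sketch of how an OCR box could be turned into the six integers that feed the two 2-D position embeddings. The paper normalizes and discretizes coordinates to integers in the range [0, 1000]; the function name, the pixel-box input format, and the truncating `int()` rounding are my own assumptions.

```python
def layout_features(box, page_width, page_height, scale=1000):
    """Convert a pixel box (x0, y0, x1, y1) into normalized layout features.

    Returns the x-axis triple (x_min, x_max, width) and the y-axis triple
    (y_min, y_max, height), all as integers in [0, scale].
    """
    x0, y0, x1, y1 = box
    x_min = int(scale * x0 / page_width)
    x_max = int(scale * x1 / page_width)
    y_min = int(scale * y0 / page_height)
    y_max = int(scale * y1 / page_height)
    return {
        "x": (x_min, x_max, x_max - x_min),
        "y": (y_min, y_max, y_max - y_min),
    }


# One word box on a 1000 x 500 pixel page.
features = layout_features((100, 50, 300, 100), page_width=1000, page_height=500)
print(features)  # {'x': (100, 300, 200), 'y': (100, 200, 100)}
```

Normalizing by the page size keeps the features comparable across scans of different resolutions.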

4. Multi-modal Encoder with Spatial-Aware Self-Attention Mechanism

The encoder concatenates visual embeddings $\{v_0, \ldots, v_{WH-1}\}$ and text embeddings $\{t_0, \ldots, t_{L-1}\}$ to a unified sequence,

and fuses spatial information by adding the layout embeddings to get the i-th ($0 \le i < WH + L$) first layer input:
$$\mathbf{x}_i^{(0)} = X_i + l_i, \quad \text{where } X = \{v_0, \ldots, v_{WH-1}, t_0, \ldots, t_{L-1}\}$$

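Next, to bring in relative rather than absolute positions, bias terms b are added to the attention scores inside the Transformer attention mechanism, before the softmax. The original self-attention score between query $\mathbf{x}_i$ and key $\mathbf{x}_j$ (hidden vectors, not the coordinates used below) is:
$$\alpha_{ij} = \frac{1}{\sqrt{d_{head}}} \left(\mathbf{x}_i W^Q\right) \left(\mathbf{x}_j W^K\right)^{\mathsf{T}}$$
We model the semantic relative position and spatial relative position as bias terms to prevent adding too many parameters.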

Let $b^{(1D)}$ , $b^{(2D_x)}$ and $b^{(2D_y)}$ denote the learnable 1D and 2D relative position biases respectively.

Assuming ( $x_i$ , $y_i$ ) anchors the top left corner coordinates of the i-th bounding box, we obtain the spatial-aware attention score:
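$$\alpha'_{ij} = \alpha_{ij} + b^{(1D)}_{j-i} + b^{(2D_x)}_{x_j - x_i} + b^{(2D_y)}_{y_j - y_i}$$
Here $\alpha_{ij}$ is the original scaled dot-product attention score between tokens i and j, and the softmax is applied to $\alpha'_{ij}$.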
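
The spatial-aware attention score can be sketched for a single head in plain Python. This is a simplified illustration under my own assumptions: relative offsets are simply clipped to `[-max_dist, max_dist]` and used as table indices (real implementations typically bucket the offsets instead), and the function names are hypothetical.

```python
import math


def clip(delta, max_dist):
    """Clamp a relative offset so it fits the bias table."""
    return max(-max_dist, min(max_dist, delta))


def spatial_aware_scores(queries, keys, corners, b_1d, b_2dx, b_2dy, max_dist):
    """Scaled dot-product scores plus 1-D and 2-D relative position biases.

    queries / keys: per-token projected vectors (lists of floats).
    corners: per-token (x, y) top-left box coordinates.
    b_1d, b_2dx, b_2dy: bias tables of length 2 * max_dist + 1.
    """
    d_head = len(queries[0])
    n = len(queries)
    scores = []
    for i in range(n):
        row = []
        for j in range(n):
            alpha = sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d_head)
            alpha += b_1d[clip(j - i, max_dist) + max_dist]
            alpha += b_2dx[clip(corners[j][0] - corners[i][0], max_dist) + max_dist]
            alpha += b_2dy[clip(corners[j][1] - corners[i][1], max_dist) + max_dist]
            row.append(alpha)
        scores.append(row)
    return scores


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


max_dist = 4
b_1d = [0.0] * (2 * max_dist + 1)
b_2dx = [0.0] * (2 * max_dist + 1)
b_2dy = [0.0] * (2 * max_dist + 1)
b_1d[max_dist + 1] = 0.5      # bias for "the key is the next token"
b_2dx[2 * max_dist] = 0.25    # bias for "the key is far to the right"

q = k = [[1.0, 0.0], [0.0, 1.0]]
corners = [(0, 0), (5, 0)]
scores = spatial_aware_scores(q, k, corners, b_1d, b_2dx, b_2dy, max_dist)
weights = [softmax(row) for row in scores]
print(scores[0][1])  # 0.75: zero dot product plus the two biases
```

Because the biases are added before the softmax, they shift the attention weights without touching the value vectors.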

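About the bias terms here:
In deep learning and computer vision, bias terms are usually learned and optimized together with the other parameters of the model (such as the weights), but they do not correspond directly to continuous features or positions of the input data.

Instead, a bias term is part of the model parameters and is used to adjust the output of an activation function or the scores of an attention mechanism, which gives the model extra flexibility.

For tasks that carry spatial position information (such as object detection in images, or positional encoding in NLP), we may want to feed that position information into the model in some way.
Spatial positions cover a large, effectively continuous range (for example, pixel coordinates in an image), whereas a learnable bias table has only a finite number of entries, so we need a way to map the positions onto a discrete set of parameters.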

The biases are different among attention heads but shared in all encoder layers.

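The biases are different among attention heads:

This means every attention head has its own bias terms. In a model based on multi-head attention, several sets of attention weights are computed in parallel, and each set is called a "head". Since each head may attend to different parts or features of the input, giving each head its own biases helps the model capture and tell apart these different kinds of information.

but shared in all encoder layers:

Although every attention head has its own bias terms, they are shared across all encoder layers. In a model such as the Transformer, the encoder usually consists of several stacked layers, each containing an attention mechanism and other components. The sentence means that a given attention head uses the same biases in every encoder layer. This design reduces the number of parameters and may also promote information flow and consistency across layers.

In short, in a multi-head attention model each head has its own bias terms, but these are shared across all encoder layers of the model. The design balances expressive power against parameter efficiency.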
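
A quick back-of-the-envelope count shows what the sharing saves. The numbers below (12 heads, 12 layers, 32 one-dimensional and 64 two-dimensional relative position buckets) are illustrative values I picked, and the function is a hypothetical helper, not part of any library.

```python
def bias_parameter_count(num_heads, buckets_1d, buckets_2d, num_layers,
                         share_across_layers=True):
    """Count the learnable relative position biases.

    Each head owns one 1-D table plus two 2-D tables (x axis and y axis).
    """
    per_head = buckets_1d + 2 * buckets_2d
    total = num_heads * per_head
    return total if share_across_layers else total * num_layers


shared = bias_parameter_count(12, 32, 64, 12, share_across_layers=True)
unshared = bias_parameter_count(12, 32, 64, 12, share_across_layers=False)
print(shared)    # 1920
print(unshared)  # 23040
```

Under these assumed sizes, sharing across layers cuts the bias parameters by a factor equal to the number of layers.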

5. Masked Visual-Language Modeling

We randomly mask some text tokens and ask the model to recover the masked tokens.

Meanwhile, the layout information remains unchanged, which means the model knows each masked token’s location on the page.

The output representations of masked tokens from the encoder are fed into a classifier over the whole vocabulary, driven by a cross-entropy loss.

To avoid visual clue leakage, we mask image regions corresponding to masked tokens on the raw page image input before feeding it into the visual encoder.
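
The data preparation for this objective can be sketched as follows. This is a simplified illustration: the 15% masking rate is an assumed default, every selected token is replaced by [MASK] (the usual BERT-style random/keep variants are left out), and the function name is hypothetical.

```python
import random


def mask_for_mvlm(tokens, boxes, mask_prob=0.15, rng=None):
    """Pick text tokens to mask and collect the image regions to blank out.

    The boxes list is never modified: the model still sees where each
    masked token sits on the page.
    """
    if rng is None:
        rng = random.Random()
    special = {"[CLS]", "[SEP]", "[PAD]"}
    masked_tokens, labels, regions_to_blank = [], [], []
    for token, box in zip(tokens, boxes):
        if token not in special and rng.random() < mask_prob:
            masked_tokens.append("[MASK]")
            labels.append(token)            # target for the vocabulary classifier
            regions_to_blank.append(box)    # blank this area on the raw page image
        else:
            masked_tokens.append(token)
            labels.append(None)             # no loss on unmasked positions
    return masked_tokens, labels, regions_to_blank


tokens = ["[CLS]", "invoice", "total", "[SEP]", "[PAD]"]
boxes = [(0, 0, 0, 0), (10, 10, 80, 20), (90, 10, 130, 20), (0, 0, 0, 0), (0, 0, 0, 0)]
masked, labels, regions = mask_for_mvlm(tokens, boxes, rng=random.Random(0))
print(masked, labels, regions)
```

Blanking `regions` on the image before it enters the visual encoder is what prevents the model from simply reading the masked word off the pixels.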

6. Text-Image Alignment:

In the TIA task, some token lines are randomly selected, and their image regions are covered on the document image.

Note that it is the token lines that are selected; the corresponding image regions are then covered;

During pre-training, a classification layer is built above the encoder outputs.
This layer predicts a label for each text token depending on whether it is covered, i.e., [Covered] or [Not Covered], and computes the binary cross-entropy loss.

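Masked visual-language modeling focuses more on the language ability of the model, with the visual and layout information providing only implicit clues. For this reason, LayoutLM 2.0 proposes a fine-grained multi-modal alignment task, namely text-image alignment. The method randomly covers some of the text on the document image line by line, and uses the text-side outputs of the model for word-level binary classification, predicting whether each word is covered. The text-image alignment task helps the model align the position information of text and image.

(Quoted from another blog.)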
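
A minimal sketch of how the TIA labels could be produced. The line-level selection follows the description above; the 15% cover rate, the `token_line_ids` input format, and the function name are my own assumptions.

```python
import random


def tia_labels(token_line_ids, cover_prob=0.15, rng=None):
    """Choose text lines to cover and label every token accordingly.

    token_line_ids[i] is the index of the text line that token i belongs to.
    Returns the set of covered lines and one label per token.
    """
    if rng is None:
        rng = random.Random()
    lines = sorted(set(token_line_ids))
    covered_lines = {line for line in lines if rng.random() < cover_prob}
    labels = [
        "[Covered]" if line in covered_lines else "[Not Covered]"
        for line in token_line_ids
    ]
    return covered_lines, labels


# Six tokens spread over three text lines.
line_ids = [0, 0, 1, 1, 1, 2]
covered, labels = tia_labels(line_ids, cover_prob=0.5, rng=random.Random(1))
print(covered, labels)
```

The image regions of the lines in `covered` would be painted over on the page image, and the classifier on top of the text outputs is trained against `labels`.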

7. Text-Image Matching

We feed the output representation at [CLS] into a classifier to predict whether the image and text are from the same document page.
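
Training this classifier needs both matched and mismatched pairs. One way to build them is sketched below; the replace/drop probabilities (15% and 5%) are assumed values, and the function and field names are hypothetical.

```python
import random


def make_tim_example(page_id, all_page_ids, replace_prob=0.15, drop_prob=0.05, rng=None):
    """Pair a page's text with an image and return the matching label.

    label 1: the image comes from the same page as the text.
    label 0: the image was swapped for another page's image, or dropped.
    """
    if rng is None:
        rng = random.Random()
    r = rng.random()
    if r < replace_prob:
        others = [p for p in all_page_ids if p != page_id]
        if not others:
            raise ValueError("need at least two pages to build a negative pair")
        return {"text_page": page_id, "image_page": rng.choice(others), "label": 0}
    if r < replace_prob + drop_prob:
        return {"text_page": page_id, "image_page": None, "label": 0}
    return {"text_page": page_id, "image_page": page_id, "label": 1}


pages = ["page_a", "page_b", "page_c"]
example = make_tim_example("page_a", pages, rng=random.Random(0))
print(example)
```

The [CLS] output for each pair would then be classified against `label` with a binary cross-entropy loss.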