Orderless Recurrent Models for Multi-label Classification (Paper PDF)
Introduction
Multi-label classification is the task of assigning a wide range of visual concepts (labels) to images. The large variety of concepts and the uncertain relations among them make this a very challenging task. RNNs have demonstrated good performance in many tasks that require processing variable-length sequential data, including multi-label classification, and this kind of model naturally incorporates the relation patterns among labels into the training process. However, since RNNs produce sequential outputs, the labels need to be ordered for the multi-label classification task.
Several recent works have tried to address this issue by imposing an arbitrary, but consistent, ordering on the ground-truth label sequences. Although this alleviates the problem, it falls short of solving it, and many of the original issues remain. For example, in an image that features a clearly visible and prominent dog, the LSTM may choose to predict that label first, since the evidence for it is very strong. However, if dog does not happen to be the first label in the chosen ordering, the network will be penalized for that output, and then penalized again for not predicting dog at the "correct" step of the ground-truth sequence. As a result, training can become very slow.
In this paper, we propose ways to dynamically align the order of the ground-truth labels with the predicted label sequence. There are two ways of doing this: predicted label alignment (PLA) and minimal loss alignment (MLA). We empirically show that these approaches lead to faster training and also eliminate other nuisances such as repeated labels in the predicted sequence.
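To make the alignment idea concrete, the sketch below reorders a set of ground-truth labels so that the total loss over the predicted sequence is minimized. The use of a cross-entropy cost matrix and of `scipy.optimize.linear_sum_assignment` (a Hungarian-style solver) is my assumption for illustration; the paper's exact formulation of MLA/PLA may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def minimal_loss_alignment(pred_logits, gt_labels):
    """Reorder ground-truth labels so the summed cross-entropy over the
    predicted sequence is minimal (an MLA-style alignment sketch).

    pred_logits: (T, C) array of per-time-step class scores.
    gt_labels:   list of ground-truth label indices (len <= T).
    Returns a length-T target sequence with the aligned labels.
    """
    # Log-softmax over classes for each time step (numerically stable).
    logits = pred_logits - pred_logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # cost[t, j] = cross-entropy if ground-truth label j is placed at time step t.
    cost = -log_probs[:, gt_labels]                 # shape (T, len(gt_labels))

    # Assignment: each ground-truth label is matched to exactly one time step.
    rows, cols = linear_sum_assignment(cost)

    # Unassigned steps are left as None here; in practice they would be
    # an <end>/padding token.
    ordered = {t: gt_labels[j] for t, j in zip(rows, cols)}
    return [ordered.get(t) for t in range(pred_logits.shape[0])]

# Tiny usage example: 3 time steps, 5 classes, two ground-truth labels.
np.random.seed(0)
print(minimal_loss_alignment(np.random.randn(3, 5), [4, 1]))
```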
Innovation
- Orderless recurrent models with minimal loss alignment (MLA) and predicted label alignment (PLA)
Method
Image-to-sequence model
This type of model consists of a CNN (encoder) part that extracts a compact visual representation from the image, and an RNN (decoder) part that uses this encoding to generate a sequence of labels, modeling the label dependencies.
Linearized activations from the fourth convolutional layer are used as input to the attention module, together with the hidden state of the LSTM at each time step, so that the attention module focuses on different parts of the image at every step. The attention-weighted features are then concatenated with the word embedding of the class predicted in the previous time step and given to the LSTM as input for the current time step.
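A minimal PyTorch sketch of this input construction is shown below. The additive (Show, Attend and Tell style) attention scoring, the layer names, and the tensor dimensions are my assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AttentionInput(nn.Module):
    """Builds the LSTM input for one time step: attention-weighted image
    features concatenated with the embedding of the previously predicted label."""

    def __init__(self, feat_dim, hidden_dim, embed_dim, num_classes, attn_dim=512):
        super().__init__()
        self.embed = nn.Embedding(num_classes, embed_dim)   # word embedding E
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hid_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, h_prev, prev_label):
        # feats:      (B, N, feat_dim) linearized convolutional activations
        # h_prev:     (B, hidden_dim)  LSTM hidden state from the previous step
        # prev_label: (B,)             index of the label predicted at t-1
        e = torch.tanh(self.feat_proj(feats) + self.hid_proj(h_prev).unsqueeze(1))
        alpha = torch.softmax(self.score(e).squeeze(-1), dim=1)    # (B, N) attention weights
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)         # attention-weighted features
        x_t = torch.cat([context, self.embed(prev_label)], dim=1)  # LSTM input at step t
        return x_t, alpha
```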
The predictions for the current time step $t$ are computed in the following way:
$$
\begin{aligned}
x_t &= E \cdot \hat{l}_{t-1} \\
h_t &= \mathrm{LSTM}(x_t, h_{t-1}, c_{t-1}) \\
p_t &= W \cdot h_t + b
\end{aligned}
$$
where $E$ is a word embedding matrix and $\hat{l}_{t-1}$ is the index of the label predicted in the previous time step. $h_{t-1}$ and $c_{t-1}$ are the hidden and cell states of the previous LSTM unit, and $p_t$ denotes the prediction vector for the current time step.
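These equations map directly onto an embedding lookup, an `nn.LSTMCell` update, and a linear projection. The sketch below is a per-time-step version under that reading; the chosen dimensions are illustrative, and the attention context described earlier is omitted to keep the code aligned with the formulas.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One decoding step: x_t = E·l_{t-1}, h_t = LSTM(x_t, h_{t-1}, c_{t-1}),
    p_t = W·h_t + b."""

    def __init__(self, num_classes, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.E = nn.Embedding(num_classes, embed_dim)    # word embedding matrix E
        self.lstm = nn.LSTMCell(embed_dim, hidden_dim)   # recurrent cell
        self.W = nn.Linear(hidden_dim, num_classes)      # W·h_t + b

    def forward(self, prev_label, h_prev, c_prev):
        x_t = self.E(prev_label)                    # (B, embed_dim)
        h_t, c_t = self.lstm(x_t, (h_prev, c_prev)) # updated hidden and cell states
        p_t = self.W(h_t)                           # unnormalized prediction vector
        return p_t, h_t, c_t
```

At inference time, `prev_label = p_t.argmax(dim=1)` would be fed back in as the input for the next step.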