HIERARCHY DECODER IS ALL YOU NEED TO TEXT CLASSIFICATION
GitHub
Paper Objective
Hierarchical text classification (HTC) suffers from data imbalance and hierarchical dependency, and existing approaches fall into two directions, local and global. The hierarchy decoder (HiDEC) is an encoder-decoder model built around recursive hierarchy decoding. The key idea of HiDEC involves decoding a context matrix into a sub-hierarchy sequence using recursive hierarchy decoding, while staying aware of hierarchical dependencies and level information.
Related Work
local
- Hdltex: Hierarchical deep learning for text classification
- Hierarchical transfer learning for multi-label text classification
- Large-scale hierarchical text classification with recursively regularized deep graph-cnn
global
- Hierarchy-aware global model for hierarchical text classification (graph neural networks, HiAGM)
- Cognitive structure learning model for hierarchical multi-label text classification
- Learning to learn and predict: A meta-learning approach for multi-label classification (meta-learning)
- Hierarchical text classification with reinforced label assignment (reinforcement learning)
Proposed HTC Model
HTC can be viewed as a graph problem: the label hierarchy is a DAG (directed acyclic graph) $G=(V,\hat E)$, where $V=\{v_1,...,v_C\}$ is the set of $C$ categories in the hierarchy and $\hat E=\{(v_i,v_j)\mid v_i\in V,\, v_j\in child(v_i)\}$ is the set of edges ($v_j$ is a child of $v_i$).
$D=\{d_1,...,d_K\}$ is a corpus of $K$ documents. Each document $d_k$ has a sub-hierarchy $G^{d_k}=(V^{d_k},\hat E^{d_k})$, where $V^{d_k}=L^{d_k}\cup\{v_i^{d_k}\mid v_i^{d_k}\in ancestor(v_j^{d_k}),\, v_j^{d_k}\in L^{d_k}\}$ and $\hat E^{d_k}=\{(v_i,v_j)\mid v_j\in V^{d_k},\, v_i\in parent(v_j)\}$, with $L^{d_k}=\{v_1^{d_k},v_2^{d_k},...,v_t^{d_k}\}$ being the label set of document $d_k$. In other words, $G^{d_k}$ consists of all labels of document $d_k$ together with their ancestors.
At initialization, $\hat G_0^{d_k}=(\{v_{root}\},\emptyset)$ contains only the root node and no edges. Recursive hierarchy decoding then expands it for $p$ steps until $\hat G_p^{d_k}=G^{d_k}$.
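To make the formulation concrete, here is a minimal sketch of extracting a document's sub-hierarchy (labels plus all ancestors), assuming the hierarchy is given as a child-to-parent map; the names `parent_of` and `sub_hierarchy` are mine, not the paper's.

```python
# Hypothetical parent map for a toy hierarchy rooted at "root".
parent_of = {"A": "root", "B": "root", "D": "A", "I": "D"}

def sub_hierarchy(labels):
    """Return (V^{d_k}, E^{d_k}): the document's labels plus all their
    ancestors, and the parent->child edges among those nodes."""
    nodes = set(labels)
    for v in labels:                       # walk up to the root from every label
        while v in parent_of:
            v = parent_of[v]
            nodes.add(v)
    edges = {(parent_of[v], v) for v in nodes if v in parent_of}
    return nodes, edges

# Example: a document labeled {I, B} yields the labels and their ancestors
# (set order may vary when printed).
print(sub_hierarchy({"I", "B"}))
# ({'root', 'A', 'D', 'I', 'B'}, {('root', 'A'), ('A', 'D'), ('D', 'I'), ('root', 'B')})
```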
Encoder
Any commonly used encoder can be plugged in here; the main contribution of the paper lies in the decoder.
The paper uses an SRU (simple recurrent unit).
Given the one-hot vectors of the token indices $T=[w_1,...,w_N]$, where $w_n$ is the one-hot vector for the index of the n-th token, the tokens are converted into word embeddings:
$H^0=W^0T\in\mathbb{R}^{N\times e}$
Each SRU layer runs in both directions, and the two outputs are concatenated and projected:
$\overleftarrow H^l=\overleftarrow{SRU}^l(H^{l-1})$
$\overrightarrow H^l=\overrightarrow{SRU}^l(H^{l-1})$
$H^l=W^l[\overleftarrow H^l,\overrightarrow H^l]+b^l$
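A minimal PyTorch sketch of such a bidirectional recurrent encoder, using `nn.GRU` as a stand-in for the SRU (the paper uses SRU; the module and variable names here are my own):

```python
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    """Token embedding followed by a bidirectional recurrent layer whose two
    directions are concatenated and projected back to dimension e."""
    def __init__(self, vocab_size, e=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, e)          # H^0 = W^0 T
        self.rnn = nn.GRU(e, e, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * e, e)                   # H^l = W^l[<-H^l, ->H^l] + b^l

    def forward(self, token_ids):                         # (batch, N)
        h0 = self.embed(token_ids)                        # (batch, N, e)
        h_bi, _ = self.rnn(h0)                            # (batch, N, 2e)
        return self.proj(h_bi)                            # context matrix H: (batch, N, e)

# Usage: H is the context matrix the hierarchy decoder attends over.
enc = BiRNNEncoder(vocab_size=30000, e=300)
H = enc(torch.randint(0, 30000, (2, 128)))                # (2, 128, 300)
```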
Hierarchy Decoder (HiDEC)
Hierarchy Embedding Layer
Three special tokens are used: "(", ")", and "[END]". "(" and ")" mark the start and end of a category path, while "[END]" indicates that the current path terminates.
With this strategy, the sub-hierarchy in the figure above is linearized into the sequence S=[(R(A(D(I([END]))))(B(F([END])))(C([END])))].
The tokens in S are represented as one-hot vectors $\hat S$ for further processing, then embedded and combined with level embeddings:
$\hat U^0=W^S\hat S$ (2)
$U^0=\text{level-embedding}(\hat U^0)$ (3)
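A minimal sketch of how such a sub-hierarchy might be linearized and embedded. It assumes a nested-dict representation of the sub-hierarchy; the helpers (`linearize`, `HierarchyEmbedding`) and the exact level assigned to the special tokens are my assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

def linearize(node, children, level=0, tokens=None, levels=None):
    """DFS over the sub-hierarchy, wrapping each node (and a terminal [END])
    in '(' ')' and recording the depth of every emitted token."""
    if tokens is None:
        tokens, levels = [], []
    tokens += ["(", node]; levels += [level, level]
    if children.get(node):
        for child in children[node]:
            linearize(child, children, level + 1, tokens, levels)
    else:                                   # leaf of the sub-hierarchy
        tokens += ["(", "[END]", ")"]; levels += [level + 1] * 3
    tokens.append(")"); levels.append(level)
    return tokens, levels

# Toy sub-hierarchy: R -> {A -> D -> I, B -> F, C}
children = {"R": ["A", "B", "C"], "A": ["D"], "D": ["I"], "B": ["F"]}
tokens, levels = linearize("R", children)
print("".join(tokens))   # (R(A(D(I([END]))))(B(F([END])))(C([END])))

class HierarchyEmbedding(nn.Module):
    """Token embedding plus a level embedding, as in Eqs. (2)-(3)."""
    def __init__(self, num_tokens, max_level, e=300):
        super().__init__()
        self.tok = nn.Embedding(num_tokens, e)    # \hat U^0 = W^S \hat S
        self.lvl = nn.Embedding(max_level, e)     # level-embedding

    def forward(self, token_ids, level_ids):
        return self.tok(token_ids) + self.lvl(level_ids)   # U^0
```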
Level-wise Masked Self-Attention
Level-wise masking keeps only the ancestor-descendant dependencies from the root down to the maximum level of the sub-hierarchy.
In the r-th decoder layer, the self-attention query, key, and value are computed as:
$Q=W_Q^r\,{U^{r-1}}^T$
$K=W_K^r\,{U^{r-1}}^T$ (4)
$V=W_V^r\,{U^{r-1}}^T$
The self-attention with level-wise masking is:
$\dot U^r=\text{Masked-Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt e}+M\right)V$ (5)
When two categories are not in an ancestor-descendant relation, $M_{ij}$ is set to $-1e9$; the relations between the three special tokens and the other tokens (including themselves) are also taken into account. The masking matrix $M$ is defined as:
$M_{ij}=\begin{cases}-1e9 & \text{if } v_i\notin ancestor(v_j)\\ 0 & \text{otherwise}\end{cases}$ (6)
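A minimal sketch of building such a mask over the linearized sequence. It assumes $ancestor(v_j)$ includes $v_j$ itself and that the special tokens are left unmasked, per the description above; the helper names are mine.

```python
import torch

NEG_INF = -1e9
SPECIAL = {"(", ")", "[END]"}

def level_wise_mask(tokens, ancestors):
    """Mask matrix M (Eq. 6): M[i, j] = 0 if v_i is an ancestor of v_j
    (taken here to include v_j itself) or if a special token is involved,
    and -1e9 otherwise. `ancestors` maps each node to its ancestor set."""
    n = len(tokens)
    M = torch.full((n, n), NEG_INF)
    for i, vi in enumerate(tokens):
        for j, vj in enumerate(tokens):
            if vi in SPECIAL or vj in SPECIAL:          # assumption: special tokens unmasked
                M[i, j] = 0.0
            elif vi == vj or vi in ancestors.get(vj, set()):
                M[i, j] = 0.0
    return M

# Toy path R -> A -> D: ancestors of D are {R, A}, of A are {R}.
ancestors = {"A": {"R"}, "D": {"R", "A"}, "R": set()}
tokens = ["(", "R", "(", "A", "(", "D", "(", "[END]", ")", ")", ")", ")"]
M = level_wise_mask(tokens, ancestors)   # added to QK^T / sqrt(e) before softmax
print(M.shape)                           # torch.Size([12, 12])
```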
Text-Hierarchy Attention
This corresponds to the cross-attention in the Transformer decoder.
In the r-th decoder layer, the cross-attention query, key, and value are computed as:
$Q=W_Q^r\,{\dot U^{r}}^T$
$K=W_K^r\,H^T$ (7)
$V=W_V^r\,H^T$
The cross-attention output is:
$\ddot U^r=\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^T}{\sqrt e}\right)V$ (8)
FFN
$U^r=\text{FeedForward}(\ddot U^r)$ (9)
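Putting Eqs. (4)-(9) together, here is a minimal single-head sketch of one decoder layer; residual connections, LayerNorm, and multi-head details are omitted and the class/attribute names are my assumptions, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiDECLayer(nn.Module):
    """One decoder layer as sketched by Eqs. (4)-(9): level-wise masked
    self-attention over the hierarchy sequence U, cross-attention from the
    hierarchy onto the text context H, then a position-wise feed-forward block."""
    def __init__(self, e=300):
        super().__init__()
        self.q1, self.k1, self.v1 = (nn.Linear(e, e, bias=False) for _ in range(3))
        self.q2, self.k2, self.v2 = (nn.Linear(e, e, bias=False) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(e, 4 * e), nn.ReLU(), nn.Linear(4 * e, e))
        self.scale = e ** 0.5

    def forward(self, U, H, M):
        # Eqs. (4)-(5): masked self-attention over the sub-hierarchy sequence.
        q, k, v = self.q1(U), self.k1(U), self.v1(U)
        U_dot = F.softmax(q @ k.transpose(-2, -1) / self.scale + M, dim=-1) @ v
        # Eqs. (7)-(8): text-hierarchy (cross) attention onto the context matrix H.
        q, k, v = self.q2(U_dot), self.k2(H), self.v2(H)
        U_ddot = F.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1) @ v
        # Eq. (9): position-wise feed-forward network.
        return self.ffn(U_ddot)

# Usage with toy shapes: 12 hierarchy tokens, 128 text tokens, e = 300.
layer = HiDECLayer(e=300)
U, H, M = torch.randn(1, 12, 300), torch.randn(1, 128, 300), torch.zeros(12, 12)
U_next = layer(U, H, M)     # (1, 12, 300)
```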
Sub-hierarchy Expansion
Training uses cross-entropy (CE) and binary cross-entropy (BCE) losses.
Experiments
Dataset
- RCV1-v2
- Web-of-Science (WOS)
Evaluation metrics: Micro-F1 and Macro-F1
Performances
Two encoders are tried: SRU and TextRCNN. Global methods outperform local ones, and the strongest competing baselines remain HiAGM-GCN and HCSM (I cannot download the HCSM paper; could someone download it and send it to me?).
Analysis
The analysis covers model complexity (HiDEC's is low), interpretation of attention (the self-attention scores capture hierarchical dependencies; the text-hierarchy attention scores capture word-category relations), and the label embedding space (HiAGM's labels cluster fully; HiDEC's also cluster, though less spread out than HiAGM's).
Reading Impressions
Written on 2022-01-12. If the authors want to release their code, why not release it properly? Posting a repository URL with nothing in it is pointless.
HiAGM really holds up: a 2020 paper that is still the best performer in 2022, and its authors were generous enough to open-source it with a low reproduction cost.