Earlier posts covered several of the classic loss functions; this one adds three more that I came across recently.
Large Margin Cosine Loss
The basic idea behind this loss function is to move the loss computation from distance space to angular space.
Euclidean space (distance space) $\rightarrow$ cosine space (angular space)
First, a quick review of the softmax loss:
$$L_{s}=\frac{1}{N} \sum_{i=1}^{N}-\log p_{i}=\frac{1}{N} \sum_{i=1}^{N}-\log \frac{e^{f_{y_{i}}}}{\sum_{j=1}^{C} e^{f_{j}}}$$
- $N$: the number of training samples;
- $p_{i}$: the posterior probability that $x_{i}$ is classified correctly;
- $C$: the number of classes;
- $f_{j}$: the activation of the fully connected layer, $f_{j}=W_{j}^{T} x=\left\|W_{j}\right\|\|x\| \cos \theta_{j}$ (with the bias set to $B_{j}=0$).
To remove the influence of the radial direction, fix $\left\|W_{j}\right\|=1$ and $\|x\|=s$, so that $f_{j}=s \cos \theta_{j}$ depends only on the angle.
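As a minimal sketch of this normalization step (standard library only; the helper names `normalize`, `cosine_logits` and `softmax_loss` are made up for illustration and do not come from the paper), the following computes $f_j = s\cos\theta_j$ for a single sample and feeds it into the softmax loss:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def cosine_logits(weights, x, s):
    """Return f_j = s * cos(theta_j), normalizing every W_j and x first."""
    x_hat = normalize(x)
    return [s * sum(w * a for w, a in zip(normalize(w_j), x_hat))
            for w_j in weights]

def softmax_loss(logits, label):
    """-log p_label for one sample, using log-sum-exp for stability."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(f - peak) for f in logits))
    return log_sum - logits[label]

weights = [[3.0, 0.0], [0.0, 0.5]]  # class vectors with very different norms
x = [2.0, 2.0]                      # 45 degrees away from both of them
logits = cosine_logits(weights, x, s=10.0)
loss = softmax_loss(logits, label=0)
```

Without normalization the raw dot products would be 6.0 and 1.0, so class 0 would win purely because of its larger weight norm. After normalization both logits are equal, because the sample sits at the same angle from both class vectors.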
To further strengthen the discriminative power of the loss, a cosine margin $m$ is introduced, which gives the Large Margin Cosine Loss (LMCL):
$$L_{lmc}=\frac{1}{N} \sum_{i}-\log \frac{e^{s\left(\cos \left(\theta_{y_{i}, i}\right)-m\right)}}{e^{s\left(\cos \left(\theta_{y_{i}, i}\right)-m\right)}+\sum_{j \neq y_{i}} e^{s \cos \left(\theta_{j, i}\right)}}$$
where
$$\begin{aligned} W &=\frac{W^{*}}{\left\|W^{*}\right\|} \\ x &=\frac{x^{*}}{\left\|x^{*}\right\|} \\ \cos \left(\theta_{j, i}\right) &=W_{j}^{T} x_{i} \end{aligned}$$
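The LMCL definition above can be sketched for a single sample as follows. This is an illustrative version only: it takes the cosines $\cos(\theta_{j,i})$ as given (they would come from the normalized $W$ and $x$), the function names are hypothetical, and the defaults `s=30.0`, `m=0.35` are example values rather than settings taken from this post.

```python
import math

def lmcl_loss(cosines, label, s=30.0, m=0.35):
    """LMCL for one sample, given cos(theta_j) for every class j."""
    # Subtract the margin from the target-class cosine only, then scale by s.
    logits = [s * (c - m) if j == label else s * c
              for j, c in enumerate(cosines)]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(f - peak) for f in logits))
    return log_sum - logits[label]

def lmcl_batch_loss(batch_cosines, labels, s=30.0, m=0.35):
    """Average the per-sample losses over the batch (the 1/N sum)."""
    losses = [lmcl_loss(c, y, s, m) for c, y in zip(batch_cosines, labels)]
    return sum(losses) / len(losses)

cosines = [0.8, 0.1, -0.2]
without_margin = lmcl_loss(cosines, label=0, m=0.0)
with_margin = lmcl_loss(cosines, label=0, m=0.35)
```

With `m=0.0` the function reduces to the normalized softmax loss. A positive margin makes the same sample more costly, which is what pushes the target cosine further above the others during training.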
Taking binary classification as an example, compare the decision boundaries of four losses: Softmax loss, Normalized Softmax Loss (NSL), A-Softmax loss, and LMCL:
- Softmax: $\left\|W_{1}\right\| \cos \left(\theta_{1}\right)=\left\|W_{2}\right\| \cos \left(\theta_{2}\right)$
- NSL: $\cos \left(\theta_{1}\right)=\cos \left(\theta_{2}\right)$
- A-Softmax: $C_{1}: \cos \left(m \theta_{1}\right) \geq \cos \left(\theta_{2}\right)$; $C_{2}: \cos \left(m \theta_{2}\right) \geq \cos \left(\theta_{1}\right)$
- LMCL: $C_{1}: \cos \left(\theta_{1}\right) \geq \cos \left(\theta_{2}\right)+m$; $C_{2}: \cos \left(\theta_{2}\right) \geq \cos \left(\theta_{1}\right)+m$
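To make the difference between these boundaries concrete, here is a small sketch that applies three of the rules to a pair of angles in the binary case. The function names are hypothetical, the margin values are illustrative, and the A-Softmax rule is used only for angles with $m\theta \leq \pi$, where $\cos(m\theta)$ is still monotonic. A return value of `None` means the sample falls inside the margin, so neither class condition holds.

```python
import math

def nsl_class(theta1, theta2):
    """Normalized softmax: the larger cosine wins, no margin."""
    return 1 if math.cos(theta1) >= math.cos(theta2) else 2

def a_softmax_class(theta1, theta2, m=2):
    """A-Softmax: multiplicative angular margin (assumes m * theta <= pi)."""
    if math.cos(m * theta1) >= math.cos(theta2):
        return 1
    if math.cos(m * theta2) >= math.cos(theta1):
        return 2
    return None

def lmcl_class(theta1, theta2, m=0.35):
    """LMCL: additive margin in cosine space."""
    if math.cos(theta1) >= math.cos(theta2) + m:
        return 1
    if math.cos(theta2) >= math.cos(theta1) + m:
        return 2
    return None

near_boundary = (math.radians(40), math.radians(50))
clear_case = (math.radians(10), math.radians(80))
```

For `near_boundary`, NSL already assigns class 1, while both margin-based rules return `None`: the sample is on the correct side but not yet far enough from the boundary. For `clear_case`, all three rules agree on class 1.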
Paper: https://arxiv.org/abs/1801.09414
Additive Angular Margin Loss
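ArcFace follows the same basic idea as CosFace (the LMCL above). The two differ only in how the margin $m$ is added: ArcFace adds it to the angle $\theta_{y_{i}}$ instead of subtracting it from the cosine. The ArcFace loss function is: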
$$L_{Arc}=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\left(\cos \left(\theta_{y_{i}}+m\right)\right)}}{e^{s\left(\cos \left(\theta_{y_{i}}+m\right)\right)}+\sum_{j=1, j \neq y_{i}}^{n} e^{s \cos \theta_{j}}}$$
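A per-sample sketch of the ArcFace formula, again taking the cosines as given. The function name and the default values of `s` and `m` are illustrative assumptions. The sketch does not handle the case $\theta_{y_i}+m>\pi$, where $\cos$ stops being monotonic and practical implementations add a special case.

```python
import math

def arcface_loss(cosines, label, s=30.0, m=0.5):
    """ArcFace for one sample, given cos(theta_j) for every class j."""
    # Recover the target angle; clamp to guard acos against rounding error.
    theta = math.acos(max(-1.0, min(1.0, cosines[label])))
    logits = [s * c for c in cosines]
    logits[label] = s * math.cos(theta + m)  # margin is added to the angle
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(f - peak) for f in logits))
    return log_sum - logits[label]

cosines = [0.8, 0.1, -0.2]
plain = arcface_loss(cosines, label=0, s=10.0, m=0.0)
with_margin = arcface_loss(cosines, label=0, s=10.0, m=0.5)
```

Setting `m=0.0` gives back the normalized softmax loss, and a positive `m` lowers the target logit from $s\cos\theta_{y_i}$ to $s\cos(\theta_{y_i}+m)$, which raises the loss for the same sample.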
SphereFace ($m_{1}$), ArcFace ($m_{2}$), and CosFace ($m_{3}$) can be summarized in a single combined formula:
$$L_{uni}=-\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s\left(\cos \left(m_{1} \theta_{y_{i}}+m_{2}\right)-m_{3}\right)}}{e^{s\left(\cos \left(m_{1} \theta_{y_{i}}+m_{2}\right)-m_{3}\right)}+\sum_{j=1, j \neq y_{i}}^{n} e^{s \cos \theta_{j}}}$$
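The combined formula only changes the target-class logit, so it can be sketched with one small helper. The names `combined_target_logit` and `combined_loss` are hypothetical, and the sketch assumes $m_1\theta_{y_i}+m_2\leq\pi$.

```python
import math

def combined_target_logit(cos_theta, s, m1=1.0, m2=0.0, m3=0.0):
    """Target-class logit s * (cos(m1 * theta + m2) - m3)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    return s * (math.cos(m1 * theta + m2) - m3)

def combined_loss(cosines, label, s, m1=1.0, m2=0.0, m3=0.0):
    """Per-sample loss with all three margins applied to the target class."""
    logits = [s * c for c in cosines]
    logits[label] = combined_target_logit(cosines[label], s, m1, m2, m3)
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(f - peak) for f in logits))
    return log_sum - logits[label]

sphereface_like = combined_target_logit(0.8, s=10.0, m1=2.0)   # cos(2 * theta)
arcface_like = combined_target_logit(0.8, s=10.0, m2=0.5)      # cos(theta + 0.5)
cosface_like = combined_target_logit(0.8, s=10.0, m3=0.35)     # cos(theta) - 0.35
```

Each method is recovered by switching on a single margin: $m_1$ alone gives the SphereFace form, $m_2$ alone the ArcFace form, and $m_3$ alone the CosFace form. With $m_1=1$, $m_2=m_3=0$ the target logit is the plain $s\cos\theta_{y_i}$.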
The geometric differences can be shown intuitively as:
Paper: https://arxiv.org/abs/1801.07698v1
Adaptive Cosine-based Loss
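The LMCL described earlier has two key hyperparameters, the scale $s$ and the margin $m$, both of which have to be tuned by hand. Adaptive Cosine-based Loss (AdaCos) adds a part that adjusts these parameters adaptively.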
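Analysis: $s$ should be fairly large, but not too large; likewise, $m$ should be fairly large, but not too large. Compared with the scale parameter $s$, the cosine margin $m$ only shifts the curve of predicted probability against angle in phase, so AdaCos removes $m$ from the loss function and adjusts only $s$ automatically.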
The resulting fixed scale parameter $\tilde{s}_{f}$ and dynamically adaptive scale parameter $\tilde{s}_{d}^{(t)}$ are:
$$\begin{aligned} \tilde{s}_{f}=\frac{\log B_{i}}{\cos \frac{\pi}{4}} &=\frac{\log \sum_{k \neq y_{i}} e^{s \cdot \cos \theta_{i, k}}}{\cos \frac{\pi}{4}} \\ & \approx \sqrt{2} \cdot \log (C-1) \end{aligned}$$
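The last step of the fixed-scale formula is easier to see numerically. The sketch below assumes, as I understand the AdaCos argument, that the angles to non-target classes stay close to $\pi/2$, so every term $e^{s\cos\theta_{i,k}}$ is about 1 and $B_i\approx C-1$. The function names are hypothetical.

```python
import math

def adacos_fixed_scale(num_classes):
    """s_f ~= sqrt(2) * log(C - 1); positive only for 3 or more classes."""
    return math.sqrt(2.0) * math.log(num_classes - 1)

def b_i(nontarget_cosines, s):
    """B_i = sum over non-target classes of exp(s * cos(theta_{i,k}))."""
    return sum(math.exp(s * c) for c in nontarget_cosines)

# With every non-target angle at pi/2 the cosines are 0, so B_i = C - 1
# regardless of the value of s used inside the exponent.
num_classes = 1001
b = b_i([0.0] * (num_classes - 1), s=7.5)
scale_from_b = math.log(b) / math.cos(math.pi / 4)
scale_closed_form = adacos_fixed_scale(num_classes)  # roughly 9.77
```

Because $1/\cos\frac{\pi}{4}=\sqrt{2}$, the two values agree, and the fixed scale depends only on the number of classes.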
$$\tilde{s}_{d}^{(t)}=\begin{cases}\sqrt{2} \cdot \log (C-1) & t=0 \\ \dfrac{\log B_{\mathrm{avg}}^{(t)}}{\cos \left(\min \left(\frac{\pi}{4}, \theta_{\mathrm{med}}^{(t)}\right)\right)} & t \geq 1\end{cases}$$
where
$$B_{\mathrm{avg}}^{(t)}=\frac{1}{N} \sum_{i \in \mathcal{N}^{(t)}} B_{i}^{(t)}=\frac{1}{N} \sum_{i \in \mathcal{N}^{(t)}} \sum_{k \neq y_{i}} e^{\tilde{s}_{d}^{(t-1)} \cdot \cos \theta_{i, k}}$$
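One update of the dynamic scale can be sketched as below. Two points are assumptions rather than statements from this post: $\theta_{\mathrm{med}}^{(t)}$ is taken to be the median of the target-class angles $\theta_{i,y_i}$ in the current mini-batch, and $N$ is taken to be the mini-batch size. The function name is hypothetical.

```python
import math
import statistics

def adacos_dynamic_scale(batch_cosines, labels, prev_scale):
    """Return s_d^(t) for t >= 1 from one mini-batch of cosines."""
    # B_avg: batch mean of sum_{k != y_i} exp(prev_scale * cos(theta_{i,k})).
    b_avg = sum(
        math.exp(prev_scale * c)
        for row, y in zip(batch_cosines, labels)
        for k, c in enumerate(row) if k != y
    ) / len(labels)
    # theta_med: median of the target-class angles (assumed definition).
    target_angles = [math.acos(max(-1.0, min(1.0, row[y])))
                     for row, y in zip(batch_cosines, labels)]
    theta_med = statistics.median(target_angles)
    return math.log(b_avg) / math.cos(min(math.pi / 4, theta_med))

num_classes = 5
s0 = math.sqrt(2.0) * math.log(num_classes - 1)       # the t = 0 case
early = [[0.5, 0.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0, 0.0]]
late = [[1.0, 0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0, 0.0]]
s_early = adacos_dynamic_scale(early, [0, 1], prev_scale=s0)
s_late = adacos_dynamic_scale(late, [0, 1], prev_scale=s_early)
```

In the `early` batch the median target angle is 60 degrees, so the `min` clamps it to $\pi/4$ and the scale stays at its initial value. In the `late` batch the targets are perfectly aligned, the denominator grows to 1, and the scale shrinks to $\log(C-1)$.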
In each iteration, the dynamically adaptive scale parameter $\tilde{s}_{d}^{(t)}$ affects the classification probabilities $P_{i, j}^{(t)}$ differently, and thereby effectively changes the gradients $\left(\frac{\partial \mathcal{L}\left(\vec{x}_{i}\right)}{\partial \vec{x}_{i}}, \frac{\partial \mathcal{L}\left(\vec{W}_{j}\right)}{\partial \vec{W}_{j}}\right)$ used to update the network parameters:
$$P_{i, j}^{(t)}=\frac{e^{\tilde{s}_{d}^{(t)} \cdot \cos \theta_{i, j}}}{\sum_{k=1}^{C} e^{\tilde{s}_{d}^{(t)} \cdot \cos \theta_{i, k}}}$$
$$\begin{aligned} \frac{\partial \mathcal{L}\left(\vec{x}_{i}\right)}{\partial \vec{x}_{i}} &=\sum_{j=1}^{C}\left(P_{i, j}^{(t)}-\mathbb{1}\left(y_{i}=j\right)\right) \cdot \tilde{s}_{d}^{(t)} \frac{\partial \cos \theta_{i, j}}{\partial \vec{x}_{i}} \\ \frac{\partial \mathcal{L}\left(\vec{W}_{j}\right)}{\partial \vec{W}_{j}} &=\left(P_{i, j}^{(t)}-\mathbb{1}\left(y_{i}=j\right)\right) \cdot \tilde{s}_{d}^{(t)} \frac{\partial \cos \theta_{i, j}}{\partial \vec{W}_{j}} \end{aligned}$$
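Both gradients share the factor $\left(P_{i,j}^{(t)}-\mathbb{1}(y_i=j)\right)\cdot\tilde{s}_d^{(t)}$, which is the derivative of the loss with respect to $\cos\theta_{i,j}$. The sketch below checks that factor against a finite-difference estimate for one sample. The function names are hypothetical, and the remaining chain-rule terms $\partial\cos\theta_{i,j}/\partial\vec{x}_i$ and $\partial\cos\theta_{i,j}/\partial\vec{W}_j$ are left out.

```python
import math

def probabilities(cosines, s):
    """P_j = softmax(s * cos(theta_j)) for one sample."""
    peak = max(s * c for c in cosines)
    exps = [math.exp(s * c - peak) for c in cosines]
    total = sum(exps)
    return [e / total for e in exps]

def loss(cosines, label, s):
    """Cross-entropy loss -log P_label for one sample."""
    return -math.log(probabilities(cosines, s)[label])

def grad_wrt_cosines(cosines, label, s):
    """Analytic dL/dcos(theta_j) = s * (P_j - 1[y = j])."""
    p = probabilities(cosines, s)
    return [s * (p_j - (1.0 if j == label else 0.0))
            for j, p_j in enumerate(p)]

def numeric_grad(cosines, label, s, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for j in range(len(cosines)):
        up, down = list(cosines), list(cosines)
        up[j] += eps
        down[j] -= eps
        grads.append((loss(up, label, s) - loss(down, label, s)) / (2 * eps))
    return grads

cosines = [0.6, 0.2, -0.1]
analytic = grad_wrt_cosines(cosines, label=0, s=10.0)
numeric = numeric_grad(cosines, label=0, s=10.0)
```

The scale multiplies every component of this gradient and also reshapes $P_{i,j}^{(t)}$ itself, which is why changing $\tilde{s}_d^{(t)}$ between iterations changes how strongly each sample drives the update.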