Loss Function Evolution for Face Recognition


The loss function largely determines the direction in which a model converges, so a good loss function is critical for the problem being solved. Modern face recognition methods map a face to a low-dimensional feature vector and decide whether two faces belong to the same identity by comparing the distance between their feature vectors. Because the number of face classes is large while the number of samples per class is small, the main line of optimization in face recognition is to train models that increase the inter-class distance and reduce the intra-class distance. In recent years, most gains in face recognition have come from improvements to the loss function.

Softmax Loss

Softmax

Anyone who has studied neural networks should be familiar with the softmax function: it is commonly used as the last layer of a classification network and outputs a probability distribution over the predicted classes.

$$s_{i}=\frac{e^{f_{i}}}{\sum_{j=1}^{c}e^{f_{j}}} \tag{1}$$

where $s_i$ is the predicted probability of the $i$-th class and $c$ is the number of classes.

Softmax Loss

$$\mathfrak{L} = -\sum_{i=1}^{c}y_{i}\log s_{i} \tag{2}$$

where $y_i$ is the $i$-th entry of the label vector. Since a classification problem has exactly one true class, the label vector is 1 at the position of the true label and 0 everywhere else, so the loss simplifies to

$$\mathfrak{L} = -\log s_{k} \tag{3}$$

where $k$ is the index of the true label. With mini-batch training, this can further be written as:

$$\mathfrak{L} = -\sum_{i=1}^{m}\log s_{k_i} \tag{4}$$

where $m$ is the mini-batch size and $k_i$ is the index of the true label of the $i$-th sample.

Intuitively, $s_i \in [0,1]$, and $y=-\log x$ is a decreasing function on this interval, as shown in the figure below: the closer the input is to 1, the smaller the value; the closer it is to 0, the larger the value. Since the model always iterates in the direction that reduces the loss, supervised training drives it toward outputting the true label, i.e., toward a predicted probability of 1 at the true-label position.
[Figure: plot of $y=-\log x$ on $(0,1]$]
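As a concrete illustration, here is a minimal NumPy sketch of Equations 1 and 4; the function names and toy inputs are mine, not from any particular library:

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability (softmax is shift-invariant).
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)  # Equation 1, per sample

def softmax_loss(logits, labels):
    # Equation 4: sum of -log s_{k_i} over the mini-batch.
    s = softmax(logits)
    m = logits.shape[0]
    return -np.log(s[np.arange(m), labels]).sum()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0,  0.3]])  # m = 2 samples, c = 3 classes
labels = np.array([0, 1])              # true-label indices k_i
print(softmax_loss(logits, labels))
```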

Cross Entropy

I had originally planned to write this part myself, with a brief description of information content and information entropy, but after reading many references I could not find a precise explanation of these concepts. The article linked below does not fully clarify them either; however, since this post is mainly about loss functions and the link explains cross entropy itself very well, I simply refer to it here.

交叉熵损失函数原理详解 (a detailed explanation of how the cross-entropy loss works)


Center Loss

Center Loss builds on Softmax Loss by adding a constraint on the distance between each feature and the center feature of its class. From Equations 4 and 1:

$$\begin{aligned} \mathfrak{L} &= -\sum_{i=1}^{m}\log s_{k_i} \\ &= -\sum_{i=1}^{m}\log\frac{e^{W_{k_{i}}^{T}x_{i}+b_{k_{i}}}}{\sum_{j=1}^{c}e^{W_{j}^{T}x_{i}+b_{j}}} \end{aligned} \tag{5}$$

where $W_j$ is the $j$-th column of the weight matrix of the last fully connected layer, $x_i$ is the output feature of the $i$-th sample, and $b_j$ is the corresponding bias.
Adding a constraint on the distance between each output feature and its class center gives:

$$\mathfrak{L} = -\sum_{i=1}^{m}\log\frac{e^{W_{k_{i}}^{T}x_{i}+b_{k_{i}}}}{\sum_{j=1}^{c}e^{W_{j}^{T}x_{i}+b_{j}}} + \frac{\lambda}{2m}\sum_{i=1}^{m}\|x_i-c_{k_i}\|_{2}^{2} \tag{6}$$

where $c_{k_i}$ is the center feature of class $k_i$ and $\lambda$ controls the weight of the two terms. The effect is illustrated in the figure below:

[Figure: effect of the center-loss term on the learned feature distribution]
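To make the extra term concrete, here is a minimal NumPy sketch of the second term of Equation 6, under my own naming; note that in the original paper the class centers are not fixed but are updated each mini-batch from the features assigned to them:

```python
import numpy as np

def center_loss_term(features, labels, centers, lam):
    # Second term of Equation 6: (lambda / 2m) * sum_i ||x_i - c_{k_i}||_2^2
    m = features.shape[0]
    diff = features - centers[labels]   # x_i - c_{k_i}, per sample
    return lam / (2 * m) * np.sum(diff ** 2)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 2))      # m = 4 samples, 2-D features
labels = np.array([0, 2, 1, 0])         # class indices k_i
centers = np.zeros((3, 2))              # toy fixed centers; learned in practice
print(center_loss_term(features, labels, centers, lam=0.5))
```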


Triplet Loss

[Figure: a triplet of Anchor, Positive, and Negative samples]
The triplet loss operates on triplets of (Anchor, Positive, Negative). We want the Anchor and the Positive to be as close as possible (intra-class distance) and the Anchor and the Negative to be as far apart as possible (inter-class distance).

$$\mathfrak{L}(A,P,N)=\sum_{i=1}^{m}\max\left(\|f(x_{i}^{A})-f(x_{i}^{P})\|_{2}^{2}-\|f(x_{i}^{A})-f(x_{i}^{N})\|_{2}^{2}+\alpha,\; 0\right) \tag{7}$$

where $\|f(x_{i}^{A})-f(x_{i}^{P})\|_{2}^{2}$ is the distance between output features of the same class, $\|f(x_{i}^{A})-f(x_{i}^{N})\|_{2}^{2}$ is the distance between output features of different classes, and $\alpha$ is the margin.
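A minimal NumPy sketch of Equation 7, assuming the embeddings $f(x)$ have already been computed (names and toy data are mine):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha):
    # Equation 7: hinge on d(A, P) - d(A, N) + alpha with squared L2 distances.
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)   # intra-class distances
    d_an = np.sum((f_a - f_n) ** 2, axis=1)   # inter-class distances
    return np.sum(np.maximum(d_ap - d_an + alpha, 0.0))

rng = np.random.default_rng(0)
f_a, f_p, f_n = (rng.normal(size=(3, 4)) for _ in range(3))  # 3 triplets, 4-D embeddings
print(triplet_loss(f_a, f_p, f_n, alpha=0.2))
```

Triplets whose negative is already far enough away contribute zero loss, which is why triplet-based training depends heavily on how the triplets are mined.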


L-Softmax

Expanding Equation 5 further gives:

$$\begin{aligned} \mathfrak{L} &= -\sum_{i=1}^{m}\log\frac{e^{W_{k_{i}}^{T}x_{i}+b_{k_{i}}}}{\sum_{j=1}^{c}e^{W_{j}^{T}x_{i}+b_{j}}} \\ &= -\sum_{i=1}^{m}\log\frac{e^{\|W_{k_{i}}\| \|x_{i}\|\cos(\theta_{ik_i})+b_{k_{i}}}}{\sum_{j=1}^{c}e^{\|W_{j}\| \|x_{i}\|\cos(\theta_{ij})+b_{j}}} \end{aligned} \tag{8}$$

where $\theta_{ij}$ is the angle between the feature and the corresponding weight vector. Assuming there are only two classes, Equation 8 gives the decision boundary between them as:
$$\|x_{i}\| \left(\|W_{k_{i}}\| \cos(\theta_{ik_i}) -\|W_{j}\| \cos(\theta_{ij})\right) +(b_{k_i}-b_j)=0$$
Ignoring the biases, we want:
$$\|W_{k_{i}}\| \|x_{i}\|\cos(\theta_{ik_i}) > \|W_{j}\| \|x_{i}\|\cos(\theta_{ij})$$
L-Softmax instead requires:
$$\|W_{k_{i}}\| \|x_{i}\|\cos(m\theta_{ik_i}) > \|W_{j}\| \|x_{i}\|\cos(\theta_{ij})$$
where $\theta_{ik_i} \in [0, \frac{\pi}{m}]$. Since cosine is decreasing on $[0,\pi]$, this gives:

$$\|W_{k_{i}}\| \|x_{i}\|\cos(\theta_{ik_i}) \ge \|W_{k_{i}}\| \|x_{i}\|\cos(m\theta_{ik_i}) > \|W_{j}\| \|x_{i}\|\cos(\theta_{ij})$$
Introducing $m$ makes the model's learning task harder. Unlike earlier ways of adding a margin, this approach enlarges the inter-class distance through an angular margin. In the binary case, the margin added to the decision boundary is shown in the figure below.
[Figure: angular margin between the two decision boundaries in the binary case]
To handle angles $\theta$ outside the restricted range, the original paper defines the loss as:

$$\mathfrak{L}=-\sum_{i=1}^{m}\log\frac{e^{\|W_{k_{i}}\| \|x_{i}\| \psi(\theta_{ik_i})}}{e^{\|W_{k_{i}}\| \|x_{i}\| \psi(\theta_{ik_i})}+\sum_{j \neq k_i}e^{\|W_{j}\| \|x_{i}\|\cos(\theta_{ij})}} \tag{9}$$

where

$$\psi(\theta) = (-1)^{k}\cos(m\theta)-2k, \quad \theta \in\left[\frac{k \pi}{m}, \frac{(k+1) \pi}{m}\right]$$
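$\psi$ simply stitches shifted copies of $\cos(m\theta)$ together so that the result stays monotonically decreasing over all of $[0, \pi]$. A small NumPy sketch (the helper name is mine):

```python
import numpy as np

def psi(theta, m):
    # psi(theta) = (-1)^k * cos(m * theta) - 2k  for theta in [k*pi/m, (k+1)*pi/m]
    k = np.floor(theta * m / np.pi)
    return (-1.0) ** k * np.cos(m * theta) - 2.0 * k

theta = np.linspace(0.0, np.pi, 5)
print(psi(theta, m=4))  # [ 1. -1. -3. -5. -7.]: decreasing on [0, pi], unlike cos(4*theta)
```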


SphereFace (A-Softmax)

A-Softmax is very similar to L-Softmax. It imposes two further constraints on Equation 8: $\|W_j\|=1$ and $b_j=0$. The decision boundary then depends only on the angle $\theta$ between the feature vector and the corresponding weight vector:
$$\cos(\theta_{ik_i}) - \cos(\theta_{ij}) = 0$$
We want:
$$\cos(\theta_{ik_i}) > \cos(\theta_{ij})$$
A-Softmax instead requires:
$$\cos(m\theta_{ik_i}) > \cos(\theta_{ij})$$
where $\theta_{ik_i} \in [0, \frac{\pi}{m}]$. Since cosine is decreasing on $[0,\pi]$, this gives:

$$\cos(\theta_{ik_i}) \ge \cos(m\theta_{ik_i}) > \cos(\theta_{ij})$$

Because $\|W_j\|=1$, the feature angles can be visualized on a hypersphere:
[Figure: A-Softmax feature angles visualized on the hypersphere]
The full loss is given below; compared with Equation 9, the only difference is the weight normalization.

$$\mathfrak{L}=-\sum_{i=1}^{m}\log\frac{e^{ \|x_{i}\| \psi(\theta_{ik_i})}}{e^{ \|x_{i}\| \psi(\theta_{ik_i})}+\sum_{j \neq k_i}e^{\|x_{i}\|\cos(\theta_{ij})}} \tag{10}$$

where

$$\psi(\theta) = (-1)^{k}\cos(m\theta)-2k, \quad \theta \in\left[\frac{k \pi}{m}, \frac{(k+1) \pi}{m}\right]$$
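Putting the pieces together, here is a minimal NumPy sketch of Equation 10, reusing the `psi` helper from the L-Softmax sketch above. The naming is mine, and it omits the annealing tricks the SphereFace authors use to stabilize training:

```python
import numpy as np

def a_softmax_loss(features, weights, labels, m):
    # Columns of `weights` are the W_j; normalize them so ||W_j|| = 1 (b_j = 0).
    W = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    x_norm = np.linalg.norm(features, axis=1, keepdims=True)       # ||x_i||
    cos = np.clip(features @ W / x_norm, -1.0, 1.0)                # cos(theta_ij)
    idx = np.arange(features.shape[0])
    logits = x_norm * cos                                          # ||x_i|| cos(theta_ij)
    theta_target = np.arccos(cos[idx, labels])                     # theta_{i k_i}
    logits[idx, labels] = x_norm[:, 0] * psi(theta_target, m)      # ||x_i|| psi(theta_{i k_i})
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    return -np.log(p[idx, labels]).sum()
```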


CosFace

From Equation 8, the loss is computed from the cosine similarity between the feature and the weight vectors. CosFace L2-normalizes both $W$ and $x$ to unit norm; since a feature norm that is too small would make the softmax outputs too small and the training loss too large, the normalized feature is then rescaled by a fixed factor $s$. This gives

$$\mathfrak{L}=-\sum_{i=1}^{m}\log\frac{e^{s \cos(\theta_{ik_i})}}{\sum_{j=1}^{c}e^{s \cos(\theta_{ij})}} \tag{11}$$
Unlike the multiplicative angular margin of L-Softmax and A-Softmax, the authors add the margin directly to the cosine value:

$$\cos(\theta_{ik_i}) - m > \cos(\theta_{ij})$$

The loss then becomes:

$$\mathfrak{L}=-\sum_{i=1}^{m}\log\frac{e^{s (\cos(\theta_{ik_i})-m)}}{e^{s (\cos(\theta_{ik_i})-m)}+\sum_{j \neq k_i}e^{s \cos(\theta_{ij})}} \tag{12}$$

$$\begin{aligned} \text{s.t.}\quad W &= \frac{W^*}{\|W^*\|} \\ x &= \frac{x^*}{\|x^*\|} \\ \cos(\theta_{ij}) &= W_{j}^{T}x_i \end{aligned}$$
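A minimal NumPy sketch of Equation 12 under my own naming; $s=64$ and $m=0.35$ are the kind of values reported for CosFace, but treat them as assumptions here:

```python
import numpy as np

def cosface_loss(features, weights, labels, s=64.0, m=0.35):
    # Normalize weight columns and features, subtract the margin m from the
    # target-class cosine, then scale by s (Equation 12).
    W = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = x @ W                                  # cos(theta_ij)
    idx = np.arange(features.shape[0])
    cos[idx, labels] -= m                        # additive cosine margin
    logits = s * cos
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    return -np.log(p[idx, labels]).sum()

rng = np.random.default_rng(0)
print(cosface_loss(rng.normal(size=(4, 8)), rng.normal(size=(8, 3)),
                   np.array([0, 2, 1, 0])))
```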

Because the feature vectors are normalized, the features lie on a hypersphere. The effect in two dimensions and in higher dimensions is shown below; the margin $m$ visibly increases the inter-class distance.
[Figure: CosFace feature distributions on the hypersphere in 2-D and higher dimensions, with and without the margin]
Comparison with the earlier loss functions:


ArcFace / InsightFace

The idea behind ArcFace (additive angular margin) shares common ground with SphereFace and the slightly earlier CosFace (additive cosine margin). The key difference is that ArcFace maximizes the decision boundary directly in angular space, while CosFace maximizes it in cosine space. That is also where the name ArcFace comes from: "arc" carries the same meaning as "angular".

$$\mathfrak{L}=-\sum_{i=1}^{m}\log\frac{e^{s \cos(\theta_{ik_i}+m)}}{e^{s \cos(\theta_{ik_i}+m)}+\sum_{j \neq k_i}e^{s \cos(\theta_{ij})}} \tag{13}$$

$$\begin{aligned} \text{s.t.}\quad W &= \frac{W^*}{\|W^*\|} \\ x &= \frac{x^*}{\|x^*\|} \\ \cos(\theta_{ij}) &= W_{j}^{T}x_i \end{aligned}$$
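The only change relative to the CosFace sketch is where the margin enters: it is added to the angle before taking the cosine. A minimal NumPy sketch (the naming is mine; $s=64$ and $m=0.5$ are the kind of defaults reported for ArcFace, assumed here):

```python
import numpy as np

def arcface_loss(features, weights, labels, s=64.0, m=0.5):
    # Normalize, add the angular margin m to the target-class angle,
    # then scale by s (Equation 13).
    W = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    x = features / np.linalg.norm(features, axis=1, keepdims=True)
    cos = np.clip(x @ W, -1.0, 1.0)              # cos(theta_ij)
    idx = np.arange(features.shape[0])
    theta = np.arccos(cos[idx, labels])          # theta_{i k_i}
    cos[idx, labels] = np.cos(theta + m)         # additive angular margin
    logits = s * cos
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    p = e / e.sum(axis=1, keepdims=True)
    return -np.log(p[idx, labels]).sum()
```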

As of this writing, a comparison of the optimization targets of the different loss functions:
[Figure: comparison of the decision boundaries and optimization targets of the different loss functions]

Finally, for a contrarian view of this line of work, see Facebook's take on these methods: Facebook 爆锤深度度量学习:该领域13年来并无进展!网友:沧海横流,方显英雄本色 ("Facebook hammers deep metric learning: no real progress in 13 years!").
