Visual object tracking with Siamese networks, referred to as Siamese tracking, recasts tracking as similarity estimation between a template frame and a region sampled from a candidate frame (the search region).
Although Siamese trackers generally work well, they are prone to failure under challenges such as partial occlusion, scale change, or rotation of one of the two inputs.
The CNN architectures used in Siamese trackers are not inherently equivariant to in-plane rotations of the target. As a result, the model may perform well on object orientations that are represented in the training set but fail on previously unseen orientations.
A straightforward way to force the model to learn rotated variants is to use a training dataset in which in-plane rotations occur naturally, or to introduce them through data augmentation.
Limitations of Data-Augmentation
1. Such procedures require the model to learn separate representations for the different rotated variants of the data.
2. The more variations are considered, the more flexible the tracker model needs to be to capture them all.
3. Further, such an approach makes the model invariant to rotations, which makes its predictions unreliable when the target is surrounded by similar objects, e.g., when tracking one fish in a school of fish.
Example demonstrating rotation non-equivariance in the regular CNN models used in object tracking, where $\psi_\theta$ denotes an in-plane rotation by $\theta$ and $f$ the CNN:
$$\psi_\theta(f(\cdot)) \neq f(\psi_\theta(\cdot))$$
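The inequality can be checked numerically on a toy case. The sketch below is not taken from the paper: it assumes a single 3×3 cross-correlation as a stand-in for the CNN $f$, and a 90-degree rotation (`np.rot90`) as a stand-in for $\psi_\theta$. The helper names are hypothetical.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation, as computed by a CNN layer."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((9, 9))
generic_kernel = rng.standard_normal((3, 3))   # no rotational symmetry
symmetric_kernel = np.ones((3, 3))             # unchanged by a 90-degree rotation

def commutes_with_rot90(kernel):
    """True if filtering the rotated input equals rotating the filtered output."""
    rotate_then_filter = correlate2d_valid(np.rot90(image), kernel)
    filter_then_rotate = np.rot90(correlate2d_valid(image, kernel))
    return np.allclose(rotate_then_filter, filter_then_rotate)

generic_commutes = commutes_with_rot90(generic_kernel)      # learned filters behave like this
symmetric_commutes = commutes_with_rot90(symmetric_kernel)  # only symmetric filters commute
```

A generic filter does not commute with the rotation, whereas a filter that is itself rotation-symmetric does. This is why ordinary learned filters give no equivariance guarantee.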
Equivariance:
The operator and the function commute: transforming the output gives the same result as transforming the input.
$$\text{transform}[F(x)] = F(\text{transform}[x])$$
Invariance:
The input $x$ is transformed, but the output of $F$ does not change.
$$F(x) = F(\text{transform}[x])$$
Covariance:
The input $x$ undergoes $\text{transform}$ and the output of $F$ also changes, not by $\text{transform}$ itself but by another transformation $\text{transform}^{*}$ that makes the two sides agree.
$$\text{transform}^{*}[F(x)] = F(\text{transform}[x])$$
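The three definitions can be told apart on a toy 1D signal. In this sketch the "transform" is assumed to be a circular shift, and the three functions `smooth`, `total` and `downsample` are hypothetical stand-ins for $F$; none of this comes from the paper.

```python
import numpy as np

x = np.arange(8, dtype=float) ** 2          # arbitrary 1D signal of even length

def shift(signal, steps):
    """The 'transform': a circular shift by `steps` samples."""
    return np.roll(signal, steps)

def smooth(signal):
    """Circular 3-tap moving average."""
    return (signal + np.roll(signal, 1) + np.roll(signal, -1)) / 3.0

def total(signal):
    """Sum of all samples."""
    return signal.sum()

def downsample(signal):
    """Keep every second sample."""
    return signal[::2]

# Equivariant: shifting the output equals shifting the input.
is_equivariant = np.allclose(shift(smooth(x), 2), smooth(shift(x, 2)))

# Invariant: the output ignores the shift altogether.
is_invariant = np.isclose(total(x), total(shift(x, 2)))

# Covariant: a shift by 2 on the input appears as a different
# transform (a shift by 1) on the output.
is_covariant = np.allclose(shift(downsample(x), 1), downsample(shift(x, 2)))
same_transform_fails = not np.allclose(shift(downsample(x), 2), downsample(shift(x, 2)))
```

For tracking, the equivariant case is the useful one: the output still carries the transformation, so it can be read back, whereas an invariant output discards it.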
2. Related Work
Equivariant CNNs
SiamRPN++ proposed a training strategy that removes the spatial bias introduced by backbones that are not fully convolutional.
Deeper and Wider Siamese Networks for Real-Time Visual Tracking showed that existing tracking models induce positional bias, which breaks strict translation equivariance.
Scale Equivariance Improves Siamese Tracking (SE-SiamNet) introduced scale-equivariant Siamese trackers; scale equivariance is crucial when the camera zooms or when the target moves in depth.
3. Rotation Equivariant CNNs
Background on rotation equivariance
Rotation Equivariance
SFC-NNs
Learning Steerable Filters for Rotation Equivariant CNNs showed that one of the more robust ways to enforce rotation equivariance in CNNs is to use steerable filters (SFC-NNs).
For rotation equivariance with steerable filters, the network must perform convolutions with several rotated versions of each filter.
Steerable filters not only make it efficient to compute responses for an arbitrary number of discrete filter rotations, they also have strong expressive power.
$k \in \mathbb{Z}$; its admissible values depend on the order $j$ of the current function in the basis, $k \in [-j, j]$.
Following Euler's formula, a rotation of the target is expressed as a phase factor:
$$\rho_{\theta}\psi_{jk}(x) = e^{-ik\theta}\psi_{jk}(x)$$
$e^{-ik\theta}$ denotes a clockwise rotation by $\theta$, and $e^{+ik\theta}$ a counterclockwise rotation by $\theta$.
Note that $\psi_{jk}(x)$ here stands for $\psi_{jk}(\cdot)$: $x$ is a generic argument, not a specific point.
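The phase identity follows in one line if the basis filters are circular harmonics. That polar form is standard for steerable filters but is not written out in these notes, so the form $\psi_{jk}(r,\alpha)=\tau_j(r)\,e^{ik\alpha}$, the radial profile $\tau_j$ and the polar angle $\alpha$ are assumptions here, as is the convention that $\rho_\theta$ shifts the polar angle by $-\theta$:

$$
\begin{aligned}
\psi_{jk}(r,\alpha) &= \tau_j(r)\,e^{ik\alpha} \\
\rho_\theta\,\psi_{jk}(r,\alpha) &= \psi_{jk}(r,\alpha-\theta)
  = \tau_j(r)\,e^{ik(\alpha-\theta)}
  = e^{-ik\theta}\,\psi_{jk}(r,\alpha)
\end{aligned}
$$

Under these assumptions a rotation never requires resampling the filter: it only multiplies each basis function by a complex phase.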
Each learned filter is built as a linear combination of the basis filters, with learned weights $w_{jk} \in \mathbb{C}$:
$$\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K} w_{jk}\,\psi_{jk}(x)$$
$$\rho_{\theta}\Psi(x) = \sum_{j=1}^{J}\sum_{k=0}^{K} w_{jk}\,e^{-ik\theta}\,\psi_{jk}(x)$$
Taking the real part of $\Psi$ gives one orientation of the filter, denoted $\mathrm{Re}\,\Psi(x)$.
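As a rough illustration of how a bank of rotated filters is obtained from one set of weights, the sketch below samples circular harmonics on a pixel grid and applies the phase factors $e^{-ik\theta}$. The Gaussian-ring radial profile, the grid size and the function names `harmonic_basis` and `rotated_filter` are assumptions made for this example, not details from the paper.

```python
import numpy as np

def harmonic_basis(size, J=2, K=2, sigma=0.6):
    """Sample psi_jk = tau_j(r) * exp(i*k*alpha) on an odd size x size grid.

    The Gaussian ring tau_j is an assumed radial profile.
    """
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    r, alpha = np.hypot(xs, ys), np.arctan2(ys, xs)
    basis = np.zeros((J, K + 1, size, size), dtype=complex)
    for j in range(1, J + 1):
        tau = np.exp(-((r - j) ** 2) / (2 * sigma ** 2))
        for k in range(K + 1):
            psi = tau * np.exp(1j * k * alpha)
            if k > 0:
                psi[half, half] = 0.0   # the angle is undefined at the origin
            basis[j - 1, k] = psi
    return basis

def rotated_filter(weights, basis, theta):
    """Re sum_jk w_jk * exp(-i*k*theta) * psi_jk for one orientation theta."""
    k = np.arange(basis.shape[1])
    phases = np.exp(-1j * k * theta)
    return np.real(np.einsum("jk,k,jkyx->yx", weights, phases, basis))

rng = np.random.default_rng(0)
basis = harmonic_basis(size=7)
weights = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))

num_orientations = 8
thetas = 2 * np.pi * np.arange(num_orientations) / num_orientations
bank = np.stack([rotated_filter(weights, basis, t) for t in thetas])
```

With this grid convention (rows are $y$, columns are $x$), the filter at $\theta=\pi/2$ coincides with `np.rot90(bank[0], -1)`, so the quarter-turn entries of the bank are exact rotations of the base filter. The 45-degree entries come from the same weights without any interpolation of the filter.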
4. Rotation Equivariant Siamese Trackers
4.1 Formulation Based on Siam-FC
The authors start from the basic SiamFC model and modify it; they chose it for its simple design.
$$h(z,x) = f(z) * f(x)$$
$f(\cdot)$ is the feature-extraction network.
$*$ denotes the cross-correlation operation, implemented as a convolution.
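The SiamFC response map can be sketched in a few lines. Here random arrays stand in for the feature maps $f(z)$ and $f(x)$, and `cross_correlate` is a hypothetical helper written as explicit loops for clarity, not the tracker's actual implementation.

```python
import numpy as np

def cross_correlate(template_feat, search_feat):
    """Slide a (C, h, w) template over a (C, H, W) search feature map.

    Returns an (H-h+1, W-w+1) response map h(z, x) = f(z) * f(x).
    """
    _, th, tw = template_feat.shape
    _, sh, sw = search_feat.shape
    response = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search_feat[:, i:i + th, j:j + tw]
            response[i, j] = np.sum(window * template_feat)
    return response

rng = np.random.default_rng(0)
template_feat = rng.standard_normal((4, 3, 3))   # stand-in for f(z)
search_feat = np.zeros((4, 10, 10))              # stand-in for f(x)
search_feat[:, 2:5, 6:9] = template_feat         # target sits at row 2, col 6

response = cross_correlate(template_feat, search_feat)
peak = np.unravel_index(np.argmax(response), response.shape)
```

The peak of the response map lands where the template was embedded, which is how the tracker localizes the target. If the embedded patch were rotated, the peak would weaken, which motivates the rotated templates introduced next.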
For the rotational Siamese tracker, the authors introduce rotation-equivariant modules and a group max-pooling module, which selects the cross-correlation encoding for the best-matching orientation among the multiple heatmaps generated in this setup.
The template head of the network is modified to take multiple template images as input (the rotated templates, as shown in the figure). The $\Lambda$ rotated variants form the set $Z = \{z_{1}, z_{2}, \dots, z_{\Lambda}\}$, which covers all rotation angles under consideration.
First compute the features $f(z)$ of the initial target, then rotate $f(z)$. Because the network is rotation equivariant, this is valid in theory.
Rotating the target in the template:
$$y_{\hat{c}}^{(1)}(x,\theta) = \mathrm{Re} \sum_{c=1}^{C}\sum_{j=1}^{J}\sum_{k=0}^{K} w_{\hat{c}cjk}\,e^{-ik\theta}\,(I_c * \psi_{jk})(x)$$
where $I_c$ is channel $c$ of the input image, $c \in \{1, 2, \dots, C\}$.
$\rho_{\theta}\Psi_{\hat{c}c}^{(1)}$ is the rotated filter, with $\hat{c} \in \{1, 2, \dots, \hat{C}\}$.
$$y_{\hat{c}}^{(l)}(x,\theta) = \mathrm{Re}\sum_{c=1}^{C}\sum_{\phi \in \Theta}\sum_{j,k} w_{\hat{c}cjk,\theta-\phi}\,e^{-ik\theta}\,(\zeta_c^{(l-1)}(\cdot, \phi) * \psi_{jk})(x)$$
where the subscript $\theta - \phi$ on the weight $w$ indicates a group convolution over the orientation dimension.
The two sub-networks of RE-SiamNet yield a set of feature maps, $\{\phi(z)$ and $\phi(x)\}$.
$\phi(z)$ is the set of feature maps for the $\Lambda$ rotation angles.
The cross-correlation layer computes the set of heatmaps $\{\hat{h}(z,x)\}$, one for each of the $\Lambda$ rotated template feature maps: $\hat{h}(z_i, x) = \phi(z_i) * \phi(x)$.
Global max pooling over $\{\hat{h}(z, x)\}$ then outputs a single heatmap $h(Z,x)$, i.e., the largest $\hat{h}$ in $\{\hat{h}(z,x)\}$ is selected.
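The rotated-template pipeline can be sketched end to end on toy data. Several simplifications are assumed here: $\Lambda = 4$ so that `np.rot90` gives exact rotations, raw random arrays stand in for the equivariant features $\phi(z)$ and $\phi(x)$, and the pooling is read as "keep the heatmap with the strongest peak", which is one interpretation of the step described above. The helper names are hypothetical.

```python
import numpy as np

def cross_correlate(template, search):
    """'Valid' 2D cross-correlation of a template over a search map."""
    th, tw = template.shape
    rows = search.shape[0] - th + 1
    cols = search.shape[1] - tw + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.sum(search[i:i + th, j:j + tw] * template)
    return out

def group_max_pool(heatmaps):
    """Keep the heatmap with the strongest peak; return it and its index."""
    peaks = heatmaps.reshape(len(heatmaps), -1).max(axis=1)
    best = int(np.argmax(peaks))
    return heatmaps[best], best

rng = np.random.default_rng(1)
template = rng.standard_normal((5, 5))                          # stand-in for phi(z)
rotated_templates = [np.rot90(template, i) for i in range(4)]   # the set Z

search = np.zeros((16, 16))                                     # stand-in for phi(x)
search[4:9, 7:12] = np.rot90(template, 3)                       # target appears rotated

heatmaps = np.stack([cross_correlate(z_i, search) for z_i in rotated_templates])
pooled, best_index = group_max_pool(heatmaps)
peak = np.unravel_index(np.argmax(pooled), pooled.shape)
```

In this toy setup the pooled heatmap is the one produced by the template copy that matches the target's orientation, and its peak gives the target position. An element-wise maximum across the $\Lambda$ heatmaps would be an alternative reading of the pooling step; it yields the same peak location here.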
4.2 Constructing RE-SiamNet Framework
Identify the precision of the tracker in terms of discriminating between orientations of the rotational degree of freedom. The authors consider $\Lambda$ rotation groups, for which RE-SiamNets are perfectly equivariant to the angles in the set
$$\Theta=\left\{\frac{(i-1)}{\Lambda}\cdot 2\pi\right\}_{i=1}^{\Lambda} \Rightarrow \left\{(i-1)\frac{2\pi}{8}\right\}_{i=1}^{\Lambda=8}$$
That is, the angles are evenly spaced; with $\Lambda = 8$ they are the multiples of $2\pi/8$.
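The angle set is easy to enumerate. This small sketch simply evaluates $(i-1)\cdot 2\pi/\Lambda$ for $i = 1, \dots, \Lambda$; the function name `rotation_angles` is a placeholder chosen for this example.

```python
import math

def rotation_angles(num_groups):
    """Angles (i - 1) * 2*pi / num_groups for i = 1..num_groups, in radians."""
    return [(i - 1) * 2 * math.pi / num_groups for i in range(1, num_groups + 1)]

thetas = rotation_angles(8)                          # Lambda = 8
degrees = [round(math.degrees(t)) for t in thetas]   # the same angles in whole degrees
```

With $\Lambda = 8$ the tracker is exactly equivariant to rotations in steps of 45 degrees; a larger $\Lambda$ gives finer angular resolution at the cost of more heatmaps to compute.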
Define the non-parametric encoding $\phi(\cdot)$ based on existing Siamese trackers. The discriminative power of the tracker varies with the choice of $\phi(\cdot)$.
Instead of a single convolution that generates $h(z,x)$, $\Lambda$ convolutions are performed to generate $\Lambda$ different heatmaps. With $\Lambda = 8$, eight convolutions produce eight heatmaps in place of the single heatmap.
Perform global max pooling over the generated feature maps to obtain $h(Z,x)$, which is then passed to the head to localize the target.
5. Unsupervised Relative Rotation Estimation
5.1 Unsupervised 2D pose estimation
The inherent design of RE-SiamNet makes it possible to estimate the relative change in the 2D pose of the target in a fully unsupervised manner. This information is obtained from the result of the group max-pooling step.
Let $i \in \{1,2,\dots,\Lambda\}$ denote one of the $\Lambda$ orientations of the template. Then $i$ is the number of rotation groups by which the pose of the template differs from its appearance in the candidate image if:
$$h(Z,x)=\hat{h}(z_i, x)=\text{group-maxpool}(\{z, x\})$$
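Reading the relative rotation off the pooling result might look like the sketch below. It assumes the peak response of each heatmap $\hat{h}(z_i, x)$ is already available, and it pairs index $i$ with the angle $(i-1)\cdot 2\pi/\Lambda$ from the set $\Theta$; that pairing, the function name and the scores are illustrative assumptions, not values from the paper.

```python
import math

def relative_rotation(peak_scores):
    """Map the winning template orientation to a relative rotation angle.

    peak_scores[i - 1] is the peak response of heatmap h_hat(z_i, x).
    Returns (i, angle), with angle = (i - 1) * 2*pi / Lambda in radians.
    """
    num_groups = len(peak_scores)
    best = max(range(num_groups), key=lambda idx: peak_scores[idx])
    return best + 1, best * 2 * math.pi / num_groups

# Made-up peak responses for Lambda = 8 orientations.
peak_scores = [0.21, 0.35, 0.90, 0.40, 0.18, 0.11, 0.15, 0.19]
group_index, angle = relative_rotation(peak_scores)
```

No pose labels are involved: the estimate is a by-product of which rotated template wins the pooling, so its resolution is limited to the spacing of the angles in $\Theta$.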