Invariance in Neural Networks

This article discusses the equivariance of convolutional (conv) layers in CNNs, i.e. a conv layer detects a feature no matter where it is translated in the input image; the translation invariance contributed by the fully-connected layer, which lets a CNN recognize objects whose position has changed; and how rotation invariance and scale invariance can be learned through data augmentation.


Personally, I think the conv layers in a CNN correspond to "equivariance". Since a conv layer's kernel produces a large activation only for the specific feature it is tuned to, no matter where that feature is translated in the previous layer's feature map, the kernel will find it and produce a large activation at that location. This is what "equivariance" means here.

This equivariance comes from two properties of conv layers: (1) local connectivity and (2) weight sharing.
So-called "deformation invariance" presumably means that if a feature deforms slightly, the activations do not change much.
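To make the equivariance claim above concrete, here is a minimal NumPy sketch; the toy image and the eye-shaped kernel are invented for illustration. Shifting the input feature shifts the conv output's activation peak by exactly the same amount.

    import numpy as np

    def cross_correlate2d(image, kernel):
        # Valid-mode 2D cross-correlation: what CNN "conv" layers compute.
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    image = np.zeros((8, 8))
    image[1:4, 1:4] = np.eye(3)      # a small "feature" near the top-left
    kernel = np.eye(3)               # a filter tuned to exactly that feature

    shifted = np.roll(image, shift=(3, 3), axis=(0, 1))   # translate the feature

    out1 = cross_correlate2d(image, kernel)
    out2 = cross_correlate2d(shifted, kernel)

    # The activation peak follows the feature: conv(shift(x)) == shift(conv(x)).
    print(np.unravel_index(out1.argmax(), out1.shape))    # (1, 1)
    print(np.unravel_index(out2.argmax(), out2.shape))    # (4, 4)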


Personally, I don't think CNNs have built-in "rotation invariance"; it can only be learned by the CNN itself through data augmentation (artificially mirroring, rotating, scaling the training samples, and so on), as sketched below.
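A minimal sketch of that augmentation idea, assuming torchvision is used; the particular transforms and parameters here (15-degree rotations, 224-pixel crops) are illustrative choices, not values from the original post:

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.RandomHorizontalFlip(),                    # mirroring
        transforms.RandomRotation(degrees=15),                # small random rotations
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),  # random scaling/cropping
        transforms.ToTensor(),
    ])
    # Applying `augment` to each training image yields randomly mirrored,
    # rotated and rescaled variants that all keep the same label, which is
    # what forces the network to *learn* the corresponding invariances.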
As for "scale invariance", I think it comes from applying the same processing to feature maps of different sizes (resolutions), as in SSD; see the sketch below. A CNN by itself does not possess scale invariance.
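A hedged sketch of that multi-scale idea: one prediction head with shared weights applied to feature maps of different resolutions. The channel counts and spatial sizes are invented for the example, and real SSD heads differ in detail.

    import torch
    import torch.nn as nn

    head = nn.Conv2d(64, 4, kernel_size=3, padding=1)   # one shared predictor

    f_large = torch.randn(1, 64, 38, 38)   # high-resolution feature map
    f_small = torch.randn(1, 64, 10, 10)   # low-resolution feature map

    # The identical weights see objects at different effective scales, which
    # is what approximates scale invariance in such detectors.
    p_large = head(f_large)    # shape (1, 4, 38, 38)
    p_small = head(f_small)    # shape (1, 4, 10, 10)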

True "invariance" should correspond to the pooling layers. Take max-pooling as an example: with 2x2 pooling, a feature can shift anywhere within the 2x2 region and the result stays the same, as the toy check below shows. The deeper the pooling layer, the larger the receptive field of its 2x2 kernel, and hence the larger the translation that is tolerated. I feel this kind of "invariance" mostly matters for classification tasks. (Put bluntly, detection is classification of local regions, and segmentation is classification of individual pixels.)
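A toy NumPy check of the 2x2 claim (values invented): a feature shifted within a single pooling window leaves the pooled output unchanged, while a shift across a window boundary would not.

    import numpy as np

    def max_pool_2x2(x):
        # Non-overlapping 2x2 max-pooling via reshape.
        h, w = x.shape
        return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    a = np.zeros((4, 4)); a[0, 0] = 1.0   # feature at (0, 0)
    b = np.zeros((4, 4)); b[1, 1] = 1.0   # same feature, shifted within the same window

    print(np.array_equal(max_pool_2x2(a), max_pool_2x2(b)))   # True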


Another blog post gives a fairly vivid explanation of the various invariances of CNNs to image features:
http://blog.youkuaiyun.com/xiaojiajia007/article/details/78396319 (original)

1. One explanation

(This part mainly argues that the fully-connected layer contributes the invariance in CNN classification.)

After some thought, I do not believe that pooling operations are responsible for the translation-invariant property in CNNs. I believe that invariance (at least to translation) is due to the convolution filters (not specifically the pooling) and due to the fully-connected layer.

For instance, let's use the Fig. 1 as reference:

The blue volume represents the input image, while the green and yellow volumes represent layer 1 and layer 2 output activation volumes (see CS231n Convolutional Neural Networks for Visual Recognition if you are not familiar with these volumes). At the end, we have a fully-connected layer that is connected to all activation points of the yellow volume.

These volumes are built using a convolution plus a pooling operation. The pooling operation reduces the height and width of these volumes, while the increasing number of filters in each layer increases the volume depth.

For the sake of the argument, let's suppose that we have very "ludic" filters, as shown in Fig. 2:

    the first layer filters (which will generate the green volume) detect eyes, noses and other basic shapes (in real CNNs, first layer filters will match lines and very basic textures);
    the second layer filters (which will generate the yellow volume) detect faces, legs and other objects that are aggregations of the first layer filters. Again, this is only an example: real-life convolution filters may detect objects that have no meaning to humans.

Now suppose that there is a face at one of the corners of the image (represented by two red points and a magenta point). The two eyes are detected by the first filter and therefore produce two activations in the first slice of the green volume. The same happens for the nose, except that it is detected by the second filter and appears in the second slice. Next, the face filter will find that there are two eyes and a nose next to each other, and it generates an activation in the yellow volume (within the same region as the face in the input image). Finally, the fully-connected layer detects that there is a face (and maybe a leg and an arm detected by other filters) and outputs that it has detected a human body.

Now suppose that the face has moved to another corner of the image, as shown in Fig. 3:

The same number of activations occurs in this example; however, they occur in a different region of the green and yellow volumes. Therefore, any activation point in the first slice of the yellow volume means that a face was detected, INDEPENDENTLY of the face's location. The fully-connected layer is then responsible for "translating" a face and two arms into a human body. In both examples, an activation was received at one of the fully-connected neurons. However, in each example, the activation path inside the FC layer was different, meaning that correct learning at the FC layer is essential to ensure the invariance property.

It must be noticed that the pooling operation only "compresses" the activation volumes; if there were no pooling in this example, an activation at the first slice of the yellow volume would still mean a face.

In conclusion, what makes a CNN invariant to object translation is the architecture of the neural network: the convolution filters and the fully-connected layer. Additionally, I believe that if a CNN is trained on faces shown only at one corner, during the learning process the fully-connected layer may become insensitive to faces in other corners.
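A hedged numeric illustration of this argument, with all shapes and values invented: if the FC weights attached to the "face" slice are equal at every spatial position, the output cannot depend on where in the slice the activation occurs; and if training only ever shows faces in one corner, nothing forces the weights at the other positions to stay useful.

    import numpy as np

    H = W = 4                        # spatial size of one "yellow volume" slice
    w_fc = np.full(H * W, 0.5)       # identical FC weight at every position

    def fc_output(face_slice):
        return w_fc @ face_slice.ravel()

    corner1 = np.zeros((H, W)); corner1[0, 0] = 1.0   # face detected top-left
    corner2 = np.zeros((H, W)); corner2[3, 3] = 1.0   # face detected bottom-right

    print(fc_output(corner1), fc_output(corner2))     # 0.5 0.5 -- identical output

    # Conversely, if training only ever activates position (0, 0), gradient
    # descent has no reason to keep the weights at the other positions useful,
    # which is exactly the caveat raised in the paragraph above.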


2. Another explanation

In addition to the answers already here: feature learning in convnets is guided by an error signal that is backpropagated through the network, from the output layer all the way back to the input layer.

Each neuron in a particular layer has a small receptive field which scans the whole preceding layer, so in a typical convnet layer each neuron gets a chance to learn a distinct feature in an image irrespective of the spatial position of that feature, since the convolution operation will always find that feature even when it undergoes translation. If the receptive fields did not convolve over the whole image or stimulus, it would not be possible for convnet neurons to learn those translation-equivariant features.

Moreover, the series of alternating convolutional and pooling layers (mostly max-pooling) helps the convnet build up tolerance to severe distortions of the input stimuli. The effective receptive fields of the neurons also become bigger higher up the hierarchy due to the pooling operation, letting the convnet process context and integrate features over a large spatial extent; a rough calculation is sketched below. This also makes convnets very robust and capable of recognizing novel stimuli.
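A back-of-the-envelope calculation of that receptive-field growth; the particular stack of 3x3 convs and 2x2 max-pools is an assumed architecture, not one specified in the answer.

    def receptive_field(layers):
        # layers: list of (kernel_size, stride); returns the receptive field
        # of one top-layer neuron, measured in input pixels.
        rf, jump = 1, 1
        for k, s in layers:
            rf += (k - 1) * jump   # each layer widens the field...
            jump *= s              # ...by more, the more striding sits below it
        return rf

    stack = [(3, 1), (2, 2),       # conv3x3 + maxpool2x2
             (3, 1), (2, 2),       # conv3x3 + maxpool2x2
             (3, 1)]               # conv3x3
    print(receptive_field(stack))  # 18 -> one neuron sees an 18x18 input patch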

Other forms of invariance are built up artificially by rotating, mirroring and scaling the training examples. This is because it is important to see the training set from different points of view in order to generalize better.
---------------------
Author: voxel_grid
Source: CSDN
Original: https://blog.youkuaiyun.com/voxel_grid/article/details/79275637
Copyright notice: this is the blogger's original article; please include a link to the original when reposting.
