Convolutional Networks (2)
Variants of the basic convolution function
The convolution functions used in neural networks differ from the standard discrete convolution operation in a few ways.
- Convolution is usually applied many times in parallel, because a single kernel can only extract one kind of feature.
- The input is usually not just a grid of real values, but a grid of vector-valued observations, e.g. a color image has 3 channels (red, green, and blue).
- Assume we have a 4-D kernel array $K$ with element $k_{i,l,m,n}$ giving the connection strength between a unit in channel $i$ of the output and a unit in channel $l$ of the input, with an offset of $m$ rows and $n$ columns between the output unit and the input unit.
- Assume our input consists of observed data $V$ with element $v_{i,j,k}$ giving the value of the input unit within channel $i$ at row $j$ and column $k$.
- Assume our output $Z$ has the same format as $V$.

If $Z$ is produced by convolving $K$ across $V$ without flipping $K$, then
$$z_{i,j,k} = \sum_{l,m,n} v_{l,\,j+m,\,k+n}\, k_{i,l,m,n}$$
If we want to skip over some positions of the kernel in order to reduce the computational cost, we can sample the output only every $s$ pixels in each direction, i.e. move the kernel with a stride of $s$. This defines a downsampled convolution function $c$ such that:
$$z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ v_{l,\,j\times s+m,\,k\times s+n}\, k_{i,l,m,n} \right]$$
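As a concrete reference, here is a minimal NumPy sketch of this downsampled multi-channel convolution; the function name `conv_no_flip` is my own, and the loops follow the summation above literally (zero-based indexing, no kernel flipping):

```python
import numpy as np

def conv_no_flip(K, V, s=1):
    """z[i, j, k] = sum_{l,m,n} V[l, j*s + m, k*s + n] * K[i, l, m, n].

    K: 4-D kernel array, shape (out_channels, in_channels, kH, kW)
    V: 3-D input,        shape (in_channels, H, W)
    s: sampling stride in both spatial directions
    """
    out_c, in_c, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_c, (H - kh) // s + 1, (W - kw) // s + 1))
    for i in range(out_c):                      # output channel
        for j in range(Z.shape[1]):             # output row
            for k in range(Z.shape[2]):         # output column
                patch = V[:, j * s: j * s + kh, k * s: k * s + kw]
                Z[i, j, k] = np.sum(patch * K[i])
    return Z

# Usage: 3-channel 7x7 input, 2 output channels, 3x3 kernels, stride 2.
V = np.random.randn(3, 7, 7)
K = np.random.randn(2, 3, 3, 3)
print(conv_no_flip(K, V, s=2).shape)  # (2, 3, 3)
```

With $s=1$ this reduces to the plain (unstrided) formula above.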
Zero padding
It prevents the width of the output from shrinking at every layer.
Without zero padding: if the image is $m \times m$ and the kernel is $k \times k$, then the output is $(m-k+1) \times (m-k+1)$.
type | zero padding | output size |
---|---|---|
valid | none | $(m-k+1) \times (m-k+1)$ |
same | enough zeros to keep the output the same size as the input | $m \times m$ |
full | enough zeros that every input pixel is visited $k$ times in each direction | $(m+k-1) \times (m+k-1)$ |
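The table can be checked numerically with SciPy's `convolve2d`, whose `mode` argument implements exactly these three conventions (SciPy is not mentioned in the text; it is just a convenient reference implementation):

```python
import numpy as np
from scipy.signal import convolve2d

m, k = 8, 3
image = np.random.randn(m, m)    # m x m image
kernel = np.random.randn(k, k)   # k x k kernel

for mode in ("valid", "same", "full"):
    print(mode, convolve2d(image, kernel, mode=mode).shape)
# valid -> (6, 6)   i.e. (m-k+1, m-k+1), no padding
# same  -> (8, 8)   i.e. (m, m), enough zeros to keep the input size
# full  -> (10, 10) i.e. (m+k-1, m+k-1), every pixel visited k times
```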
How to train
Assume the loss function is $J(V, K)$, and that during back-propagation we receive an array $G$ holding the gradient of the loss with respect to the output, $g_{i,j,k} = \partial J(V, K) / \partial z_{i,j,k}$.

To train the network, we need the derivatives of the loss with respect to the weights in the kernel. Applying the chain rule to the forward formula above gives

$$\frac{\partial J}{\partial k_{i,l,m,n}} = \sum_{j,k} g_{i,j,k}\, v_{l,\,j\times s+m,\,k\times s+n}$$

If this layer is not the bottom layer of the network, we also need the gradient with respect to $V$ in order to back-propagate the error farther down; it is obtained by spreading each $g_{i,j,k}$ back over the input positions that contributed to $z_{i,j,k}$ (a transposed convolution of $G$ with the kernel $K$).
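Below is a minimal NumPy sketch of the kernel gradient, verified with a finite difference. It reuses the hypothetical `conv_no_flip` function from the sketch above; the name `kernel_grad` and the toy loss $J = \tfrac{1}{2}\sum z^2$ (for which $G = Z$) are illustrative choices of mine, not from the text.

```python
import numpy as np

def kernel_grad(G, V, K_shape, s=1):
    """dJ/dK[i, l, m, n] = sum_{j,k} G[i, j, k] * V[l, j*s + m, k*s + n]."""
    out_c, in_c, kh, kw = K_shape
    dK = np.zeros(K_shape)
    for i in range(G.shape[0]):
        for j in range(G.shape[1]):
            for k in range(G.shape[2]):
                dK[i] += G[i, j, k] * V[:, j * s: j * s + kh, k * s: k * s + kw]
    return dK

# Toy loss J = 0.5 * sum(Z**2), so the incoming gradient G equals Z itself.
V = np.random.randn(3, 6, 6)
K = np.random.randn(2, 3, 3, 3)
Z = conv_no_flip(K, V, s=1)           # forward pass from the earlier sketch
dK = kernel_grad(Z, V, K.shape, s=1)

# Finite-difference check on one kernel weight.
eps = 1e-6
K_pert = K.copy()
K_pert[0, 1, 2, 0] += eps
J0 = 0.5 * np.sum(Z ** 2)
J1 = 0.5 * np.sum(conv_no_flip(K_pert, V, s=1) ** 2)
print((J1 - J0) / eps, dK[0, 1, 2, 0])  # the two numbers should agree closely
```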
Data types
With the convolution operation, the network can process images of varying width and height: the kernel is simply applied a different number of times depending on the size of the input, and the output size scales accordingly, as in the sketch below.
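A small illustration of this point (again using the hypothetical `conv_no_flip` from above): the same fixed kernels applied to inputs of different spatial sizes, with only the output size changing.

```python
import numpy as np

K = np.random.randn(2, 3, 3, 3)        # one fixed set of 3x3 kernels, 3 -> 2 channels
for size in (5, 9, 16):                # inputs of different spatial sizes
    V = np.random.randn(3, size, size)
    print(size, conv_no_flip(K, V, s=1).shape)
# 5  -> (2, 3, 3)
# 9  -> (2, 7, 7)
# 16 -> (2, 14, 14)   same weights, just applied a different number of times
```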
Efficient convolution algorithms
Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
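A quick 1-D numerical sketch of that equivalence with NumPy (zero-padding both signals to length $n + k - 1$ makes the circular FFT convolution match the "full" linear convolution):

```python
import numpy as np

x = np.random.randn(64)   # input signal
w = np.random.randn(9)    # kernel

# Direct discrete convolution ("full" mode).
direct = np.convolve(x, w, mode="full")

# Frequency-domain route: pad, FFT, point-wise multiply, inverse FFT.
n = len(x) + len(w) - 1
fft_based = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w, n), n)

print(np.allclose(direct, fft_based))  # True
```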
More information about data types
type | single channel | multi-channel |
---|---|---|
1-D | Audio waveform: The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step. | Skeleton animation data: Animations of 3-D computer-rendered characters are generated by altering the pose of a “skeleton” over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character’s skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint. |
2-D | Audio data that has been preprocessed with a Fourier transform: We can transform the audio waveform into a 2D array with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation but at a different height in the network’s output. | Color image data: One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions. |
3-D | Volumetric data: A common source of this kind of data is medical imaging technology, such as CT scans. | Color video data: One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame. |
Next
Theano or Recurrent and Recursive Nets