Convolutional Networks (2)
Variants of the basic convolution function
The convolution functions used in neural networks differ from the standard discrete convolution operation in a few ways.
- Convolution is usually applied many times in parallel, because a single kernel can only extract one kind of feature.
- The input is usually not just a grid of real values, but a grid of vector-valued observations, e.g. a color image has 3 channels (red, green, and blue).
- Assume we have a 4-D kernel array $K$ with element $k_{i,l,m,n}$ giving the connection strength between a unit in channel $i$ of the output and a unit in channel $l$ of the input, with an offset of $m$ rows and $n$ columns between the output unit and the input unit.
- Assume our input consists of observed data $V$ with element $v_{i,j,k}$ giving the value of the input unit within channel $i$ at row $j$ and column $k$.
- Assume our output $Z$ has the same format as $V$.

If $Z$ is produced by convolving $K$ across $V$ without flipping $K$, then
$$z_{i,j,k} = \sum_{l,m,n} v_{l,\,j+m,\,k+n}\, k_{i,l,m,n}$$
If we want to skip over some positions of the kernel in order to reduce the computational cost, we can sample the output only every $s$ pixels in each direction, i.e. move the kernel with a stride of $s$. This defines a downsampled convolution function $c$ such that:
$$z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ v_{l,\,j\times s+m,\,k\times s+n}\, k_{i,l,m,n} \right]$$
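As a concrete reference, here is a minimal NumPy sketch of this downsampled multi-channel convolution; the function name `conv_no_flip` is my own, and the loops follow the summation above literally (zero-based indexing, no kernel flipping):

```python
import numpy as np

def conv_no_flip(K, V, s=1):
    """z[i, j, k] = sum_{l,m,n} V[l, j*s + m, k*s + n] * K[i, l, m, n].

    K: 4-D kernel array, shape (out_channels, in_channels, kH, kW)
    V: 3-D input,        shape (in_channels, H, W)
    s: sampling stride in both spatial directions
    """
    out_c, in_c, kh, kw = K.shape
    _, H, W = V.shape
    Z = np.zeros((out_c, (H - kh) // s + 1, (W - kw) // s + 1))
    for i in range(out_c):                      # output channel
        for j in range(Z.shape[1]):             # output row
            for k in range(Z.shape[2]):         # output column
                patch = V[:, j * s: j * s + kh, k * s: k * s + kw]
                Z[i, j, k] = np.sum(patch * K[i])
    return Z

# Usage: 3-channel 7x7 input, 2 output channels, 3x3 kernels, stride 2.
V = np.random.randn(3, 7, 7)
K = np.random.randn(2, 3, 3, 3)
print(conv_no_flip(K, V, s=2).shape)  # (2, 3, 3)
```

With $s=1$ this reduces to the plain (unstrided) formula above.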
Zero padding
It prevents the width of the output from shrinking at every layer.
Without zero padding: if the image is $m \times m$ and the kernel is $k \times k$, then the output is $(m-k+1) \times (m-k+1)$.
type | zero padding | output size |
---|---|---|
valid | none | $(m-k+1) \times (m-k+1)$ |
same | enough zeros to keep the output the same size as the input | $m \times m$ |
full | enough zeros that every input pixel is visited $k$ times in each direction | $(m+k-1) \times (m+k-1)$ |
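The table can be checked numerically with SciPy's `convolve2d`, whose `mode` argument implements exactly these three conventions (SciPy is not mentioned in the text; it is just a convenient reference implementation):

```python
import numpy as np
from scipy.signal import convolve2d

m, k = 8, 3
image = np.random.randn(m, m)    # m x m image
kernel = np.random.randn(k, k)   # k x k kernel

for mode in ("valid", "same", "full"):
    print(mode, convolve2d(image, kernel, mode=mode).shape)
# valid -> (6, 6)   i.e. (m-k+1, m-k+1), no padding
# same  -> (8, 8)   i.e. (m, m), enough zeros to keep the input size
# full  -> (10, 10) i.e. (m+k-1, m+k-1), every pixel visited k times
```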
How to train
Assume the loss function is $J(V, K)$, and that during back-propagation we receive an array $G$ holding the gradient of the loss with respect to the output, $g_{i,j,k} = \partial J(V, K) / \partial z_{i,j,k}$.

To train the network, we need the derivatives of the loss with respect to the weights in the kernel. Applying the chain rule to the forward formula above gives

$$\frac{\partial J}{\partial k_{i,l,m,n}} = \sum_{j,k} g_{i,j,k}\, v_{l,\,j\times s+m,\,k\times s+n}$$

If this layer is not the bottom layer of the network, we also need the gradient with respect to $V$ in order to back-propagate the error farther down; it is obtained by spreading each $g_{i,j,k}$ back over the input positions that contributed to $z_{i,j,k}$ (a transposed convolution of $G$ with the kernel $K$).
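Below is a minimal NumPy sketch of the kernel gradient, verified with a finite difference. It reuses the hypothetical `conv_no_flip` function from the sketch above; the name `kernel_grad` and the toy loss $J = \tfrac{1}{2}\sum z^2$ (for which $G = Z$) are illustrative choices of mine, not from the text.

```python
import numpy as np

def kernel_grad(G, V, K_shape, s=1):
    """dJ/dK[i, l, m, n] = sum_{j,k} G[i, j, k] * V[l, j*s + m, k*s + n]."""
    out_c, in_c, kh, kw = K_shape
    dK = np.zeros(K_shape)
    for i in range(G.shape[0]):
        for j in range(G.shape[1]):
            for k in range(G.shape[2]):
                dK[i] += G[i, j, k] * V[:, j * s: j * s + kh, k * s: k * s + kw]
    return dK

# Toy loss J = 0.5 * sum(Z**2), so the incoming gradient G equals Z itself.
V = np.random.randn(3, 6, 6)
K = np.random.randn(2, 3, 3, 3)
Z = conv_no_flip(K, V, s=1)           # forward pass from the earlier sketch
dK = kernel_grad(Z, V, K.shape, s=1)

# Finite-difference check on one kernel weight.
eps = 1e-6
K_pert = K.copy()
K_pert[0, 1, 2, 0] += eps
J0 = 0.5 * np.sum(Z ** 2)
J1 = 0.5 * np.sum(conv_no_flip(K_pert, V, s=1) ** 2)
print((J1 - J0) / eps, dK[0, 1, 2, 0])  # the two numbers should agree closely
```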
Data types
With the convolution operation, the network can process images of varying width and height: the kernel is simply applied a different number of times depending on the size of the input, and the output size scales accordingly, as in the sketch below.
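A small illustration of this point (again using the hypothetical `conv_no_flip` from above): the same fixed kernels applied to inputs of different spatial sizes, with only the output size changing.

```python
import numpy as np

K = np.random.randn(2, 3, 3, 3)        # one fixed set of 3x3 kernels, 3 -> 2 channels
for size in (5, 9, 16):                # inputs of different spatial sizes
    V = np.random.randn(3, size, size)
    print(size, conv_no_flip(K, V, s=1).shape)
# 5  -> (2, 3, 3)
# 9  -> (2, 7, 7)
# 16 -> (2, 14, 14)   same weights, just applied a different number of times
```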
Efficient convolution algorithms
Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
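A quick 1-D numerical sketch of that equivalence with NumPy (zero-padding both signals to length $n + k - 1$ makes the circular FFT convolution match the "full" linear convolution):

```python
import numpy as np

x = np.random.randn(64)   # input signal
w = np.random.randn(9)    # kernel

# Direct discrete convolution ("full" mode).
direct = np.convolve(x, w, mode="full")

# Frequency-domain route: pad, FFT, point-wise multiply, inverse FFT.
n = len(x) + len(w) - 1
fft_based = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w, n), n)

print(np.allclose(direct, fft_based))  # True
```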
More information about data types
type | single channel | multi-channel |
---|---|---|
1-D | Audio waveform: The axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step. | Skeleton animation data: Animations of 3-D computer-rendered characters are generated by altering the pose of a “skeleton” over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character’s skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint. |
2-D | Audio data that has been preprocessed with a Fourier transform: We can transform the audio waveform into a 2D array with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation but at a different height in the network’s output. | Color image data: One channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions. |
3-D | Volumetric data: A common source of this kind of data is medical imaging technology, such as CT scans. | Color video data: One axis corresponds to time, one to the height of the video frame, and one to the width of the video frame. |
Next
Theano or Recurrent and Recursive Nets