Chapter 11 CNNs (2)

This post takes a closer look at variants of the basic convolution function used in convolutional networks and at how such networks are trained, covering zero padding, efficient convolution algorithms, and practical applications to different data types.



Variants of the basic convolution function

The convolution functions used in neural networks differ from the standard discrete convolution operation in several ways:

  1. Convolution in a network means many convolutions applied in parallel, because a single kernel can extract only one kind of feature.
  2. The input is usually not just a grid of real values but a grid of vector-valued observations, e.g. a color image has 3 channels (red, green, blue).
  • Assume that we have a 4-D kernel array K with elements ki,l,m,n , giving the connection strength between a unit in channel i of the output and a unit in channel l of the input, with an offset of m rows and n columns between the output unit and the input unit.
  • Assume our input consists of observed data V with element vi,j,k giving the value of the input unit within channel i at row j and column k .
  • Assume our output consists of Z with the same format as V . If Z is produced by convolving K across V without flipping K ,
    then
    $$z_{i,j,k} = \sum_{l,m,n} v_{l,\,j+m,\,k+n}\, k_{i,l,m,n}$$

    If we want to skip over some positions of the kernel in order to reduce the computational cost, we can sample only every s pixels in each direction (the parameter s is called the stride). This defines a downsampled convolution function c such that:
    $$z_{i,j,k} = c(K, V, s)_{i,j,k} = \sum_{l,m,n} \left[ v_{l,\, j \times s + m,\, k \times s + n}\, k_{i,l,m,n} \right]$$
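The strided, multi-channel formula above can be sketched directly with loops. This is a minimal illustration, not an efficient implementation; the function name `conv` and the array shapes are assumptions for this example:

```python
import numpy as np

def conv(K, V, s):
    """Strided multi-channel 'convolution' without kernel flipping:
        z[i,j,k] = sum_{l,m,n} V[l, j*s + m, k*s + n] * K[i, l, m, n]
    K: (out_channels, in_channels, kh, kw); V: (in_channels, H, W)."""
    out_c, in_c, kh, kw = K.shape
    _, H, W = V.shape
    oh = (H - kh) // s + 1          # number of valid vertical positions
    ow = (W - kw) // s + 1          # number of valid horizontal positions
    Z = np.zeros((out_c, oh, ow))
    for i in range(out_c):
        for j in range(oh):
            for k in range(ow):
                # dot the kernel for output channel i with one input patch
                patch = V[:, j*s:j*s + kh, k*s:k*s + kw]
                Z[i, j, k] = np.sum(patch * K[i])
    return Z
```

With s = 1 this reduces to the plain (valid) convolution of the previous formula; larger s simply skips positions, shrinking the output by roughly a factor of s in each direction.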

zero-pad

It can prevent the width of the output from shrinking layer by layer.
Without zero padding: if the image is $m \times m$ and the kernel is $k \times k$, then:
output: $(m - k + 1) \times (m - k + 1)$

| type  | zero-padding                          | output size                      |
|-------|---------------------------------------|----------------------------------|
| valid | none                                  | $(m - k + 1) \times (m - k + 1)$ |
| same  | enough to keep the output size equal  | $m \times m$                     |
| full  | enough that every overlap is visited  | $(m + k - 1) \times (m + k - 1)$ |
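The three conventions in the table differ only in how the output width depends on the input width m and kernel width k. A small helper (the name `output_size` is just for this sketch) makes the arithmetic concrete:

```python
def output_size(m, k, mode):
    """Output width of an m x m image convolved with a k x k kernel
    under the three zero-padding conventions in the table above."""
    if mode == "valid":   # no padding: kernel must fit entirely inside
        return m - k + 1
    if mode == "same":    # pad just enough that the size is preserved
        return m
    if mode == "full":    # pad enough that every overlap is visited
        return m + k - 1
    raise ValueError(f"unknown mode: {mode}")
```

For example, a 5×5 image with a 3×3 kernel yields 3×3 (valid), 5×5 (same), or 7×7 (full) output.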

how to train

Assume the loss function is $J(V, K)$. During backpropagation we receive an array G with $G_{i,j,k} = \frac{\partial}{\partial z_{i,j,k}} J(V, K)$.

To train the network, we need to compute the derivatives of the loss with respect to the weights in the kernel. To do so, we can use a function

$$g(G, V, s)_{i,j,k,l} = \frac{\partial}{\partial k_{i,j,k,l}} J(V, K) = \sum_{m,n} g_{i,m,n}\, v_{j,\, m \times s + k,\, n \times s + l}$$
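As a self-contained sketch, the kernel-gradient formula can be implemented with nested loops and checked against a finite-difference approximation of the loss (here simply J = sum(Z)). The names `conv_forward` and `g`, and the loop-based style, are assumptions for illustration; real frameworks vectorize this:

```python
import numpy as np

def conv_forward(K, V, s):
    # z[i,j,k] = sum_{l,m,n} V[l, j*s+m, k*s+n] * K[i,l,m,n]
    out_c, in_c, kh, kw = K.shape
    _, H, W = V.shape
    oh, ow = (H - kh) // s + 1, (W - kw) // s + 1
    Z = np.zeros((out_c, oh, ow))
    for i in range(out_c):
        for j in range(oh):
            for k in range(ow):
                Z[i, j, k] = np.sum(V[:, j*s:j*s + kh, k*s:k*s + kw] * K[i])
    return Z

def g(G, V, s, kernel_shape):
    # g[i,j,k,l] = dJ/dK[i,j,k,l] = sum_{m,n} G[i,m,n] * V[j, m*s+k, n*s+l]
    out_c, in_c, kh, kw = kernel_shape
    grad_K = np.zeros(kernel_shape)
    _, oh, ow = G.shape
    for i in range(out_c):
        for j in range(in_c):
            for k in range(kh):
                for l in range(kw):
                    for m in range(oh):
                        for n in range(ow):
                            grad_K[i, j, k, l] += G[i, m, n] * V[j, m*s + k, n*s + l]
    return grad_K
```

Because the output is linear in the kernel weights, the finite-difference check below agrees with `g` up to floating-point rounding.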

If this layer is not the bottom layer of the network, we’ll need to compute the gradient with respect to V in order to backpropagate the error farther down. To do so, we can use a function

$$h(K, G, s)_{i,j,k} = \frac{\partial}{\partial v_{i,j,k}} J(V, K) = \sum_{\substack{l,m \\ s \times l + m = j}} \; \sum_{\substack{n,p \\ s \times n + p = k}} \; \sum_{q} k_{q,i,m,p}\, g_{q,l,n}$$
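The constraints under the sums say: input position (j, k) receives gradient from every output position that read it. Equivalently, one can loop over output positions and scatter each output gradient back onto its input patch. A loop-based sketch (function names `conv_forward` and `h` are assumptions for this example), again verified with finite differences:

```python
import numpy as np

def conv_forward(K, V, s):
    # z[i,j,k] = sum_{l,m,n} V[l, j*s+m, k*s+n] * K[i,l,m,n]
    out_c, in_c, kh, kw = K.shape
    _, H, W = V.shape
    oh, ow = (H - kh) // s + 1, (W - kw) // s + 1
    Z = np.zeros((out_c, oh, ow))
    for i in range(out_c):
        for j in range(oh):
            for k in range(ow):
                Z[i, j, k] = np.sum(V[:, j*s:j*s + kh, k*s:k*s + kw] * K[i])
    return Z

def h(K, G, s, input_shape):
    # h[i,j,k] = dJ/dV[i,j,k]: scatter each output gradient back onto
    # every input position that contributed to it.
    out_c, in_c, kh, kw = K.shape
    grad_V = np.zeros(input_shape)
    _, oh, ow = G.shape
    for q in range(out_c):
        for l in range(oh):        # output row; touches input rows l*s + m
            for n in range(ow):    # output col; touches input cols n*s + p
                for m in range(kh):
                    for p in range(kw):
                        grad_V[:, l*s + m, n*s + p] += K[q, :, m, p] * G[q, l, n]
    return grad_V
```

Note the q index on $g_{q,l,n}$: the gradient at an input unit sums contributions from all output channels.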

Data types

With the convolution operation, the network can process images of varying width and height: the kernel is simply applied a different number of times depending on the size of the input, so the same parameters work for inputs of any size.

Efficient convolution algorithms

Convolution is equivalent to converting both the input and the kernel to the frequency domain using a Fourier transform, performing point-wise multiplication of the two signals, and converting back to the time domain using an inverse Fourier transform. For some problem sizes, this can be faster than the naive implementation of discrete convolution.
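This equivalence is easy to demonstrate in 1-D with NumPy: zero-pad both signals to the full output length, multiply their Fourier transforms point-wise, and transform back.

```python
import numpy as np

# 1-D illustration: convolution via FFT equals direct convolution.
x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([1.0, -1.0, 0.5])

direct = np.convolve(x, k)  # full discrete convolution, length 4 + 3 - 1 = 6

n = len(x) + len(k) - 1     # pad to the full output length to avoid wrap-around
fft_based = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, fft_based)
```

For large kernels the FFT route costs O(n log n) instead of O(nk), which is why it wins for some problem sizes but not for the small kernels typical of convolutional networks.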

more information about data types

| type | single channel | multi-channel |
|------|----------------|---------------|
| 1-D | Audio waveform: the axis we convolve over corresponds to time. We discretize time and measure the amplitude of the waveform once per time step. | Skeleton animation data: animations of 3-D computer-rendered characters are generated by altering the pose of a "skeleton" over time. At each point in time, the pose of the character is described by a specification of the angles of each of the joints in the character's skeleton. Each channel in the data we feed to the convolutional model represents the angle about one axis of one joint. |
| 2-D | Audio data that has been preprocessed with a Fourier transform: we can transform the audio waveform into a 2-D array with different rows corresponding to different frequencies and different columns corresponding to different points in time. Using convolution over the time axis makes the model equivariant to shifts in time. Using convolution across the frequency axis makes the model equivariant to frequency, so that the same melody played in a different octave produces the same representation, but at a different height in the network's output. | Color image data: one channel contains the red pixels, one the green pixels, and one the blue pixels. The convolution kernel moves over both the horizontal and vertical axes of the image, conferring translation equivariance in both directions. |
| 3-D | Volumetric data: a common source of this kind of data is medical imaging technology, such as CT scans. | Color video data: one axis corresponds to time, one to the height of the video frame, and one to the width of the video frame. |

next

Theano or Recurrent and Recursive Nets
