deep_learning_month4_week1_convolution_model_step_by_step

This post walks through building the convolutional layers of a convolutional neural network (CNN) step by step, covering the convolution operation, pooling, and their backpropagation. Concrete examples and code implementations are provided to help readers understand how a CNN works.



Tags: machine learning, deep learning

The code has been uploaded to GitHub:
https://github.com/PerfectDemoT/my_deeplearning_homework


Note: this post explains how to build the convolutional layers of a CNN step by step, but it only contains the individual building-block functions and does not assemble them into a usable model. If you are interested, you can integrate them into a fairly capable model yourself (there is a small sketch at the end of the post showing how the pieces can be chained together).

It describes how to perform forward convolution and pooling (both max pooling and average pooling), how to compute the gradients (dA, dW, db) of the variables (A, W, b) during backpropagation, and how to backpropagate through the pooling layer.

1. First, let's look at the forward convolution operations

1. Import the packages and set up plotting

The code is as follows; nothing more to say about it.

import numpy as np
import h5py
import matplotlib.pyplot as plt

plt.rcParams['figure.figsize'] = (5.0, 4.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

np.random.seed(1)
2. Next comes the zero-padding operation

Padding simply adds a border of the specified number of pixels around each image. Here is an explanation (quoted from the assignment):

It allows you to use a CONV layer without necessarily shrinking the height and width of the volumes. This is important for building deeper networks, since otherwise the height/width would shrink as you go to deeper layers. An important special case is the “same” convolution, in which the height/width is exactly preserved after one layer.
It helps us keep more of the information at the border of an image. Without padding, very few values at the next layer would be affected by pixels at the edges of an image.

Now let's implement zero-padding; the code is below:

# GRADED FUNCTION: zero_pad

def zero_pad(X, pad):
    """
    Pad with zeros all images of the dataset X. The padding is applied to the height and width of an image, 
    as illustrated in Figure 1.

    Argument:
    X -- python numpy array of shape (m, n_H, n_W, n_C) representing a batch of m images
    pad -- integer, amount of padding around each image on vertical and horizontal dimensions

    Returns:
    X_pad -- padded image of shape (m, n_H + 2*pad, n_W + 2*pad, n_C)
    """

    ### START CODE HERE ### (≈ 1 line)
    X_pad = np.pad(X , ((0 , 0) , (pad , pad) , (pad , pad) , (0 , 0)) , 'constant' , constant_values = 0)
    # A note on this usage: X is 4-dimensional, so the pad-width argument is a tuple of four pairs;
    # (pad, pad) on the height/width axes adds pad rows/columns of zeros before and after along that axis.
    ### END CODE HERE ###

    return X_pad

A note on np.pad(): since X is 4-dimensional, the second argument is a tuple of four (before, after) pairs, one per axis. The height and width axes each get (pad, pad), meaning pad rows/columns of zeros are added on both sides, while the batch and channel axes get (0, 0) and are left unchanged.
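To make the argument structure concrete, here is a tiny, self-contained sketch (not part of the assignment) padding a small 2x2 array; the pair (1, 1) on an axis adds one row/column of zeros before and after along that axis:

import numpy as np

a = np.array([[1, 2],
              [3, 4]])
a_pad = np.pad(a, ((1, 1), (1, 1)), 'constant', constant_values=0)
print(a_pad)
# [[0 0 0 0]
#  [0 1 2 0]
#  [0 3 4 0]
#  [0 0 0 0]]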

Testing it produces the following:

np.random.seed(1)
x = np.random.randn(4, 3, 3, 2)
x_pad = zero_pad(x, 2)
print ("x.shape =", x.shape)
print ("x_pad.shape =", x_pad.shape)
print ("x[1,1] =", x[1,1])
print ("x_pad[1,1] =", x_pad[1,1])

fig, axarr = plt.subplots(1, 2)
axarr[0].set_title('x')
axarr[0].imshow(x[0,:,:,0])
axarr[1].set_title('x_pad')
axarr[1].imshow(x_pad[0,:,:,0])

Here is the output:

x.shape = (4, 3, 3, 2)
x_pad.shape = (4, 7, 7, 2)
x[1,1] = [[ 0.90085595 -0.68372786]
 [-0.12289023 -0.93576943]
 [-0.26788808  0.53035547]]
x_pad[1,1] = [[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]
3. A simple single convolution step

(This step may be a little underwhelming, because for now it is just an element-wise product with the filter, plus the bias, summed into a single scalar.)
So I'll go straight to the code; there isn't much to explain.

def conv_single_step(a_slice_prev, W, b):
    """
    Apply one filter defined by parameters W on a single slice (a_slice_prev) of the output activation 
    of the previous layer.

    Arguments:
    a_slice_prev -- slice of input data of shape (f, f, n_C_prev)
    W -- Weight parameters contained in a window - matrix of shape (f, f, n_C_prev)
    b -- Bias parameters contained in a window - matrix of shape (1, 1, 1)

    Returns:
    Z -- a scalar value, result of convolving the sliding window (W, b) on a slice x of the input data
    """

    ### START CODE HERE ### (≈ 2 lines of code)
    # Element-wise product between a_slice and W. Add bias.
    s = np.multiply(a_slice_prev, W) + b
    # Sum over all entries of the volume s
    Z = np.sum(s)
    ### END CODE HERE ###

    return Z

Test it:

np.random.seed(1)
a_slice_prev = np.random.randn(4, 4, 3)
W = np.random.randn(4, 4, 3)
b = np.random.randn(1, 1, 1)

Z = conv_single_step(a_slice_prev, W, b)
print("Z =", Z)

It looks like this:

Z = -23.16021220252078
4. Now for the real forward propagation of the convolutional layer

(This does not include the fully connected layers, of course.)

Note the following formulas:

$$n_H = \left\lfloor \frac{n_{H_{prev}} - f + 2 \times pad}{stride} \right\rfloor + 1$$

$$n_W = \left\lfloor \frac{n_{W_{prev}} - f + 2 \times pad}{stride} \right\rfloor + 1$$

$$n_C = \text{number of filters used in the convolution}$$
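As a quick worked example using the same numbers as the test further below ($n_{H_{prev}} = n_{W_{prev}} = 4$, $f = 2$, $pad = 2$, $stride = 1$):

$$n_H = n_W = \left\lfloor \frac{4 - 2 + 2 \times 2}{1} \right\rfloor + 1 = 7$$

so a (10, 4, 4, 3) input convolved with 8 filters yields a (10, 7, 7, 8) output.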

Also, pay attention to the "slice" (let's call it a matrix slice): the code below shows exactly how it is derived, and that is the part you really need to notice.
Here is the code:

# GRADED FUNCTION: conv_forward

def conv_forward(A_prev, W, b, hparameters):
    """
    Implements the forward propagation for a convolution function

    Arguments:
    A_prev -- output activations of the previous layer, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    W -- Weights, numpy array of shape (f, f, n_C_prev, n_C)
    b -- Biases, numpy array of shape (1, 1, 1, n_C)
    hparameters -- python dictionary containing "stride" and "pad"

    Returns:
    Z -- conv output, numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache of values needed for the conv_backward() function
    """

    ### START CODE HERE ###
    # Retrieve dimensions from A_prev's shape (≈1 line)
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape
    # Retrieve dimensions from W's shape (≈1 line)
    (f, f, n_C_prev, n_C) = W.shape
    # Retrieve information from "hparameters" (≈2 lines)
    stride = hparameters['stride']
    pad = hparameters['pad']

    # Compute the dimensions of the CONV output volume using the formula given above. Hint: use int() to floor. (≈2 lines)
    n_H = 1 + int((n_H_prev + 2 * pad - f) / stride)
    n_W = 1 + int((n_W_prev + 2 * pad - f) / stride)

    # Initialize the output volume Z with zeros. (≈1 line)
    Z = np.zeros((m, n_H, n_W, n_C))

    # Create A_prev_pad by padding A_prev
    A_prev_pad = zero_pad(A_prev, pad)

    # Now the actual convolution
    for i in range(m):                               # loop over the batch of training examples
        a_prev_pad = A_prev_pad[i]                   # select the ith training example's padded activation
        for h in range(n_H):                           # loop over vertical axis of the output volume
            for w in range(n_W):                       # loop over horizontal axis of the output volume
                for c in range(n_C):                   # loop over channels (= #filters) of the output volume

                    # Note: i, h, w, c above all start from 0

                    # Find the corners of the current "slice" (≈4 lines)
                    # locate the start and end positions of the current window
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f

                    # Use the corners to define the (3D) slice of a_prev_pad (See Hint above the cell). (≈1 line)
                    a_slice_prev = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]

                    # Convolve the (3D) slice with the correct filter W and bias b, to get back one output neuron. (≈1 line)
                    Z[i, h, w, c] = np.sum(np.multiply(a_slice_prev, W[:, :, :, c]) + b[:, :, :, c])

    ### END CODE HERE ###

    # Making sure your output shape is correct
    assert(Z.shape == (m, n_H, n_W, n_C))

    # Save information in "cache" for the backprop
    cache = (A_prev, W, b, hparameters)

    return Z, cache

The lines that compute vert_start / vert_end and horiz_start / horiz_end are where the matrix slice comes from: they locate the start and end positions of the current window.
(The remaining details are explained in the code comments, so read them carefully.)

Let's test the output:

np.random.seed(1)
A_prev = np.random.randn(10,4,4,3)
W = np.random.randn(2,2,3,8)
b = np.random.randn(1,1,1,8)
hparameters = {"pad" : 2,
               "stride": 1}

Z, cache_conv = conv_forward(A_prev, W, b, hparameters)
print("Z's mean =", np.mean(Z))
print("cache_conv[0][1][2][3] =", cache_conv[0][1][2][3])

The result is:

Z's mean = 0.15585932488906465
cache_conv[0][1][2][3] = [-0.20075807  0.18656139  0.41005165]
5. Next up: pooling

(Note that there is both max pooling and average pooling; according to Andrew Ng, max pooling is the one used most often.)

As before, here are the formulas for the output dimensions:

$$n_H = \left\lfloor \frac{n_{H_{prev}} - f}{stride} \right\rfloor + 1$$

$$n_W = \left\lfloor \frac{n_{W_{prev}} - f}{stride} \right\rfloor + 1$$

$$n_C = n_{C_{prev}}$$
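A quick check with the numbers used in the test below ($n_{H_{prev}} = n_{W_{prev}} = 4$, $f = 4$, $stride = 1$):

$$n_H = n_W = \left\lfloor \frac{4 - 4}{1} \right\rfloor + 1 = 1$$

which is why the pooled outputs shown later have spatial size $1 \times 1$.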

Max pooling and average pooling are described as follows:

Max-pooling layer: slides an (f, f) window over the input and stores the max value of the window in the output.

Average-pooling layer: slides an (f, f) window over the input and stores the average value of the window in the output.

(I'll assume everyone has watched the course videos, and the descriptions above already say what each does, so I won't elaborate further.)

The code is as follows:

# GRADED FUNCTION: pool_forward

def pool_forward(A_prev, hparameters, mode = "max"):
    """
    Implements the forward pass of the pooling layer

    Arguments:
    A_prev -- Input data, numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    hparameters -- python dictionary containing "f" and "stride"
    mode -- the pooling mode you would like to use, defined as a string ("max" or "average")

    Returns:
    A -- output of the pool layer, a numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache used in the backward pass of the pooling layer, contains the input and hparameters 
    """

    # Retrieve dimensions from the input shape
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape

    # Retrieve hyperparameters from "hparameters"
    f = hparameters["f"]
    stride = hparameters["stride"]

    # Define the dimensions of the output
    n_H = int(1 + (n_H_prev - f) / stride)
    n_W = int(1 + (n_W_prev - f) / stride)
    n_C = n_C_prev

    # Initialize output matrix A
    A = np.zeros((m, n_H, n_W, n_C))              

    ### START CODE HERE ###
    for i in range(m):                         # loop over the training examples
        for h in range(n_H):                     # loop on the vertical axis of the output volume
            for w in range(n_W):                 # loop on the horizontal axis of the output volume
                for c in range (n_C):            # loop over the channels of the output volume

                    # Find the corners of the current "slice" (≈4 lines)
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f

                    # Use the corners to define the current slice on the ith training example of A_prev, channel c. (≈1 line)
                    a_prev_slice = A_prev[i, vert_start:vert_end, horiz_start:horiz_end, c]

                    # Compute the pooling operation on the slice. Use an if statment to differentiate the modes. Use np.max/np.mean.
                    if mode == "max":
                        A[i, h, w, c] = np.max(a_prev_slice)
                    elif mode == "average":
                        A[i, h, w, c] = np.mean(a_prev_slice)

    ### END CODE HERE ###

    # Store the input and hparameters in "cache" for pool_backward()
    cache = (A_prev, hparameters)

    # Making sure your output shape is correct
    assert(A.shape == (m, n_H, n_W, n_C))

    return A, cache

As you can see, the beginning of pooling is very similar to convolution: both locate the filter window in the same way.
Also, we call np.max and np.mean here; if you are not familiar with them, it's worth looking up these two functions in the NumPy library.
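For reference, a minimal illustration of what the two calls compute on a small window (the values here are arbitrary, chosen just for this demo):

window = np.array([[1., 9.],
                   [3., 5.]])
print(np.max(window))    # 9.0 -- what max pooling keeps
print(np.mean(window))   # 4.5 -- what average pooling keeps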

Here is a call to the function:

np.random.seed(1)
A_prev = np.random.randn(2, 4, 4, 3)
hparameters = {"stride" : 1, "f": 4}

A, cache = pool_forward(A_prev, hparameters)
print("mode = max")
print("A =", A)
print()
A, cache = pool_forward(A_prev, hparameters, mode = "average")
print("mode = average")
print("A =", A)

The output is:

mode = max
A = [[[[ 1.74481176  1.6924546   2.10025514]]]


 [[[ 1.19891788  1.51981682  2.18557541]]]]

mode = average
A = [[[[-0.09498456  0.11180064 -0.14263511]]]


 [[[-0.09525108  0.28325018  0.33035185]]]]

2. Backpropagation

1. Computing dA, dW, and db

The formulas are:

$$dA \mathrel{+}= \sum_{h=0}^{n_H} \sum_{w=0}^{n_W} W_c \times dZ_{hw} \tag{1}$$

$$dW_c \mathrel{+}= \sum_{h=0}^{n_H} \sum_{w=0}^{n_W} a_{slice} \times dZ_{hw} \tag{2}$$

$$db \mathrel{+}= \sum_{h} \sum_{w} dZ_{hw} \tag{3}$$

Each of these is described as follows:

dA_prev : gradient of the cost with respect to the input of the conv layer (A_prev),
numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)

dW : gradient of the cost with respect to the weights of the conv layer (W)
numpy array of shape (f, f, n_C_prev, n_C)

db : gradient of the cost with respect to the biases of the conv layer (b)
numpy array of shape (1, 1, 1, n_C)

The code is as follows:

def conv_backward(dZ, cache):
    """
    Implement the backward propagation for a convolution function

    Arguments:
    dZ -- gradient of the cost with respect to the output of the conv layer (Z), numpy array of shape (m, n_H, n_W, n_C)
    cache -- cache of values needed for the conv_backward(), output of conv_forward()

    Returns:
    dA_prev -- gradient of the cost with respect to the input of the conv layer (A_prev),
               numpy array of shape (m, n_H_prev, n_W_prev, n_C_prev)
    dW -- gradient of the cost with respect to the weights of the conv layer (W),
          numpy array of shape (f, f, n_C_prev, n_C)
    db -- gradient of the cost with respect to the biases of the conv layer (b),
          numpy array of shape (1, 1, 1, n_C)
    """

    ### START CODE HERE ###
    # Retrieve information from "cache"
    (A_prev, W, b, hparameters) = cache

    # Retrieve dimensions from A_prev's shape
    (m, n_H_prev, n_W_prev, n_C_prev) = A_prev.shape

    # Retrieve dimensions from W's shape
    (f, f, n_C_prev, n_C) = W.shape

    # Retrieve information from "hparameters"
    stride = hparameters['stride']
    pad = hparameters['pad']

    # Retrieve dimensions from dZ's shape
    (m, n_H, n_W, n_C) = dZ.shape

    # Initialize dA_prev, dW, db with the correct shapes
    dA_prev = np.zeros((m, n_H_prev, n_W_prev, n_C_prev))                           
    dW = np.zeros((f, f, n_C_prev, n_C))
    db = np.zeros((1, 1, 1, n_C))

    # Pad A_prev and dA_prev
    A_prev_pad = zero_pad(A_prev, pad)
    dA_prev_pad = zero_pad(dA_prev, pad)

    for i in range(m):                       # loop over the training examples

        # select ith training example from A_prev_pad and dA_prev_pad
        a_prev_pad = A_prev_pad[i]
        da_prev_pad = dA_prev_pad[i]

        for h in range(n_H):                   # loop over vertical axis of the output volume
            for w in range(n_W):               # loop over horizontal axis of the output volume
                for c in range(n_C):           # loop over the channels of the output volume

                    # Find the corners of the current "slice"
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f

                    # Use the corners to define the slice from a_prev_pad
                    a_slice = a_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :]

                    # Update gradients for the window and the filter's parameters using the code formulas given above
                    da_prev_pad[vert_start:vert_end, horiz_start:horiz_end, :] += W[:,:,:,c] * dZ[i, h, w, c]
                    dW[:,:,:,c] += a_slice * dZ[i, h, w, c]
                    db[:,:,:,c] += dZ[i, h, w, c]

        # Set the ith training example's dA_prev to the unpadded da_prev_pad (Hint: use X[pad:-pad, pad:-pad, :])
        dA_prev[i, :, :, :] = dA_prev_pad[i, pad:-pad, pad:-pad, :]
    ### END CODE HERE ###

    # Making sure your output shape is correct
    assert(dA_prev.shape == (m, n_H_prev, n_W_prev, n_C_prev))

    return dA_prev, dW, db

The parameters dZ and cache are:

dZ : gradient of the cost with respect to the output of the conv layer (Z), numpy array of shape (m, n_H, n_W, n_C)
cache : cache of values needed for the conv_backward(), output of conv_forward()

The code follows the formulas above; read it carefully and you'll see that it all works out, though it is admittedly fairly involved, since there are quite a few matrix-slicing operations.

Let's look at the output:

np.random.seed(1)
dA, dW, db = conv_backward(Z, cache_conv)
print("dA_mean =", np.mean(dA))
print("dW_mean =", np.mean(dW))
print("db_mean =", np.mean(db))

The result is:

dA_mean = 9.60899067587

dW_mean = 10.5817412755

db_mean = 76.3710691956
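Note that the test above simply feeds Z in as dZ, so the printed means only confirm that the function runs. If you want more confidence that conv_backward really matches formulas (1)-(3), a small numerical gradient check is a handy sanity test. Below is a minimal sketch of such a check (my own addition, not part of the assignment): it takes the cost to be J = np.sum(Z), so dZ is a matrix of ones, and compares the analytic dW against a centered finite difference on a single filter weight.

np.random.seed(1)
A_prev = np.random.randn(2, 4, 4, 3)
W = np.random.randn(2, 2, 3, 4)
b = np.random.randn(1, 1, 1, 4)
hparameters = {"pad": 1, "stride": 1}

Z, cache = conv_forward(A_prev, W, b, hparameters)
dZ = np.ones_like(Z)                         # dJ/dZ when J = np.sum(Z)
dA_prev, dW, db = conv_backward(dZ, cache)

# Centered finite difference on one weight, W[0, 0, 0, 0]
eps = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0, 0, 0] += eps
W_minus[0, 0, 0, 0] -= eps
J_plus = np.sum(conv_forward(A_prev, W_plus, b, hparameters)[0])
J_minus = np.sum(conv_forward(A_prev, W_minus, b, hparameters)[0])
numeric = (J_plus - J_minus) / (2 * eps)

print(dW[0, 0, 0, 0], numeric)               # the two values should agree closely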
2. Backpropagation through the pooling layer
1. Max pooling

Next we create a function called create_mask_from_window. It takes a matrix slice (really just a small matrix) as input and returns a boolean matrix of the same shape: True at the position of the maximum value, False everywhere else.

Let's go straight to an example; looking at the output makes it easier to understand:

def create_mask_from_window(x):
    """
    Creates a mask from an input matrix x, to identify the max entry of x.

    Arguments:
    x -- Array of shape (f, f)

    Returns:
    mask -- Array of the same shape as window, contains a True at the position corresponding to the max entry of x.
    """

    ### START CODE HERE ### (≈1 line)
    mask = (x == np.max(x))
    ### END CODE HERE ###

    return mask

Test it:

np.random.seed(1)
x = np.random.randn(2,3)
mask = create_mask_from_window(x)
print('x = ', x)
print("mask = ", mask)

The result is:

x =  [[ 1.62434536 -0.61175641 -0.52817175]

 [-1.07296862  0.86540763 -2.3015387 ]]

mask =  [[ True False False]

 [False False False]]

That should make it much easier to understand.
By the way, this mask is ultimately used in the backward pass of max pooling: only the max entry influenced the forward output, so only that position should receive the gradient.
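One small detail worth knowing (my own note, not from the assignment): if the maximum value appears more than once in the window, the comparison marks every such position as True, so the gradient would be sent to all of those positions:

x = np.array([[1., 3.],
              [3., 0.]])
print(create_mask_from_window(x))
# [[False  True]
#  [ True False]]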


2. Average pooling

This next one is for average pooling.
We create a distribute_value function whose parameters are dz and the shape of the slice matrix; it returns a matrix of that shape in which every entry equals

$$\frac{dz}{shape\_H \times shape\_W}$$

In average pooling every entry of the window contributed equally to the output, so the gradient is spread evenly over the window. Let's look at an example; it makes this easier to understand:

def distribute_value(dz, shape):
    """
    Distributes the input value in the matrix of dimension shape

    Arguments:
    dz -- input scalar
    shape -- the shape (n_H, n_W) of the output matrix for which we want to distribute the value of dz

    Returns:
    a -- Array of size (n_H, n_W) for which we distributed the value of dz
    """

    ### START CODE HERE ###
    # Retrieve dimensions from shape (≈1 line)
    (n_H, n_W) = shape

    # Compute the value to distribute on the matrix (≈1 line)
    average = dz / (n_H * n_W)

    # Create a matrix where every entry is the "average" value (≈1 line)
    a = np.ones(shape) * average
    ### END CODE HERE ###

    return a

Test it:

a = distribute_value(2, (2,2))
print('distributed value =', a)

The result:

distributed value = [[ 0.5  0.5]

 [ 0.5  0.5]]

That covers the code for the inner workings of backpropagation through both kinds of pooling. Now we need to put them together; see below.

3. Putting the pooling backward pass together

Let's look at the code first:

def pool_backward(dA, cache, mode = "max"):
    """
    Implements the backward pass of the pooling layer

    Arguments:
    dA -- gradient of cost with respect to the output of the pooling layer, same shape as A
    cache -- cache output from the forward pass of the pooling layer, contains the layer's input and hparameters 
    mode -- the pooling mode you would like to use, defined as a string ("max" or "average")

    Returns:
    dA_prev -- gradient of cost with respect to the input of the pooling layer, same shape as A_prev
    """

    ### START CODE HERE ###

    # Retrieve information from cache (≈1 line)
    (A_prev, hparameters) = cache

    # Retrieve hyperparameters from "hparameters" (≈2 lines)
    stride = hparameters['stride']
    f = hparameters['f']

    # Retrieve dimensions from A_prev's shape and dA's shape (≈2 lines)
    m, n_H_prev, n_W_prev, n_C_prev = A_prev.shape
    m, n_H, n_W, n_C = dA.shape

    # Initialize dA_prev with zeros (≈1 line)
    dA_prev = np.zeros_like(A_prev)

    for i in range(m):                       # loop over the training examples

        # select training example from A_prev (≈1 line)
        a_prev = A_prev[i]

        for h in range(n_H):                   # loop on the vertical axis
            for w in range(n_W):               # loop on the horizontal axis
                for c in range(n_C):           # loop over the channels (depth)

                    # Find the corners of the current "slice" (≈4 lines)
                    vert_start = h * stride
                    vert_end = vert_start + f
                    horiz_start = w * stride
                    horiz_end = horiz_start + f

                    # Compute the backward propagation in both modes.
                    if mode == "max":

                        # Use the corners and "c" to define the current slice from a_prev (≈1 line)
                        a_prev_slice = a_prev[vert_start:vert_end, horiz_start:horiz_end, c]
                        # Create the mask from a_prev_slice (≈1 line)
                        mask = create_mask_from_window(a_prev_slice)
                        # Set dA_prev to be dA_prev + (the mask multiplied by the correct entry of dA) (≈1 line)
                        dA_prev[i, vert_start: vert_end, horiz_start: horiz_end, c] += mask * dA[i, h, w, c]

                    elif mode == "average":

                        # Get the value a from dA (≈1 line)
                        da = dA[i, h, w, c]   # the gradient at output position (h, w)
                        # Define the shape of the filter as fxf (≈1 line)
                        shape = (f, f)
                        # Distribute it to get the correct slice of dA_prev. i.e. Add the distributed value of da. (≈1 line)
                        dA_prev[i, vert_start: vert_end, horiz_start: horiz_end, c] += distribute_value(da, shape)

    ### END CODE ###

    # Making sure your output shape is correct
    assert(dA_prev.shape == A_prev.shape)

    return dA_prev

Personally I think there are quite a few points here that need attention, and there are a lot of parameters involved. (I'll write up a fuller explanation once I have sorted it all out myself; for now I'd rather not risk misleading anyone. Suggestions are very welcome. Almost all of my images are hosted on Jianshu, so you can reach me through the image links, or search for PerfectDemoT on Jianshu.)

Then the output:

np.random.seed(1)
A_prev = np.random.randn(5, 5, 3, 2)
hparameters = {"stride" : 1, "f": 2}
A, cache = pool_forward(A_prev, hparameters)
dA = np.random.randn(5, 4, 2, 2)

dA_prev = pool_backward(dA, cache, mode = "max")
print("mode = max")
print('mean of dA = ', np.mean(dA))
print('dA_prev[1,1] = ', dA_prev[1,1])  
print()
dA_prev = pool_backward(dA, cache, mode = "average")
print("mode = average")
print('mean of dA = ', np.mean(dA))
print('dA_prev[1,1] = ', dA_prev[1,1]) 

The output is:

mode = max

mean of dA =  0.145713902729

dA_prev[1,1] =  [[ 0.          0.        ]

 [ 5.05844394 -1.68282702]

 [ 0.          0.        ]]
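As mentioned at the top, these building blocks are not assembled into a full model in this post. For anyone who wants to see how they could fit together, here is a minimal sketch of a single CONV -> ReLU -> POOL forward pass built from the functions above (the ReLU step and the layer sizes are my own illustrative choices, not part of the assignment):

np.random.seed(1)
X = np.random.randn(2, 8, 8, 3)                    # a tiny batch of 8x8 "RGB images"
W1 = np.random.randn(3, 3, 3, 4) * 0.01            # 4 filters of size 3x3
b1 = np.zeros((1, 1, 1, 4))

# CONV (pad 1, stride 1) -> ReLU -> MAX-POOL (f = 2, stride 2)
Z1, conv_cache = conv_forward(X, W1, b1, {"pad": 1, "stride": 1})
A1 = np.maximum(0, Z1)                             # ReLU activation
P1, pool_cache = pool_forward(A1, {"f": 2, "stride": 2}, mode="max")

print("Z1.shape =", Z1.shape)                      # (2, 8, 8, 4)
print("P1.shape =", P1.shape)                      # (2, 4, 4, 4)

A full model would go on to flatten P1, add fully connected layers, and use conv_backward / pool_backward (plus the gradients of the dense layers) for training.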