Finished Gradient Checking


It took me quite a while to finally get around to this assignment; in fact, as long as I stay focused and don't let my mind wander, it can be finished quickly. Gradient checking is very practical. The thing to keep in mind is to run gradient checking before adding dropout, to confirm that backpropagation is correct; once dropout is applied, gradient checking should no longer be used, because randomly dropping units makes the cost hard to evaluate consistently, so the numerical check is no longer meaningful.
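
To recap what the check actually computes, here is a minimal one-parameter sketch (my own illustration, not the assignment's code), assuming a toy cost J(theta) = theta * x whose analytical gradient is simply x:

import numpy as np

def gradient_check(x, theta, epsilon=1e-7):
    # Two-sided numerical estimate of dJ/dtheta for the toy cost J(theta) = theta * x
    J_plus = (theta + epsilon) * x
    J_minus = (theta - epsilon) * x
    gradapprox = (J_plus - J_minus) / (2 * epsilon)

    grad = x  # analytical ("backprop") gradient of this toy cost

    # Relative difference; a value around 1e-7 or smaller means the two gradients agree
    numerator = np.linalg.norm(grad - gradapprox)
    denominator = np.linalg.norm(grad) + np.linalg.norm(gradapprox)
    return numerator / denominator

print(gradient_check(2.0, 3.0))  # very close to 0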

Convert the whole vector theta back into the corresponding individual parameters:

import numpy as np

# Unpack the vector theta into W1, b1, W2, b2, W3, b3 in the shapes needed for computation
def vector_to_dictionary(theta):
    """
    Unroll all our parameters dictionary from a single vector satisfying our specific required shape.
    """
    parameters = {}
    parameters["W1"] = theta[:20].reshape((5,4))
    parameters["b1"] = theta[20:25].reshape((5,1))
    parameters["W2"] = theta[25:40].reshape((3,5))
    parameters["b2"] = theta[40:43].reshape((3,1))
    parameters["W3"] = theta[43:46].reshape((1,3))
    parameters["b3"] = theta[46:47].reshape((1,1))

    return parameters


# Roll the gradients dW1, db1, dW2, db2, dW3, db3 into a single column vector
def gradients_to_vector(gradients):
    """
    Roll all our gradients dictionary into a single vector satisfying our specific required shape.
    """
    
    count = 0
    for key in ["dW1", "db1", "dW2", "db2", "dW3", "db3"]:
        # flatten parameter
        new_vector = np.reshape(gradients[key], (-1,1))
        
        if count == 0:
            theta = new_vector
        else:
            theta = np.concatenate((theta, new_vector), axis=0)
        count = count + 1

    return theta
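
For context, here is roughly how these two helpers fit into the overall n-dimensional check. This is only a sketch following the assignment's structure; dictionary_to_vector and forward_propagation_n (assumed here to return the cost plus a cache) are provided by the assignment and may differ in detail:

def gradient_check_n(parameters, gradients, X, Y, epsilon=1e-7):
    # Flatten the parameters and the backprop gradients into matching column vectors
    parameters_values, _ = dictionary_to_vector(parameters)
    grad = gradients_to_vector(gradients)
    num_parameters = parameters_values.shape[0]
    gradapprox = np.zeros((num_parameters, 1))

    for i in range(num_parameters):
        # Perturb parameter i up and down by epsilon and re-evaluate the cost
        theta_plus = np.copy(parameters_values)
        theta_plus[i][0] += epsilon
        J_plus, _ = forward_propagation_n(X, Y, vector_to_dictionary(theta_plus))

        theta_minus = np.copy(parameters_values)
        theta_minus[i][0] -= epsilon
        J_minus, _ = forward_propagation_n(X, Y, vector_to_dictionary(theta_minus))

        gradapprox[i] = (J_plus - J_minus) / (2 * epsilon)

    # Relative difference between backprop gradients and the numerical estimate
    difference = np.linalg.norm(grad - gradapprox) / (np.linalg.norm(grad) + np.linalg.norm(gradapprox))
    return difference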

I finished the second set of English competition problems. There has been a lot going on lately, so my plans keep getting disrupted, and I still haven't worked through a whole set in one continuous sitting, which is bad for improving my speed.

One question:

parameters_values, _ = dictionary_to_vector(parameters)

What does the underscore after the comma mean here? I searched Baidu but couldn't find an answer.
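
One note on this: in Python, `_` is just an ordinary variable name that is conventionally used for a return value you don't need. dictionary_to_vector returns two values (presumably the flattened parameters plus something like the key order), and only the first is kept. A tiny self-contained example of the convention:

# `_` is a normal name; assigning to it just signals "this value is not needed"
def min_and_max(values):
    return min(values), max(values)

smallest, _ = min_and_max([3, 1, 4, 1, 5])  # keep the minimum, discard the maximum
print(smallest)  # prints 1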
