Question 1
Suppose w is the weight on some connection in a neural network. The network is trained using gradient descent until the learning converges. However, the dataset consists of two mini-batches, which differ from each other somewhat. As usual, we alternate between the mini-batches for our gradient calculations, and that has implications for what happens after convergence. We plot the change of w as training progresses. Which of the following scenarios shows that convergence has occurred?

Notice that we're plotting the change in w, as opposed to w itself.

Note that in the plots below, each iteration refers to a single step of steepest descent on a single mini-batch.
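To make the scenario concrete, here is a minimal simulation sketch (my own illustration, not part of the quiz): a single weight w, two hypothetical mini-batches whose quadratic losses have slightly different minima (targets 1.0 and 1.2), and an assumed learning rate of 0.4. After convergence the change in w does not go to zero; it flips sign every iteration with a constant magnitude, because the two mini-batches keep pulling w in opposite directions.

```python
import numpy as np

targets = [1.0, 1.2]   # each mini-batch "wants" a slightly different w
lr = 0.4               # hypothetical learning rate
w = 0.0

changes = []
for it in range(40):
    t = targets[it % 2]      # alternate between the two mini-batches
    grad = w - t             # gradient of the batch loss 0.5 * (w - t)**2
    dw = -lr * grad          # one step of steepest descent
    w += dw
    changes.append(dw)

# After convergence w stops drifting, but the per-iteration change keeps
# flipping sign with a constant magnitude (about +/-0.05 here), because the
# two mini-batches pull w in opposite directions.
print(np.round(changes[-6:], 4))
```

Running this prints the last few changes, which alternate between roughly +0.05 and -0.05 rather than shrinking to zero.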
Question 2
Suppose you are using mini-batch gradient descent to train some neural net on a large dataset. You have to decide on the learning rate, the weight initialization, how to preprocess the inputs, and so on. You try some values for these and find that the value of the objective function on the training set decreases smoothly but very slowly. What could be causing this? Check all that apply.
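One cause consistent with a smooth but very slow decrease is a learning rate that is too small. The toy comparison below is an illustration I added (the quadratic loss and the two rates, 0.01 and 0.3, are assumptions, not values from the quiz); it shows the same optimizer crawling with the small rate and making rapid progress with the larger one.

```python
def run(lr, steps=50, w0=5.0):
    """Steepest descent on the toy loss 0.5 * w**2; returns the loss curve."""
    w, losses = w0, []
    for _ in range(steps):
        grad = w            # gradient of 0.5 * w**2
        w -= lr * grad
        losses.append(0.5 * w ** 2)
    return losses

slow = run(lr=0.01)   # smooth but barely moves in 50 steps
fast = run(lr=0.3)    # smooth and much faster on this toy problem

print("lr=0.01:", [round(l, 3) for l in slow[::10]])
print("lr=0.30:", [round(l, 3) for l in fast[::10]])
```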
Question 3
Four datasets are shown below. Each dataset has two input values (plotted below) and a target value (not shown). Each point in the plots denotes one training case. Assume that we are solving a classification problem. Which of the following datasets would most likely be the easiest to learn with a neural net?
Question 4
Claire is training a neural net using mini-batch gradient descent. She chose a particular learning rate and found that the training error decreased as more iterations of training were performed, as shown here in blue:
She was not sure if this was the best she could do, so she tried a bigger learning rate. Which of the following error curves (shown in red) might she observe now? Select the two most likely plots.

Note that in the plots below, each iteration refers to a single step of steepest descent on a single mini-batch.
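As a rough sketch of the possibilities (my own toy example, not the course's plots): on a simple quadratic loss, a moderately bigger learning rate makes the error fall faster, while a rate that is too big for the curvature makes the error oscillate or grow. The specific rates below (0.1, 0.8, 2.5) are illustrative assumptions.

```python
def error_curve(lr, steps=30, w0=4.0):
    """Training error over iterations on the toy loss 0.5 * w**2."""
    w, errs = w0, []
    for _ in range(steps):
        w -= lr * w          # gradient of 0.5 * w**2 is w
        errs.append(0.5 * w ** 2)
    return errs

original = error_curve(lr=0.1)   # slow, steady decrease (like the blue curve)
faster   = error_curve(lr=0.8)   # bigger rate: reaches a low error sooner
too_big  = error_curve(lr=2.5)   # far too big: the error grows every step

for name, errs in [("lr=0.1", original), ("lr=0.8", faster), ("lr=2.5", too_big)]:
    print(name, [round(e, 2) for e in errs[:5]], "...")
```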
Question 5
In the lectures, we discussed two kinds of gradient descent algorithms: mini-batch and full-batch. For which of the following problems is mini-batch gradient descent likely to be a lot better than full-batch gradient descent?
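A small sketch of the kind of problem where mini-batches shine, a large and highly redundant dataset, using made-up data (nothing here comes from the course): when the dataset is many near-copies of the same cases, the gradient computed on a small mini-batch is essentially the full-batch gradient at a tiny fraction of the cost, so many cheap updates beat one expensive one.

```python
import numpy as np

rng = np.random.default_rng(0)
base_x = rng.normal(size=(100, 5))
base_y = base_x @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])
# Build a big, redundant dataset by repeating the same 100 cases 1000 times.
X = np.tile(base_x, (1000, 1))
y = np.tile(base_y, 1000)

w = np.zeros(5)

def grad(Xb, yb, w):
    # Gradient of the mean squared error of a linear model on batch (Xb, yb).
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

full_grad = grad(X, y, w)              # touches all 100,000 cases
mini_grad = grad(X[:100], y[:100], w)  # touches only 100 cases

print("relative difference:",
      np.linalg.norm(full_grad - mini_grad) / np.linalg.norm(full_grad))
```

Because the big dataset here is an exact tiling of the base cases, the printed relative difference is essentially zero; real redundancy is less extreme, but the effect is the same in kind.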