[Introduction to Deep Learning] NNDL Study Notes (Part 1)

Preface

http://neuralnetworksanddeeplearning.com

This post contains my study notes for the e-book above. The notes are now mostly complete; I will fill in Softmax and some of the exercises when I find time.

Chapter 1: Using Neural Nets to Recognize Handwritten Digits

A neural network uses the examples to automatically infer rules for recognizing handwritten digits.

Two important types of artificial neuron: the perceptron and the sigmoid neuron.

The standard learning algorithm for neural networks: stochastic gradient descent.

Perceptrons

1. A method for weighing evidence to make decisions, and for computing elementary logical functions.

A perceptron takes several binary inputs, x_1, x_2, \dots, and produces a single binary output:

The neuron's output, 0 or 1, is determined by whether the weighted sum w\cdot x\equiv \sum_j w_jx_j is less than or greater than some threshold value. The threshold is a real number which is a parameter of the neuron.

 

output=\begin{cases} 0 & \text{if } w\cdot x+b\leq 0\\ 1 & \text{if } w\cdot x+b>0 \end{cases}

Perceptrons are also universal for computation.
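For example, the book shows that a perceptron with weights -2, -2 and bias 3 computes NAND, and NAND gates alone are universal for computation. A minimal sketch (the function name is mine):

```python
# A perceptron with weights (-2, -2) and bias 3 computes NAND, the example
# used in the book; NAND alone is universal for computation.
def perceptron(x, w, b):
    # Binary output: 1 if w . x + b > 0, else 0.
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w=(-2, -2), b=3))  # outputs 1, 1, 1, 0
```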

Sigmoid Neurons

1. Crucial fact to learn: a small change in a weight (or bias) causes only a small change in the output.

activation function f(w\cdot x+b); for the sigmoid neuron, f is the sigmoid function \sigma:

output=\frac{1}{1+exp(-\sum_jw_j x_j-b)}=\frac{1}{1+exp(-w\cdot x-b)}

\Delta output is approximately a linear function of the changes \Delta w_j and \Delta b.
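A minimal NumPy sketch of a sigmoid neuron's output (the weights, bias, and input values here are arbitrary illustrations, not from the book):

```python
import numpy as np

def sigmoid(z):
    # The sigmoid function; np.exp applies elementwise, so z may be a vector.
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.6, -0.4])  # illustrative weights
b = 0.9                    # illustrative bias
x = np.array([1.0, 0.0])   # inputs no longer need to be binary
print(sigmoid(np.dot(w, x) + b))  # a smooth value between 0 and 1 (~0.82 here)
```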

Exercises

1. Suppose we take all the weights and biases in a network of perceptrons and multiply them by a positive constant c > 0. The behavior of the network doesn't change, because multiplying by c > 0 never changes the sign of w\cdot x+b, and a perceptron's output depends only on that sign.

2. The simulation can fail when w\cdot x+b=0 for some input: then \sigma(c(w\cdot x+b))=\sigma(0)=\frac{1}{2} no matter how large c is, whereas the perceptron outputs 0.
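A quick numeric check of both cases (my own illustration, not from the book): as c grows, \sigma(cz) approaches the perceptron's step behavior whenever z = w\cdot x+b \neq 0, but stays at 1/2 when z = 0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in [0.1, -0.1, 0.0]:  # z stands for w . x + b
    print(z, [sigmoid(c * z) for c in (1, 10, 1000)])
# z = 0.1  -> approaches 1 as c grows (matches the perceptron)
# z = -0.1 -> approaches 0 as c grows (matches the perceptron)
# z = 0.0  -> stays at 0.5 for every c (the perceptron would output 0)
```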

The Architecture of a NN

1. MLPs = multilayer perceptrons

2. Feedforward NN vs. recurrent NN (recurrent networks allow feedback loops, but a neuron's output only affects its input at some later time).

A Simple Network to Classify Handwritten Digits

1. Learning with gradient descent

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates y(x) for all training inputs x. To quantify how well we're achieving this goal we define a cost function (sometimes referred to as a loss or objective function).

Quadratic cost function / mean squared error (MSE): C(w,b)\equiv \frac{1}{2n} \sum_x \|y(x)-a\|^2, where n is the number of training inputs and a is the network's output for input x.
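A direct NumPy translation of this cost (a sketch; the helper name and toy values are mine):

```python
import numpy as np

def mse_cost(outputs, targets):
    # C = (1/2n) * sum over x of ||y(x) - a||^2, with n training inputs.
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2
               for a, y in zip(outputs, targets)) / (2.0 * n)

outputs = [np.array([0.8, 0.1]), np.array([0.3, 0.9])]  # network outputs a
targets = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # desired outputs y(x)
print(mse_cost(outputs, targets))  # 0.0375
```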

 

Suppose in particular that C is a function of m variables, v_1,\dots,v_m. Then \Delta C \approx \nabla C \cdot \Delta v.

\Delta v = -\eta\nabla C, \qquad v' = v - \eta\nabla C
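A minimal sketch of repeatedly applying the update rule v' = v - \eta\nabla C, on a toy cost C(v)=\|v\|^2 of my own choosing:

```python
import numpy as np

def grad_C(v):
    # Gradient of the toy cost C(v) = ||v||^2, minimized at v = 0.
    return 2 * v

v = np.array([3.0, -4.0])  # arbitrary starting point
eta = 0.1                  # learning rate
for _ in range(100):
    v = v - eta * grad_C(v)  # the update v' = v - eta * grad C
print(v)  # very close to the minimum at (0, 0)
```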

One problem: we need to compute the gradients \nabla C_x separately for each training input x, which is slow when the number of inputs is large.

Solution: stochastic gradient descent.

Estimate the gradient by averaging \nabla C_x over a mini-batch of m randomly chosen training inputs; this is a commonly used and powerful technique:

 \frac{\sum_{j=1}^{m} \nabla C_{X_j}}{m} \approx \frac{\sum_{x} \nabla C_x}{n} = \nabla C
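The following sketch (my own toy least-squares example, not the book's) checks that averaging \nabla C_x over a random mini-batch approximates the full-data gradient:

```python
import numpy as np

np.random.seed(0)
X, t = np.random.randn(1000, 2), np.random.randn(1000)  # toy inputs/targets
w = np.zeros(2)

def grad_Cx(w, x, tx):
    # Gradient (w.r.t. w) of the per-example cost C_x = 0.5 * (w . x - tx)^2.
    return (np.dot(w, x) - tx) * x

full_grad = np.mean([grad_Cx(w, x, tx) for x, tx in zip(X, t)], axis=0)
idx = np.random.choice(len(X), size=10, replace=False)  # one mini-batch, m = 10
mini_grad = np.mean([grad_Cx(w, X[j], t[j]) for j in idx], axis=0)
print(full_grad, mini_grad)  # the mini-batch mean roughly tracks the full mean
```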

2. Ball-mimicking variations

They have advantages, but also a major disadvantage: it turns out to be necessary to compute second partial derivatives of C, and this can be quite costly.

Exercises

An extreme version of gradient descent is to use a mini-batch size of just 1. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do).

One advantage: learning is faster, since each update requires only a single gradient computation.

One disadvantage: a single training input is not sufficient to represent the whole data set, so the gradient estimate is noisy, and learning depends heavily on the order in which inputs are presented.

Implementing the network to classify digits

Implemented with Python 2.7 and NumPy.

1. Network class

If w is the matrix of weights connecting the second and third layers, and a is the vector of activations of the second layer, then the vector of activations of the third layer is a'=\sigma(wa+b).

Vectorizing: apply the function \sigma elementwise to every entry in the vector wa+b.
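This is essentially how the book's network.py initializes the weights and computes a forward pass (condensed here; the full class also implements SGD and backpropagation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class Network(object):
    def __init__(self, sizes):
        # e.g. sizes = [784, 30, 10]: input, hidden, and output layer sizes.
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        # Apply a' = sigmoid(w a + b) layer by layer and return the output.
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a) + b)
        return a
```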

2. hyper-parameters

The number of epochs of training, the mini-batch size, and the learning rate η.
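These three hyper-parameters are the arguments of the book's Network.SGD method; chapter 1 trains with a call like the following (mnist_loader is the helper module from the book's code repository):

```python
import mnist_loader  # helper module shipped with the book's code

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = Network([784, 30, 10])
# 30 epochs, mini-batch size 10, learning rate eta = 3.0
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)
```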

3. SVM (support vector machine)

Python library: scikit-learn, which provides a simple Python interface to a fast C-based library for SVMs known as LIBSVM.
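A self-contained sketch of this baseline, using scikit-learn's small built-in digits dataset so it runs as-is (the book's accompanying script applies the same svm.SVC() idea to full MNIST):

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

# Small built-in digits set stands in for MNIST here.
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0)

clf = svm.SVC()  # default SVM classifier, backed by LIBSVM
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # baseline accuracy with no tuning
```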

sophisticated algorithm ≤ simple learning algorithm + good training data.

Toward Deep Learning

Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks.
