1. Binary Classification
1.1 Logistic Regression
Given a picture, you want to know whether it is a cat picture. If yes, it is labeled y = 1; otherwise y = 0.
A single training example is a pair (x, y), where x ∈ R^(n_x) is the feature vector and y ∈ {0, 1} is the label.
An entire training set consists of m such pairs: {(x^(1), y^(1)), (x^(2), y^(2)), ..., (x^(m), y^(m))}.
- m: the number of samples;
- m_train: the number of training samples;
- m_test: the number of test samples.
For implementation it is convenient to stack the inputs column-wise into a matrix X of shape (n_x, m) and the labels into a row vector Y of shape (1, m), i.e. Y belongs to a 1*m space. (This is different from what I learned; isn't the data usually an m*n matrix? Here each column, not each row, is one example.)
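A minimal sketch of these shapes in numpy (the feature count and sample count below are just made-up numbers):

import numpy as np

n_x, m = 4, 10                            # hypothetical feature and sample counts
X = np.random.randn(n_x, m)               # each column is one example x^(i)
Y = np.random.randint(0, 2, size=(1, m))  # labels stacked as a 1 x m row vector

print(X.shape)  # (4, 10)
print(Y.shape)  # (1, 10)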
So, let's briefly discuss logistic regression below:
Given an input x ∈ R^(n_x) (the input data), we need a label y for each x, forming pairs {(x, y)}.
But what if you want y-hat to be a probability label, i.e. the probability that the result is 1? For example, you want to know the probability that tomorrow is sunny (not rainy). How to do that?
(First, y-hat marks the probability of 1, not the probability of 0. Second, I want to output a probability, not a plain number: a probability lies in (0, 1), while a plain number has no fixed range. So how do we shrink a number with unbounded range into a probability in (0, 1)?)
We know that the range of w^T x + b is -∞ to +∞. The problem becomes how to map that result into the (0, 1) interval. Notice the sigmoid function from the picture above. The sigmoid function sig(z) = 1 / (1 + e^(-z)) has exactly this ability. Taking z = w^T x + b: when z is close to +∞ (a large positive number), e^(-z) -> 0, so sig -> 1; by contrast, sig -> 0 when z is close to -∞ (a large negative number); and when z = 0, sig = 0.5. So we can set y = 1 when sig > 0.5 and y = 0 when sig < 0.5.
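Below is a minimal numpy sketch of the sigmoid function (the values fed to it are just illustrative):

import numpy as np

def sigmoid(z):
    # sig(z) = 1 / (1 + e^(-z)): squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0))    # 0.5, the decision boundary
print(sigmoid(10))   # close to 1 -> predict y = 1
print(sigmoid(-10))  # close to 0 -> predict y = 0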
Cost function
Loss function definition:
One thing you could do is define a loss between the value y-hat that your algorithm outputs and the true label y. (That is, look at the gap between the predicted value and the real value; it should certainly be as small as possible.)
Some important info above:
- Given a set of training examples {(x^(i), y^(i))}, you want your prediction y-hat to be close to y. (Only then does the prediction count as accurate.)
- Find a loss function L(y-hat, y) which measures the difference between y-hat and the real y; here it is the cross-entropy loss L(y-hat, y) = -(y log(y-hat) + (1 - y) log(1 - y-hat)).
- When y = 1, y-hat should be predicted as 1. In that case L = -log(y-hat); for the loss to be as small as possible, log(y-hat) should be as large as possible, so y-hat should be as large as possible, i.e. close to 1.
- Symmetrically, when y = 0, L = -log(1 - y-hat), so y-hat should be close to 0.
- The loss function deals with a single training example; the cost function J(w, b) = (1/m) Σ_i L(y-hat^(i), y^(i)) averages the loss over all m samples (see the sketch below).
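A small sketch of the loss and cost in numpy, assuming A holds the predictions y-hat^(i) and Y the true labels (the variable and function names are mine, not from the course):

import numpy as np

def loss(a, y):
    # cross-entropy loss for a single example
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def cost(A, Y):
    # cost = average loss over all m examples; A and Y have shape (1, m)
    m = Y.shape[1]
    return np.sum(loss(A, Y)) / m

A = np.array([[0.9, 0.2, 0.7]])  # made-up predictions
Y = np.array([[1, 0, 1]])        # made-up labels
print(cost(A, Y))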
Gradient Descent
- When w and b change, the cost function changes. So the question becomes: find good w and b that make the cost function as small as possible.
- α is the learning rate; it controls how big a step we take on each iteration of gradient descent.
- In Python code, dw represents the derivative dJ(w)/dw (and likewise db represents dJ(w, b)/db).
So, the gradient-descent updates for w and b are: w := w - α·dJ/dw and b := b - α·dJ/db. (In other words, take the partial derivative with respect to w and b separately, then subtract the scaled partial derivative step by step.)
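As code, one update step looks roughly like this (all the numbers below are placeholder values, just to make the snippet runnable):

alpha = 0.01          # learning rate (hypothetical value)
w, b = 0.5, 0.0       # current parameters (hypothetical values)
dw, db = 0.2, -0.1    # pretend these are dJ/dw and dJ/db from the current iteration
w = w - alpha * dw    # w := w - alpha * dJ/dw
b = b - alpha * db    # b := b - alpha * dJ/db

In practice this step is repeated many times, recomputing dw and db on each iteration.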
Computation Graph
When you want to compute a function J(a, b, c) = 3(a + bc), you can define intermediate variables u = bc and v = a + u, so that J = 3v. Then you can draw a computation graph: the blue arrows represent forward propagation, and the red arrows represent backward propagation. Through backward propagation we can compute the derivative of J with respect to every input and hidden node. Why do we want these derivatives? A derivative tells us how much the function J changes when one factor changes. Below, a worked calculation introduces the method.
OK, let's compute from right to left. When v changes by 1, J changes by 3, so dJ/dv = 3. When a changes by 1, v also changes by 1 and J changes by 3, so dJ/da is 3 as well; this is because dJ/da = (dJ/dv)·(dv/da) = 3·1 = 3. This is the chain rule. In the same way, dJ/du = 3, dJ/db = (dJ/du)·(du/db) = 3c, and dJ/dc = (dJ/du)·(du/dc) = 3b. So, multiplying the derivative of each factor from right to left gives us how a change in each input relates to a change in the function.
In Python naming, since the numerator of every derivative is dJ, we drop it: dJ/dv is simply written dv, and the rest follow the same pattern (da, du, db, dc).
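A tiny sketch of the forward and backward passes for J(a, b, c) = 3(a + bc), using the dv/da/... naming convention above (the input values 5, 3, 2 are just made up for the example):

a, b, c = 5.0, 3.0, 2.0   # hypothetical inputs

# forward propagation
u = b * c        # u = bc
v = a + u        # v = a + u
J = 3 * v        # J = 3v

# backward propagation (chain rule); names drop the dJ numerator
dv = 3.0         # dJ/dv
da = dv * 1.0    # dJ/da = dJ/dv * dv/da
du = dv * 1.0    # dJ/du = dJ/dv * dv/du
db = du * c      # dJ/db = dJ/du * du/db = 3c
dc = du * b      # dJ/dc = dJ/du * du/dc = 3b

print(da, db, dc)  # 3.0 6.0 9.0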
Let's take logistic regression as an example.
So, first we list all the formulas needed to calculate the loss function of logistic regression: z = w1·x1 + w2·x2 + b, a = y-hat = sig(z), and L(a, y) = -(y log a + (1 - y) log(1 - a)). Then we can draw the computation graph. w1, w2, and b are the parameters we need to refine, and they determine the model quality.
So, in order to calculate the derivatives of L(a, y) with respect to w1, w2, and b, i.e. dL/dw1, dL/dw2, and dL/db, we can use backward propagation:
- Calculate da = dL(a, y)/da = -y/a + (1 - y)/(1 - a).
- Calculate dz = dL/dz = da · (da/dz), where da/dz = a(1 - a) is the derivative of the sigmoid function. So dz = a - y.
- Calculate dw1 = x1·dz, dw2 = x2·dz, and db = dz.
So, we can see how to calculate, and then optimize, the parameters using the computation graph. (The chain-rule treatment of the composite derivatives is explained very well here and makes the formulas easy to understand.)
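A single-example sketch of these backward-propagation formulas, with two features x1 and x2 (all the numeric values are made up):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y = 1.0, 2.0, 1.0   # one training example (hypothetical values)
w1, w2, b = 0.1, -0.2, 0.0  # current parameters (hypothetical values)

# forward propagation
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)

# backward propagation
dz = a - y       # dL/dz
dw1 = x1 * dz    # dL/dw1
dw2 = x2 * dz    # dL/dw2
db = dz          # dL/db

print(dw1, dw2, db)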
Gradient Descent over m samples
When the training set has m samples indexed 1, 2, 3, ..., m, we use (x^(i), y^(i)) to represent sample i, and dw^(i), db^(i) to represent the derivatives computed on sample i. For the whole training set, the gradients should be the average over every sample: dw = (1/m) Σ_i dw^(i) and db = (1/m) Σ_i db^(i). (Average the derivatives of the parameters over every sample.)
Python computation outline:
This has two drawbacks: you have to write two for loops (one over the m samples and one over the features), and for loops degrade performance. What can be done about it? Vectorization (see the sketch below).
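A vectorized sketch of one pass over all m samples, assuming X has shape (n_x, m) and Y has shape (1, m) as above (the function name propagate and the toy data are my own, not from the course):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    # w: (n_x, 1), b: scalar, X: (n_x, m), Y: (1, m)
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)   # forward pass for all samples at once, shape (1, m)
    cost = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m
    dZ = A - Y                        # backward pass, shape (1, m)
    dw = np.dot(X, dZ.T) / m          # (n_x, 1): per-sample gradients averaged over m
    db = np.sum(dZ) / m
    return dw, db, cost

# toy usage
n_x, m = 4, 10
w, b = np.zeros((n_x, 1)), 0.0
X = np.random.randn(n_x, m)
Y = np.random.randint(0, 2, size=(1, m))
dw, db, cost = propagate(w, b, X, Y)
print(dw.shape, db, cost)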
Broadcasting in Python
import numpy as np
A = np.array([[56.0,0.0,4.4,68.0],
[1.2,104.0,52.0,8.0],
[1.8,135.0,99.0,0.9]])
print(A)
[[ 56. 0. 4.4 68. ]
[ 1.2 104. 52. 8. ]
[ 1.8 135. 99. 0.9]]
cal = A.sum(axis=0) # sum vertically (down the columns)
# this reads as a row, because it is a 1-D array; its shape is (4,), not a 2-D (1, n) matrix
cal
array([ 59. , 239. , 155.4, 76.9])
# reshape into a 2-D row layout: each sub-array represents one row
cal = cal.reshape(1,4)
cal
array([[ 59. , 239. , 155.4, 76.9]])
# for the vertical (column) layout of a 2-D array
cal_verti = cal.reshape(4,1)
cal_verti
array([[ 59. ],
[239. ],
[155.4],
[ 76.9]])
percentage = 100 * A/cal
percentage
array([[94.91525424, 0. , 2.83140283, 88.42652796],
[ 2.03389831, 43.51464435, 33.46203346, 10.40312094],
[ 3.05084746, 56.48535565, 63.70656371, 1.17035111]])
# adding a constant
B = np.array([1,2,3,4])
B + 100
array([101, 102, 103, 104])
c = np.array([[1,2,3],[4,5,6]])
c1 = np.array([100,200,300])
c2 = np.array([[100],[200]])
print(c + c1)
print(c+c2)
[[101 202 303]
[104 205 306]]
[[101 102 103]
[204 205 206]]
# that is, when two arrays have different shapes, numpy's add/subtract/multiply automatically broadcasts them into matching shapes
a = np.random.randn(5)
print(a)
# this is a (5,) rank-1 vector with no column dimension; it's neither a row vector nor a column vector
a.shape
(5,)
print(a.T) # looks exactly the same as a
[ 0.61129989 -0.48008827 1.39754925 -0.90183129 0.13849732]
# so it's best not to use this form; add the column dimension explicitly so it becomes a 2-D matrix
b = np.random.randn(5,1)
print(b)
[[ 1.46909732]
[-0.70696235]
[ 0.50947828]
[-0.48711335]
[-1.61225188]]
# now the transpose works as expected, because it is a 2-D array
print(b.T)
[[ 1.46909732 -0.70696235 0.50947828 -0.48711335 -1.61225188]]
# so don't use rank-1 vectors.
a = a.reshape(5,1)        # convert the rank-1 array into an explicit column vector
assert(a.shape == (5,1))  # use assertions like this to check vector shapes