Deep Learning
Key Points
- Types of learning.
- Semi-supervised learning: some of the data has labels, some does not.
- Transfer learning: learning that also makes use of data not directly related to the task.
- Unsupervised learning: learning without labels.
- Structured learning: both the input and the output are structured objects. The output can be an image, speech, a sentence, etc., which makes the problem more complex.
- Reinforcement learning: learning from evaluative feedback (reward), which is closer to how humans actually learn.
- Learning-rate tuning: decrease the learning rate as the number of epochs grows.
- Adagrad: adapts the learning rate per parameter.
$$w_{t+1} = w_{t}-\frac{\eta_{t}}{\sigma_{t}}g^{t}, \qquad \sigma_{t}= \sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^{i})^2}$$
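A minimal NumPy sketch of this Adagrad update for a single scalar parameter; the quadratic toy loss and the names (`adagrad`, `grad_fn`) are illustrative assumptions, not from the notes.

```python
import numpy as np

def adagrad(grad_fn, w0, eta=1.0, steps=100, eps=1e-8):
    w = w0
    sum_sq = 0.0                               # running sum of squared gradients
    for t in range(steps):
        g = grad_fn(w)
        sum_sq += g ** 2
        eta_t = eta / np.sqrt(t + 1)           # decaying learning rate eta_t
        sigma_t = np.sqrt(sum_sq / (t + 1))    # RMS of past gradients sigma_t
        w = w - (eta_t / (sigma_t + eps)) * g  # equivalent to eta * g / sqrt(sum_sq)
    return w

# toy example: minimize (w - 3)^2, whose gradient is 2 * (w - 3)
print(adagrad(lambda w: 2 * (w - 3), w0=0.0))  # approaches 3
```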
- Backpropagation: the chain rule, applied layer by layer (a small sketch follows).
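A tiny sketch of the chain rule on a two-layer scalar "network"; the sigmoid/square-loss choice and all variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# forward pass: x -> a = sigmoid(w1 * x) -> y = w2 * a, loss = (y - t)^2
x, t = 0.5, 1.0
w1, w2 = 0.3, -0.2
a = sigmoid(w1 * x)
y = w2 * a
loss = (y - t) ** 2

# backward pass: apply the chain rule factor by factor
dL_dy = 2 * (y - t)
dL_dw2 = dL_dy * a                 # dy/dw2 = a
dL_da = dL_dy * w2                 # dy/da  = w2
dL_dw1 = dL_da * a * (1 - a) * x   # da/dz = a(1-a), dz/dw1 = x

print(dL_dw1, dL_dw2)
```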
- Keras: highly integrated, not very flexible; a high-level API that is easy to learn and use. e.g.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
# first layer: flattened 28x28 input, 500 hidden units
model.add(Dense(input_dim=28 * 28, units=500, activation='sigmoid'))
# second layer
model.add(Dense(units=500, activation='relu'))
# output layer: 10 classes
model.add(Dense(units=10, activation='softmax'))
# model configuration
model.compile(loss='categorical_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# train the model
model.fit(train_x, train_y, batch_size=100, epochs=20)
# evaluate and predict
model.evaluate(validation_x, validation_y)
result = model.predict(test_x)
- Mini-batch gradient descent: the batch size is usually a power of 2; a batch that is too large can hurt performance.
Mini-batch is faster than one-example-at-a-time stochastic gradient descent because the examples in a batch are first stacked into a matrix and computed together (matrix optimization); see the sketch below.
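A minimal sketch of mini-batch iteration with NumPy; the helper name and batch size are illustrative assumptions. Each yielded batch is a matrix, so the forward pass can be done as one matrix multiplication.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size=128, shuffle=True):
    idx = np.arange(len(X))
    if shuffle:
        np.random.shuffle(idx)
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        # the whole batch is processed as one matrix operation,
        # which is why mini-batch beats one-example-at-a-time SGD
        yield X[batch], y[batch]
```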
- Vanishing gradient: as the network gets deeper, layers near the output have larger gradients and update quickly, while layers near the input update much more slowly. This is caused by the chain rule (repeated multiplication of small derivatives).
Solution: ReLU.
$$\sigma(z) = \max(z, 0)$$
- Fast to compute
- Equivalent to an infinite number of sigmoid functions stacked together
- Alleviates the vanishing-gradient problem
- Has a biological motivation
- Maxout: a learnable activation function; ReLU is a special case of Maxout (see the sketch below).
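A small NumPy sketch of a maxout unit with two linear pieces; when one piece is fixed to zero, the unit reduces to ReLU. The weights and names are illustrative.

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: max over k linear pieces.
    W has shape (k, d), b has shape (k,), x has shape (d,)."""
    return np.max(W @ x + b, axis=0)

x = np.array([1.0, -2.0])
W = np.array([[0.5, 0.3],    # learnable piece 1
              [0.0, 0.0]])   # piece 2 fixed to zero ...
b = np.array([0.1, 0.0])
print(maxout(x, W, b))       # ... so this behaves like ReLU(0.5*x1 + 0.3*x2 + 0.1)
```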
- RMSProp: like Adagrad, but replaces the sum of all past squared gradients with an exponentially decaying moving average, so the effective learning rate does not shrink to zero.
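A minimal RMSProp update sketch matching the description above; the decay rate `alpha`, the toy gradient, and the function name are illustrative assumptions.

```python
import numpy as np

def rmsprop(grad_fn, w0, eta=0.1, alpha=0.9, steps=300, eps=1e-8):
    w, sigma_sq = w0, 0.0
    for t in range(steps):
        g = grad_fn(w)
        # exponential moving average of squared gradients (vs. Adagrad's full sum)
        sigma_sq = alpha * sigma_sq + (1 - alpha) * g ** 2
        w = w - eta * g / (np.sqrt(sigma_sq) + eps)
    return w

print(rmsprop(lambda w: 2 * (w - 3), w0=0.0))  # ends up close to 3
```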
- Dropout: randomly drop a fraction of the neurons during training to prevent overfitting. Do not apply dropout at test time.
It can be viewed as an ensemble method: each mini-batch effectively trains a different sub-network (a different set of neurons).
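A hedged Keras sketch of dropout between fully connected layers; the layer sizes and the 0.5 rate are illustrative, and Keras's `Dropout` layer is automatically disabled at test time.

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(input_dim=28 * 28, units=500, activation='relu'))
model.add(Dropout(0.5))          # randomly drop 50% of units during training only
model.add(Dense(units=500, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=10, activation='softmax'))
```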
- Images.
- CNN: convolutional neural network
- Pooling layer: downsamples the feature maps
- Flatten: flatten the feature maps, then use fully connected layers for the output
- Core CNN parameters:
- Stride: how far the convolution window moves at each step
- Padding: whether to add a border of values around the input, so the stride divides the spatial size evenly
- Filter size: height, width, and number of channels
Number of parameters in a convolution layer:
$$f_1 \times x \times x \times f_2$$
where $f_1$ is the number of channels in the previous layer, $f_2$ the number of channels in the next layer, and $x$ the kernel height/width.
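A quick check of the parameter-count formula with Keras; the input size and filter count are made up, and note that Keras also adds one bias per filter, which the formula above omits.

```python
from keras.models import Sequential
from keras.layers import Conv2D

# 3-channel input, 64 output channels, 3x3 kernels:
# weights = 3 * 3 * 3 * 64 = 1728, plus 64 biases -> 1792 parameters
model = Sequential()
model.add(Conv2D(filters=64, kernel_size=(3, 3), padding='same',
                 input_shape=(32, 32, 3)))
model.summary()
```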
- Speech recognition
- As a classification problem: input -> acoustic feature, output -> state
- Each state has a stationary distribution for acoustic features
- Common models include HMM and GMM
- Phoneme: the basic unit of pronunciation
- The lower layers detect the manner of articulation
- All the phonemes share the results from the same set of detectors
- Use parameters effectively
- End-to-end learning: no need to design the hand-crafted intermediate stages.
- Semi-supervised learning:
- Transductive learning: the unlabeled data is the testing data
- Inductive learning: the unlabeled data is not the testing data
- Why do this? Collecting labeled data is hard, which is why semi-supervised learning is needed.
- Graph-based similarity measures: build a graph over labeled and unlabeled points; labels propagate along high-similarity edges.
- PCA:
- Derivation
$$z_1= w^1\cdot x, \qquad \bar{z}_1 = w^{1}\cdot \bar{x}$$
$$\begin{aligned} Var(z_1) &= \sum_{z_1}(z_1-\bar{z}_1)^2 \\ & = \sum_{x}(w^1\cdot x-w^1\cdot\bar{x})^2\\ & = \sum\big(w^1\cdot(x-\bar{x})\big)^2\\ & = \sum (w^1)^{T}(x-\bar{x})(x-\bar{x})^{T}w^1\\ & = (w^1)^{T}Cov(x)\,w^1 \end{aligned}$$
So the problem becomes the following optimization, where $S = Cov(x)$:
$$\text{maximize} \quad (w^1)^{T}Sw^1 \quad \text{s.t.} \quad (w^1)^{T}w^1=1$$
Solve it with Lagrange multipliers:
$$g(w^1) = (w^1)^{T}Sw^1-\alpha\big((w^1)^{T}w^1-1\big)$$
$$\begin{cases} \frac{\partial g(w^1)}{\partial w_1^1}=0 \\ \frac{\partial g(w^1)}{\partial w_2^1}=0\\ \dots \end{cases} \Rightarrow Sw^1 = \alpha w^1 \Rightarrow (w^1)^{T}Sw^1 = \alpha$$
Maximizing $\alpha$ means that $w^1$ is the eigenvector of $S$ corresponding to its largest eigenvalue. Similarly, $w^2$ is the eigenvector corresponding to the second-largest eigenvalue.
- PCA decorrelates the features:
$$z=Wx, \qquad Cov(z)=D \ (\text{diagonal})$$
$$\begin{aligned} Cov(z) &= \sum(z-\bar{z})(z-\bar{z})^{T}=WSW^{T} \\ & = WS[w^1 \dots w^{K}] \\ & = W[Sw^1 \dots Sw^{K}] \\ & = W[\lambda_1w^1 \dots \lambda_{K}w^{K}] \\ & = [\lambda_1Ww^1\dots \lambda_{K}Ww^{K}]\\ & = [\lambda_1e^1 \dots \lambda_{K}e^{K}] = D \end{aligned}$$
- PCA looks like a neural network with one hidden layer and a linear activation function (see the sketch below).
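A minimal NumPy sketch of the eigen-decomposition view of PCA derived above; the random data and the number of components are placeholders.

```python
import numpy as np

def pca(X, k=2):
    """Rows of X are samples. Returns the top-k principal directions and projections."""
    X_centered = X - X.mean(axis=0)
    S = np.cov(X_centered, rowvar=False)    # covariance matrix Cov(x)
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1][:k]   # largest eigenvalues first
    W = eigvecs[:, order]                   # columns are w^1, ..., w^k
    return W, X_centered @ W                # projections z for each sample

X = np.random.randn(100, 5)
W, Z = pca(X, k=2)
print(W.shape, Z.shape)  # (5, 2) (100, 2)
```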
More methods:
- CBOW: predict a word from its context. Skip-Gram: predict the context from the word.
- LLE (Locally Linear Embedding): represent each point as a weighted combination of its neighbors, then find low-dimensional points that preserve those reconstruction weights.
- Laplacian Eigenmaps, a graph-based approach:
$$\text{Loss}=\sum_{x^{r}}C(y^{r},\hat{y}^{r})+\lambda S$$
$$S = \frac{1}{2}\sum_{i,j}w_{i,j}(y^{i}-y^{j})^2=y^{T}Ly, \qquad L = D-W \ \ (\text{graph Laplacian matrix})$$
For unsupervised learning,
$$S=\frac{1}{2}\sum_{i,j}w_{i,j}(z^{i}-z^{j})^2$$
$$\text{span}\{z^1,z^2,\dots,z^{m}\}=R^{m}$$
Spectral clustering: clustering on z
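A small NumPy sketch of building the graph Laplacian $L = D - W$ from a toy similarity matrix and evaluating the smoothness term $S = y^{T}Ly$; the weights and labels are made up.

```python
import numpy as np

# toy symmetric similarity (weight) matrix W over 3 nodes
W = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
D = np.diag(W.sum(axis=1))      # degree matrix
L = D - W                       # graph Laplacian

y = np.array([1.0, 0.8, -0.2])  # labels / embeddings on the nodes
S = y @ L @ y                   # equals 0.5 * sum_ij w_ij (y_i - y_j)^2
print(S)
```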
- t-SNE (T-distributed Stochastic Neighbor Embedding):
A good method for visualizing high-dimensional points.
Excellent tutorial on t-SNE:
https://github.com/oreillymedia/t-SNE-tutorial
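A hedged sketch using scikit-learn's `TSNE` (scikit-learn is not mentioned in the notes; the random data is a placeholder):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.randn(200, 50)                            # placeholder high-dimensional points
X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)
print(X_2d.shape)                                       # (200, 2), ready to scatter-plot
```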
- Autoencoder
- Text retrieval
Vector space model: represent the query or the document as a bag-of-words vector
- Use an autoencoder to compress it into a 2-dimensional vector
- Similar image search
- Encode each image into a 256-dimensional vector
- Compute similarity with a distance metric on the codes (a minimal autoencoder sketch follows this list)
- Generative models:
- PixelRNN
- Variational Autoencoder (VAE)
- Generative Adversarial Network (GAN), a kind of mimicry
GANs are notoriously hard to train.
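A minimal Keras autoencoder sketch for the retrieval / similar-image-search ideas above; the 256-dimensional code follows the notes, while the other layer sizes and names are illustrative assumptions.

```python
from keras.models import Model
from keras.layers import Input, Dense

inp = Input(shape=(784,))
h = Dense(512, activation='relu')(inp)
code = Dense(256, activation='relu')(h)        # 256-d code used for similarity search
h2 = Dense(512, activation='relu')(code)
out = Dense(784, activation='sigmoid')(h2)     # reconstruct the input

autoencoder = Model(inp, out)
encoder = Model(inp, code)                     # encoder alone maps inputs to codes
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(train_x, train_x, epochs=20, batch_size=100)
# codes = encoder.predict(images); compare codes with a distance metric
```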
- Transfer learning: use data that is only loosely related to the task to help training
- Tasks: speech recognition, image recognition, text analysis, etc.
- Fine-tuning a model:
- Speech: copy the last layers
- Images: copy the first layers
- Multitask learning: share some layers while performing different tasks
- Domain-adversarial training
- Zero-shot learning
- Structured learning:
- Three problems
- Evaluation: what does F(x, y) look like?
- Inference: how to solve the arg max problem
- Training: given training data, how to find F(x, y)
- Example 1: object detection: use a CNN to output a bounding box and its label
- Structured SVM:
https://blog.youkuaiyun.com/yjw123456/article/details/105010218
Similarities and differences with the ordinary SVM: both are quadratic programming problems, but a structured SVM has far more constraints and is solved with the cutting-plane method.
- Sequence labeling:
Example: POS tagging (label the part of speech of every word in a sentence)
- HMM:
- CRF (Conditional Random Field)
- RNN (Recurrent Neural Network)
- LSTM (see the sketch below)
Needs about four times as many parameters as an ordinary neuron (one set per gate).
LSTM helps with the vanishing-gradient problem because:
- memory and input are added (the cell state is updated additively)
- the forget gate is usually open, so the stored memory is rarely wiped out
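A hedged Keras sketch of a small LSTM sequence classifier; the sequence length, feature size, and unit counts are illustrative assumptions.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# 20 timesteps, 10 features per step; each LSTM unit carries 4 sets of weights
# (input, forget, output gates + cell input), hence ~4x a plain neuron's parameters
model.add(LSTM(units=32, input_shape=(20, 10)))
model.add(Dense(units=2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.summary()
```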
- Bagging: train several models on bootstrap resamples of the data and average (or vote on) their outputs; mainly reduces variance.
- Boosting: train models sequentially, each one focusing on the examples the previous ones got wrong; mainly reduces bias.
- Stacking: feed the outputs of several base models into a meta-model that learns how to combine them (a sketch of all three follows below).
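A hedged scikit-learn comparison of the three ensemble styles (scikit-learn and the specific estimators are assumptions, not from the notes; the synthetic data is a placeholder):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, StackingClassifier

X, y = make_classification(n_samples=500, random_state=0)

bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=20)   # parallel, bootstrap samples
boosting = AdaBoostClassifier(n_estimators=20)                           # sequential, reweights hard examples
stacking = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier()), ('lr', LogisticRegression())],
    final_estimator=LogisticRegression())                                # meta-learner on base outputs

for name, clf in [('bagging', bagging), ('boosting', boosting), ('stacking', stacking)]:
    print(name, clf.fit(X, y).score(X, y))
```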
- Deep reinforcement learning
- Step 1: define the model (a neural network as the actor/policy)
- Step 2: define the goodness of the model (the expected reward)
- Step 3: pick the best model
After computing the expected reward, optimize it with gradient ascent (i.e., gradient descent on its negative).
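For reference, the usual policy-gradient (REINFORCE) form of "compute the expected reward, then follow its gradient" is the following; this is the standard textbook expression, assumed rather than taken from the notes:
$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^{n})\,\nabla \log p_\theta(a_t^{n}\mid s_t^{n})$$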
- Summary