Coursera - Convolutional Neural Networks - Andrew Ng: Study Notes
Part 1 Foundations of Convolutional Neural Networks
- Basics: the meaning and basic properties of convolution. Output size: ⌊(Input + 2·Pad − Filter) / Stride⌋ + 1. Valid convolution: no padding, the filter stays entirely inside the image // Full convolution: convolution starts as soon as the filter and the image overlap, with the uncovered part zero-filled // Same convolution: convolution starts when the filter center (K) coincides with the image corner; "same" also means the output feature map keeps the same size as the input. Strided convolution. There is also the difference between convolution and cross-correlation (whether the kernel is flipped), but it does not matter for CNNs (the weights are learned linear parameters either way).
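A quick sanity check of the size formula in plain Python (the helper name is my own):

```python
def conv_output_size(n, f, pad=0, stride=1):
    """Output side length of a square convolution:
    floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

# Valid convolution: 6x6 input, 3x3 filter -> 4x4 output
assert conv_output_size(6, 3) == 4
# Same convolution: pad = (f - 1) // 2 keeps the size (stride 1, odd f)
assert conv_output_size(6, 3, pad=1) == 6
```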
- Convolutions over Volume: the filter's channel count matches the input's channel count; one filter produces one single-channel output map, and multiple filters give a multi-channel output.
- One Layer of a Convolutional Network:
- Simple Convolution Network Example: conv + pool + fc
- Pooling Layers: max or average. Hyperparameters: filter size f and stride s, i.e. the size of the pooling window and its step. Output size: ⌊(I − f) / s⌋ + 1. No learnable parameters.
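A naive NumPy illustration of max pooling and the size rule (the helper name is my own):

```python
import numpy as np

def max_pool2d(x, f=2, s=2):
    """Naive max pooling over a 2-D array: no padding, no parameters."""
    out_h = (x.shape[0] - f) // s + 1
    out_w = (x.shape[1] - f) // s + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i*s:i*s+f, j*s:j*s+f].max()
    return out

x = np.arange(16).reshape(4, 4)
print(max_pool2d(x))  # [[ 5  7] [13 15]]: a 2x2 output, as the formula predicts
```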
- CNN Example: LeNet-5
In a fully connected layer there is one bias per neuron, so the number of biases equals the number of neurons. FC3 has 120 neurons, hence 120 biases.
- Why Convolutions? 1) Parameter sharing: one filter is reused across the whole image 2) Sparsity of connections: each output value depends only on a local region
- Notebook: ".ipynb"
TensorFlow: create placeholders / initialize parameters / forward propagate / compute the cost / create an optimizer
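A minimal sketch of that pipeline in the TF 1.x style of the course notebooks (shapes and hyperparameters are illustrative, not prescribed):

```python
import tensorflow as tf  # assumes TF 1.x, as in the course notebooks

# Placeholders for a batch of 64x64 RGB images and 6 classes
X = tf.placeholder(tf.float32, [None, 64, 64, 3])
Y = tf.placeholder(tf.float32, [None, 6])

# Initialize parameters (one conv filter bank as an illustration)
W1 = tf.get_variable("W1", [4, 4, 3, 8],
                     initializer=tf.contrib.layers.xavier_initializer())

# Forward propagation: conv -> relu -> max pool -> flatten -> FC
Z1 = tf.nn.conv2d(X, W1, strides=[1, 1, 1, 1], padding="SAME")
A1 = tf.nn.relu(Z1)
P1 = tf.nn.max_pool(A1, ksize=[1, 8, 8, 1], strides=[1, 8, 8, 1], padding="SAME")
F = tf.contrib.layers.flatten(P1)
Z3 = tf.contrib.layers.fully_connected(F, 6, activation_fn=None)

# Compute the cost and create an optimizer
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=Z3, labels=Y))
optimizer = tf.train.AdamOptimizer(learning_rate=0.009).minimize(cost)
```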
Part 2 Deep Convolutional Models: Case Studies
- Classic Networks: LeNet-5 / AlexNet / VGG-16, then ResNet and Inception
- LeNet-5, ~60k parameters. Common sense: 1) as the network gets deeper, height/width shrink while the channel count grows 2) the basic network pattern: conv + activation (sigmoid/tanh) + pool + conv + activation + pool + fc + fc + output ...
- AlexNet: ImageNet + ReLU + multi-GPU (2 GPUs) training + LRN (not helpful)
- VGG-16: 16 layers with weights; height/width halve at each pooling stage while channels double; ICLR 2015
- ResNets: same convolutions keep dimensions matched, which makes the skip connections easy to wire up.
A residual block easily learns the identity mapping: a^(l+2) = g(z^(l+2) + a^(l)) reduces to a^(l) when the added weights shrink toward zero, so extra blocks do not hurt performance, and anything learned beyond the identity is a bonus.
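A minimal residual-block sketch in Keras layers (my own simplification, not the assignment's exact block); same convolutions keep H, W, C fixed so the shortcut adds directly:

```python
import tensorflow as tf

def identity_block(x, filters):
    """Two same convolutions plus a skip connection."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(x)
    y = tf.keras.layers.Activation("relu")(y)
    y = tf.keras.layers.Conv2D(filters, 3, padding="same")(y)
    y = tf.keras.layers.Add()([y, shortcut])        # z^(l+2) + a^(l)
    return tf.keras.layers.Activation("relu")(y)    # a^(l+2) = g(z^(l+2) + a^(l))
```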
- Networks in Networks and 1x1 Convolutions:
A 1x1xnum_channel filter takes a weighted sum over all channels at each pixel; you can think of it as a fully connected layer of width num_channel applied per pixel.
The idea is also known as Network in Network. It reduces the channel dimension and adds nonlinearity to the network.
An application example: 32 filters of size 1x1x192 shrink a 28x28x192 volume down to 28x28x32, as sketched below.
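A one-liner check of that shrinking example (Keras layer choice is mine; shapes follow the lecture example):

```python
import tensorflow as tf

x = tf.random.normal([1, 28, 28, 192])
y = tf.keras.layers.Conv2D(32, kernel_size=1, activation="relu")(x)
print(y.shape)  # (1, 28, 28, 32): each pixel is an FC layer over the 192 channels
```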
- Inception Network Motivation: run filters of several sizes in parallel and concatenate the outputs. The Inception module's computational-cost problem:
~120 million multiplications for a direct 5x5 convolution: 28x28x32 x (5x5x192)
With a bottleneck layer (a 1x1 convolution first): ~12.4 million multiplications, cutting the cost to about 1/10.
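Reproducing the lecture's arithmetic (the intermediate 16-channel bottleneck width is from the lecture):

```python
# Direct 5x5 conv: 32 filters of size 5x5x192 over a 28x28 output grid.
direct = 28 * 28 * 32 * (5 * 5 * 192)
print(direct)      # 120,422,400 multiplications

# Bottleneck: 1x1x192 conv down to 16 channels, then 5x5x16 conv up to 32.
bottleneck = 28 * 28 * 16 * 192 + 28 * 28 * 32 * (5 * 5 * 16)
print(bottleneck)  # 2,408,448 + 10,035,200 = 12,443,648 (~1/10 the cost)
```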
- Inception Network: a well-designed bottleneck shrinks the tensor dimensions and saves computation without hurting accuracy.
Cascading several modules with different parameters gives the Inception network.
On the side-branch softmax outputs: they ensure that even the intermediate layers of the network already give reasonably good predictions, acting as a regularizer that keeps the network from overfitting.
- Practical Advice for Using ConvNets: before the programming assignments, Andrew gives a few suggestions:
- Using Open-Source Implementations: papers are hard to reproduce, and details like learning-rate decay matter, so read open-source code and start from pre-trained models / GitHub
- Transfer Learning: 1) use pre-trained weights/models as initialization, or take a few intermediate layers to build your network - 2) drop the final softmax and attach your own - 3) freeze some layers' parameters (usually conv layers; how many layers and for how long are design choices) - 4) train on your "little" dataset. A freeze-and-finetune sketch follows.
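A hedged Keras sketch of steps 1)-4); the base model choice and class count are illustrative, not from the course:

```python
import tensorflow as tf

# 1) Pre-trained weights as initialization (MobileNetV2 chosen arbitrarily)
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")

# 3) Freeze the pretrained conv layers
base.trainable = False

# 2) Attach our own softmax head (5 classes as an example)
outputs = tf.keras.layers.Dense(5, activation="softmax")(base.output)
model = tf.keras.Model(base.input, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")

# 4) model.fit(little_dataset, ...)  # train only the new head
```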
- Data Augmentation: common distortions: mirroring / random cropping / rotation / shearing / local warping / color shifting (PCA color augmentation) / ...
Data augmentation can run in parallel with batch training; a sketch follows.
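A few of the distortions above with tf.image (parameter ranges and crop size are mine; the input must be at least 200x200):

```python
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)           # mirroring
    image = tf.image.random_crop(image, size=[200, 200, 3])  # random cropping
    image = tf.image.random_hue(image, max_delta=0.08)       # simple color shifting
    image = tf.image.random_brightness(image, max_delta=0.2)
    return image

# Mapping augmentation inside the input pipeline overlaps it with training:
# ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE).batch(32).prefetch(1)
```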
- State of CV: data vs hand-engineering. When data is plentiful, a simple design suffices and little hand-engineering is needed; when data is scarce, more hand-engineering is required. Ranked by available data: speech recognition > image recognition > object detection > object tracking (how easy labeling is determines how much data you can collect, and labeling difficulty seems to grow with problem difficulty). Two sources of knowledge: labeled data, and hand-engineered features / network architecture / other components.
Hand-designed components can be very effective, but they may be limited by the designer's experience and generalize poorly.
When data is scarce, transfer learning helps enormously.
Tips for benchmarks and competition winners: ensembling and multi-crop at test time (rarely worth the cost in production).
Part 3 Object Detection
- (Single Object) Classification + Object Localization: y = (isExistObj, ObjClass, ObjRect); loss = {if isExistObj: differences on all three components count; else: only isExistObj counts}
- Landmark Detection: for faces, y = (isFace, FacePoints); human pose estimation works similarly.
- (Multiple Objects) Object Detection: sliding windows; the computational cost is too high
- Turning FC layers into convolutional layers: use a valid convolution whose filter height/width match the input height/width; the number of filters equals the number of FC units. A sketch follows.
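Checking the equivalence on the lecture's shapes (a 400-unit FC over a 5x5x16 activation equals a valid conv with 400 filters of size 5x5x16):

```python
import tensorflow as tf

x = tf.random.normal([1, 5, 5, 16])
fc_as_conv = tf.keras.layers.Conv2D(400, kernel_size=5, padding="valid")(x)
print(fc_as_conv.shape)  # (1, 1, 1, 400): one "neuron" per filter
```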
- Turning sliding windows into a convolution operation (OverFeat):
- Illustration: each pixel of the final output layer is exactly the result of applying the network to the corresponding sliding window of the original image.
- BBox accuracy problem: the window stride and the fixed box shape limit localization precision
- YOLO: You Only Look Once. Grid the output layer; each cell predicts (isExistObj, ObjClass, ObjRect), under the assumption that each cell contains at most one object. This removes the dependence on the sliding-window stride and box shape. In ObjRect = {bx by bh bw} everything is measured relative to the cell's width and height: bx, by lie in 0~1, while bh, bw can exceed 1 (a box may span several cells); see the encoding sketch below.
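A hedged sketch of that box encoding on a 3x3 grid (the grid size and function are illustrative, not the assignment's code):

```python
def encode_box(x, y, w, h, grid=3):
    """x, y, w, h are fractions of the whole image;
    returns the owning cell plus (bx, by, bh, bw) in cell units."""
    col, row = int(x * grid), int(y * grid)
    bx = x * grid - col   # center offset inside the cell, in [0, 1)
    by = y * grid - row
    bh = h * grid         # height/width in cell units: may exceed 1
    bw = w * grid
    return (row, col), bx, by, bh, bw

print(encode_box(0.5, 0.5, 0.6, 0.4))  # ((1, 1), 0.5, 0.5, ~1.2, ~1.8)
```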
- What you should remember:
- YOLO is a state-of-the-art object detection model that is fast and accurate
It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
The encoding can be seen as a grid where each of the 19x19 cells contains information about 5 boxes.
You filter through all the boxes using non-max suppression. Specifically:
Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes
Intersection over Union (IoU) thresholding to eliminate overlapping boxes
Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as a lot of computation, we used previously trained model parameters in this exercise. If you wish, you can also try fine-tuning the YOLO model with your own dataset, though this would be a fairly non-trivial exercise.
- IoU: Intersection over Union, the measure of the overlap between two bounding boxes.
- NMS: Non-max Suppression, used together with YOLO's grid predictions.
- Take the box with the locally highest probability, then suppress the rest by IoU (e.g. discard boxes with IoU > 0.5 against it); a sketch of both follows.
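A plain-Python sketch of IoU and greedy NMS (function names and the (x1, y1, x2, y2) box convention are my own):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-max suppression: keep the highest-score box, drop overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```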
- Anchor boxes: for overlapping objects, e.g. a pedestrian and a car whose box midpoints fall into the same grid cell; each anchor shape gets its own prediction slot.
- Limitations remain: 1) a cell may contain more objects than there are anchors 2) two objects in a cell may match the same anchor shape; both cases need a hand-designed tie-breaking rule
- YOLO Algorithm:
- RPN: Region Proposal Network, from the R-CNN family (introduced in Faster R-CNN)
- A quick overview of the common pipelines: stages in object detection
- Plain-language notes on anchors: anchor-based vs anchor-free
- To read when time allows: some anchor-free papers, ExtremeNet / CornerNet / FSAF - CVPR 2019
Part 4 Special Applications: Face Recognition & Neural Style Transfer
1) Face Recognition
- The differences and similarities between face recognition and face verification:
- One-shot Learning
- Similarity Function
- Siamese Network - DeepFace (2014), "Closing the Gap to Human-Level Performance in Face Verification"
- Loss function: Triplet Loss. Look at three images at once: Negative (N) - Anchor (A) - Positive (P). The idea: ||f(A) - f(P)||^2 + alpha <= ||f(A) - f(N)||^2.
- The alpha above is the margin; it is essential (without it, a trivial all-zero encoding would satisfy the constraint)!
- From FaceNet (2015): loss = sum(max(delta + alpha, 0)), where delta = ||f(A) - f(P)||^2 - ||f(A) - f(N)||^2
- Choosing the triplets A, P, N: if you pick them at random (A, P same person; A, N different people), the constraint is too easily satisfied; you need "hard" triplets whose delta is close to 0, so that alpha actually does some work.
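A minimal TensorFlow version of the FaceNet-style loss above (function shape and the alpha value are my own):

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss over batches of embeddings f(A), f(P), f(N)."""
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # sum(max(delta + alpha, 0)) with delta = pos_dist - neg_dist
    return tf.reduce_sum(tf.maximum(pos_dist - neg_dist + alpha, 0.0))
```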
- Face Verification and Binary Classification
What you should remember:
Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.
2) Neural Style Transfer
- Initialize G randomly + gradient descent (+ cost function)
- cost function = similarity of content + similarity of style
Style is defined as the degree of correlation between the channels of a given layer; a Gram-matrix sketch follows.
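A small sketch of that style representation (assumes a single activation tensor with known static shape; the helper is my own):

```python
import tensorflow as tf

def gram_matrix(a):
    """Style of one layer's activations a with shape (H, W, C):
    G[i, j] = correlation between channels i and j over all positions."""
    h, w, c = a.shape
    flat = tf.reshape(a, [h * w, c])                 # one row per position
    return tf.matmul(flat, flat, transpose_a=True)   # (C, C) Gram matrix
```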
Here’s what the program will have to do:
1 Create an Interactive Session
2 Load the content image
3 Load the style image
4 Randomly initialize the image to be generated
5 Load the VGG16 model
6 Build the TensorFlow graph:
6.1 Run the content image through the VGG16 model and compute the content cost
6.2 Run the style image through the VGG16 model and compute the style cost
6.3 Compute the total cost
6.4 Define the optimizer and the learning rate
7 Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.
What you should remember:
Neural Style Transfer is an algorithm that, given a content image C and a style image S, can generate an artistic image.
It uses representations (hidden layer activations) based on a pretrained ConvNet.
The content cost function is computed using one hidden layer’s activations.
The style cost function for one layer is computed using the Gram matrix of that layer’s activations. The overall style cost function is obtained using several hidden layers.
Optimizing the total cost function results in synthesizing new images.
Convolutions in 1D, 2D & 3D