ssd-tensorflow-slim

https://blog.youkuaiyun.com/qq1483661204/article/details/79776065

Object Detection -- SSD (TensorFlow version), explained line by line

       This post walks through the TensorFlow implementation of SSD. The code is SSD-Tensorflow, the most-starred TensorFlow SSD implementation on GitHub, so I use it for this walkthrough. The paper itself is linked here as well: SSD paper download.

     I explain the code side by side with the paper. The repo provides both SSD300 and SSD512; the code is the same except for the input image size. Here I mainly cover SSD512.

1   Network Structure

       Let's first look at the network architecture figure from the paper:


        The architecture is fairly simple: it is VGG-16 with modifications. The front part is the same as VGG, but SSD replaces VGG's fully connected layers with several convolutional layers, removes the dropout layers, and uses the atrous algorithm, i.e. dilated (à trous) convolution; for details on that convolution see this link: atrous algorithm

    As the figure shows, the difference between SSD and YOLO is that YOLO predicts boxes only from the last feature map, while SSD uses several feature maps of different sizes for prediction and regression. YOLO's weaknesses are imprecise localization and poor performance on small objects; SSD overcomes these to some extent because it predicts from multiple feature maps. SSD's multi-scale design uses feature maps at several depths: stride-2 downsampling keeps shrinking the spatial size, so the deeper a feature map, the larger its receptive field, while shallower maps have smaller receptive fields and are better at detecting small objects. Even so, SSD is still not great on small objects, because by the first prediction layer (block4) the VGG trunk has already downsampled the feature map by 8x.

        Let's look at the code that builds the network:


   
end_points = {}
with tf.variable_scope(scope, 'ssd_512_vgg', [inputs], reuse=reuse):
    # Original VGG-16 blocks.
    net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
    end_points['block1'] = net
    net = slim.max_pool2d(net, [2, 2], scope='pool1')
    # Block 2.
    net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
    end_points['block2'] = net
    net = slim.max_pool2d(net, [2, 2], scope='pool2')
    # Block 3.
    net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
    end_points['block3'] = net
    net = slim.max_pool2d(net, [2, 2], scope='pool3')
    # Block 4.
    net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
    end_points['block4'] = net
    net = slim.max_pool2d(net, [2, 2], scope='pool4')
    # Block 5.
    net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
    end_points['block5'] = net
    net = slim.max_pool2d(net, [3, 3], 1, scope='pool5')

    # Additional SSD blocks.
    # Block 6: dilated (atrous) 3x3 convolution.
    net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
    end_points['block6'] = net
    # Block 7: 1x1 convolution.
    net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
    end_points['block7'] = net
    # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
    end_point = 'block8'
    with tf.variable_scope(end_point):
        net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
        net = custom_layers.pad2d(net, pad=(1, 1))
        net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
    end_points[end_point] = net
    end_point = 'block9'
    with tf.variable_scope(end_point):
        net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
        net = custom_layers.pad2d(net, pad=(1, 1))
        net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
    end_points[end_point] = net
    end_point = 'block10'
    with tf.variable_scope(end_point):
        net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
        net = custom_layers.pad2d(net, pad=(1, 1))
        net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
    end_points[end_point] = net
    end_point = 'block11'
    with tf.variable_scope(end_point):
        net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
        net = custom_layers.pad2d(net, pad=(1, 1))
        net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
    end_points[end_point] = net
    end_point = 'block12'
    with tf.variable_scope(end_point):
        net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
        net = custom_layers.pad2d(net, pad=(1, 1))
        net = slim.conv2d(net, 256, [4, 4], scope='conv4x4', padding='VALID')
        # Fix padding to match Caffe version (pad=1).
        # pad_shape = [(i-j) for i, j in zip(layer_shape(net), [0, 1, 1, 0])]
        # net = tf.slice(net, [0, 0, 0, 0], pad_shape, name='caffe_pad')
    end_points[end_point] = net

    # Prediction and localisations layers.
    predictions = []
    logits = []
    localisations = []
    for i, layer in enumerate(feat_layers):
        with tf.variable_scope(layer + '_box'):
            # p = cls_pred, l = loc_pred: the class and box predictions of this layer.
            p, l = ssd_vgg_300.ssd_multibox_layer(end_points[layer],
                                                  num_classes,
                                                  anchor_sizes[i],
                                                  anchor_ratios[i],
                                                  normalizations[i])
        # Apply softmax to the class predictions.
        predictions.append(prediction_fn(p))
        logits.append(p)
        localisations.append(l)

The code above builds the network, which is essentially VGG. The end_points dict stores the output of each feature map: SSD does not use a single layer but several, so the outputs of all the relevant layers are kept here. Note that if you compute the spatial size of each layer's output, you will find feature maps of 8x8, 4x4 and so on, with the last one being 1x1; this is by design. If you swap in another backbone you have to work this out yourself, which is also why the code sometimes inserts an extra pad: to guarantee that the last output is 1x1 and that the intermediate maps come out as 8x8, 4x4, etc. For how to compute them, see the linked post; a quick size check is also sketched below.
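As a sanity check, here is a small standalone sketch (not code from the repo) that recomputes the sizes of the extra SSD blocks, starting from block7 at 32x32 for a 512x512 input and following the pad2d((1,1)) + 'VALID' convolutions used above:

import math

def conv_out(size, kernel, stride, pad):
    # Output size of a convolution/pool with explicit padding.
    return (size + 2 * pad - kernel) // stride + 1

size = 32                                             # block7 is 32x32 (see feat_shapes below)
for name in ['block8', 'block9', 'block10', 'block11']:
    size = conv_out(size, kernel=3, stride=2, pad=1)  # pad2d((1,1)) + 3x3 'VALID', stride 2
    print(name, size)                                 # 16, 8, 4, 2
size = conv_out(size, kernel=4, stride=1, pad=1)      # block12: pad2d((1,1)) + 4x4 'VALID'
print('block12', size)                                # 1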

Let's look at a few parameters:


   
feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11', 'block12'],
feat_shapes=[(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)],
anchor_steps=[8, 16, 32, 64, 128, 256, 512],
   

These feature-map sizes (64x64 and so on) follow from the network structure; if you work them out you will find they match, e.g. block4's feature map is 64x64. So if you change the backbone, you have to recompute and adjust these values yourself. anchor_steps matches as well: it is the downsampling factor of each feature map, e.g. 8x64 = 512, 16x32 = 512, and so on; it cannot be set arbitrarily. A quick consistency check is sketched below.
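A one-line check under these SSD-512 settings (a standalone sketch, not part of the repo):

feat_shapes = [(64, 64), (32, 32), (16, 16), (8, 8), (4, 4), (2, 2), (1, 1)]
anchor_steps = [8, 16, 32, 64, 128, 256, 512]
# Each step times its feature-map size should reproduce the 512x512 input size.
assert all(step * shape[0] == 512 for step, shape in zip(anchor_steps, feat_shapes))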

        The code above also calls ssd_vgg_300.ssd_multibox_layer; let's look at it:


   
def ssd_multibox_layer(inputs,
                       num_classes,
                       sizes,
                       ratios=[1],
                       normalization=-1,
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    num_anchors = len(sizes) + len(ratios)
    # Location.
    num_loc_pred = num_anchors * 4
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')
    loc_pred = custom_layers.channel_to_last(loc_pred)
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1] + [num_anchors, 4])
    # Class prediction.
    num_cls_pred = num_anchors * num_classes
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')
    cls_pred = custom_layers.channel_to_last(cls_pred)
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1] + [num_anchors, num_classes])
    return cls_pred, loc_pred

    In the code above, the feature map goes through a 3x3 convolution to produce the box and class outputs. custom_layers.channel_to_last just moves the channel dimension to the last axis; since TensorFlow's default layout is already NHWC, this is mostly a no-op. num_anchors is the number of default boxes for this layer.

tensor_shape(cls_pred, 4)[:-1] + [num_anchors, num_classes]
tensor_shape(loc_pred, 4)[:-1] + [num_anchors, 4]

tensor_shape returns the shape of the tensor; the reshape then splits out the last dimension, effectively turning the output into a 5-D tensor whose last two dimensions index the anchor and the class (or the 4 box coordinates). The class and box predictions are returned; note that neither output goes through an activation function here. A sketch of the tensor_shape helper follows.
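For reference, tensor_shape is roughly the following helper, which mixes static and dynamic shapes so the reshape also works when the batch size is unknown (a sketch from memory of the repo; treat the details as an approximation):

def tensor_shape(x, rank=3):
    """Return the shape of x as a list, using static values where known
    and dynamic tf.shape() entries otherwise."""
    if x.get_shape().is_fully_defined():
        return x.get_shape().as_list()
    static_shape = x.get_shape().with_rank(rank).as_list()
    dynamic_shape = tf.unstack(tf.shape(x), rank)
    return [s if s is not None else d
            for s, d in zip(static_shape, dynamic_shape)]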

Now back to the code that follows the network definition:


   
for i, layer in enumerate(feat_layers):
    with tf.variable_scope(layer + '_box'):
        # p = cls_pred, l = loc_pred: the predictions of this layer.
        p, l = ssd_vgg_300.ssd_multibox_layer(end_points[layer],
                                              num_classes,
                                              anchor_sizes[i],
                                              anchor_ratios[i],
                                              normalizations[i])
    # Apply softmax to the class predictions.
    predictions.append(prediction_fn(p))
    logits.append(p)
    localisations.append(l)
return predictions, localisations, logits, end_points

This loop simply collects the per-layer results in lists. prediction_fn is softmax; since the multibox layer has no activation, predictions holds the softmaxed class scores, logits holds the raw class scores, localisations holds the predicted boxes, and end_points holds the output of every layer. The expected shapes are illustrated below.
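To make the list contents concrete, these are the shapes you would get for the first prediction layer of SSD-512, assuming a batch of 8 and 21 Pascal VOC classes (both values are only illustrative assumptions, not fixed by the repo):

batch, num_classes = 8, 21                 # illustrative values
num_anchors_block4 = 2 + 2                 # len(sizes) + len(ratios) for block4
print([batch, 64, 64, num_anchors_block4, num_classes])  # shape of logits[0] / predictions[0]
print([batch, 64, 64, num_anchors_block4, 4])            # shape of localisations[0]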

That is the whole architecture: take VGG, replace the fully connected layers with convolutions, and predict boxes not just from the last layer but from feature maps of several sizes.

2   SSD Anchor Box Generation


   
def anchors(self, img_shape, dtype=np.float32):
    """Compute the default anchor boxes, given an image shape.
    """
    return ssd_anchors_all_layers(img_shape,
                                  self.params.feat_shapes,
                                  self.params.anchor_sizes,
                                  self.params.anchor_ratios,
                                  self.params.anchor_steps,
                                  self.params.anchor_offset,
                                  dtype)

Anchor generation goes through two levels of calls. This is the entry point; it calls ssd_anchors_all_layers, where img_shape is the input image size. Let's look at that function:


   
def ssd_anchors_all_layers(img_shape,
                           layers_shape,
                           anchor_sizes,
                           anchor_ratios,
                           anchor_steps,
                           offset=0.5,
                           dtype=np.float32):
    """Compute anchor boxes for all feature layers.
    """
    layers_anchors = []
    for i, s in enumerate(layers_shape):
        anchor_bboxes = ssd_anchor_one_layer(img_shape, s,
                                             anchor_sizes[i],
                                             anchor_ratios[i],
                                             anchor_steps[i],
                                             offset=offset, dtype=dtype)
        layers_anchors.append(anchor_bboxes)
    return layers_anchors

This function is a for loop over the feature layers used for prediction; layers_shape holds the feature-map sizes computed earlier. It then calls ssd_anchor_one_layer for each layer; let's look at that:


   
def ssd_anchor_one_layer(img_shape,
                         feat_shape,
                         sizes,
                         ratios,
                         step,
                         offset=0.5,
                         dtype=np.float32):
    """Computer SSD default anchor boxes for one feature layer.
    Determine the relative position grid of the centers, and the relative
    width and height.
    Arguments:
      feat_shape: Feature shape, used for computing relative position grids;
      size: Absolute reference sizes;
      ratios: Ratios to use on these features;
      img_shape: Image shape, used for computing height, width relatively to the
        former;
      offset: Grid offset.
    Return:
      y, x, h, w: Relative x and y grids, and height and width.
    """
    # Compute the position grid: simple way.
    # y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    # y = (y.astype(dtype) + offset) / feat_shape[0]
    # x = (x.astype(dtype) + offset) / feat_shape[1]
    # Weird SSD-Caffe computation using steps values...
    y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]
    y = (y.astype(dtype) + offset) * step / img_shape[0]
    x = (x.astype(dtype) + offset) * step / img_shape[1]
    # Expand dims to support easy broadcasting.
    y = np.expand_dims(y, axis=-1)
    x = np.expand_dims(x, axis=-1)
    # Compute relative height and width.
    # Tries to follow the original implementation of SSD for the order.
    num_anchors = len(sizes) + len(ratios)
    h = np.zeros((num_anchors, ), dtype=dtype)
    w = np.zeros((num_anchors, ), dtype=dtype)
    # Add first anchor boxes with ratio=1.
    h[0] = sizes[0] / img_shape[0]
    w[0] = sizes[0] / img_shape[1]
    di = 1
    if len(sizes) > 1:
        h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
        w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
        di += 1
    for i, r in enumerate(ratios):
        h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)
        w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)
    return y, x, h, w

First, the line below builds the coordinate grid, which effectively gives the coordinates of every point on the feature map:

y, x = np.mgrid[0:feat_shape[0], 0:feat_shape[1]]

The next line normalizes these feature-map coordinates relative to the original image and adds an offset of 0.5: we want the center of each cell, and since grid points are spaced 1 apart, the center sits at +0.5. This corresponds to the formula in the paper for the default-box centers,

    ( (i + 0.5) / |f_k|,  (j + 0.5) / |f_k| ),

where |f_k| is the size of the k-th feature map (the code uses step / img_shape, which plays the role of 1 / |f_k|):

y = (y.astype(dtype) + offset) * step / img_shape[0]

The x coordinate is handled the same way; then an extra dimension is added with expand_dims so it can broadcast against the per-anchor heights and widths.

num_anchors is the number of default boxes on this layer.

h = np.zeros((num_anchors, ), dtype=dtype)
w = np.zeros((num_anchors, ), dtype=dtype)
# Add first anchor boxes with ratio=1.
h[0] = sizes[0] / img_shape[0]
w[0] = sizes[0] / img_shape[1]
di = 1
if len(sizes) > 1:
    h[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[0]
    w[1] = math.sqrt(sizes[0] * sizes[1]) / img_shape[1]
    di += 1
for i, r in enumerate(ratios):
    h[i+di] = sizes[0] / img_shape[0] / math.sqrt(r)
    w[i+di] = sizes[0] / img_shape[1] * math.sqrt(r)

This part computes the box sizes. Note that the first default box is always square (aspect ratio 1:1); the others are there to fit objects with different aspect ratios.

The rest is computed from the formulas above: sizes gives the box scale for this layer and ratios its aspect ratios. Note that the paper generates the default boxes differently. It defines the scale of the k-th feature map as

    s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1),   k in [1, m],

with s_min = 0.2 and s_max = 0.9, where k indexes the prediction feature maps (k = 1 being the first one). Given s_k, the width and height of each default box are

    w_k^a = s_k * sqrt(a_r),   h_k^a = s_k / sqrt(a_r)

for each aspect ratio a_r. When the aspect ratio is 1, one extra box is added with scale

    s'_k = sqrt(s_k * s_{k+1}).

The code, however, does not implement it this way; it specifies the box sizes directly. Let's see:


   
anchor_sizes=[(20.48, 51.2),
              (51.2, 133.12),
              (133.12, 215.04),
              (215.04, 296.96),
              (296.96, 378.88),
              (378.88, 460.8),
              (460.8, 542.72)],
anchor_ratios=[[2, .5],
               [2, .5, 3, 1./3],
               [2, .5, 3, 1./3],
               [2, .5, 3, 1./3],
               [2, .5, 3, 1./3],
               [2, .5],
               [2, .5]],

Notice that here s_k is an absolute size in pixels, not a ratio; the paper uses 0.2-0.9, and 512 x 0.2 does not reproduce the values in the code either. So in practice the anchor sizes can be chosen by hand, based on experience, and the box heights and widths are then computed from them as above. A worked example for block4 is sketched below.
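To make the computation concrete, here is what ssd_anchor_one_layer produces for block4 of SSD-512, with sizes = (20.48, 51.2) and ratios = [2, .5] taken from the parameters above (a standalone sketch mirroring the h/w loop):

import math

img = 512.0
sizes, ratios = (20.48, 51.2), [2, .5]

h = [sizes[0] / img,                          # square box, ratio 1:      0.0400
     math.sqrt(sizes[0] * sizes[1]) / img]    # extra sqrt(s_k*s_{k+1}):  0.0632
w = list(h)
for r in ratios:
    h.append(sizes[0] / img / math.sqrt(r))   # r=2: 0.0283,  r=0.5: 0.0566
    w.append(sizes[0] / img * math.sqrt(r))   # r=2: 0.0566,  r=0.5: 0.0283
print(h)  # relative heights of the 4 default boxes
print(w)  # relative widths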

Finally, the function returns the center coordinates and box sizes for this layer; they are collected in layers_anchors and returned. This is essentially the same anchor mechanism as Faster R-CNN.

3   Preprocessing Anchors and Ground Truth

     Let's look at the code:


   
def bboxes_encode(self, labels, bboxes, anchors,
                  scope=None):
    """Encode labels and bounding boxes.
    """
    return ssd_common.tf_ssd_bboxes_encode(
        labels, bboxes, anchors,
        self.params.num_classes,
        self.params.no_annotation_label,
        ignore_threshold=0.5,
        prior_scaling=self.params.prior_scaling,
        scope=scope)

Clearly this just calls ssd_common.tf_ssd_bboxes_encode; let's look at it:


   
def tf_ssd_bboxes_encode(labels,
                         bboxes,
                         anchors,
                         num_classes,
                         no_annotation_label,
                         ignore_threshold=0.5,
                         prior_scaling=[0.1, 0.1, 0.2, 0.2],
                         dtype=tf.float32,
                         scope='ssd_bboxes_encode'):
    """Encode groundtruth labels and bounding boxes using SSD net anchors.
    Encoding boxes for all feature layers.
    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors: List of Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.
    Return:
      (target_labels, target_localizations, target_scores):
        Each element is a list of target Tensors.
    """
    with tf.name_scope(scope):
        target_labels = []
        target_localizations = []
        target_scores = []
        for i, anchors_layer in enumerate(anchors):
            with tf.name_scope('bboxes_encode_block_%i' % i):
                t_labels, t_loc, t_scores = \
                    tf_ssd_bboxes_encode_layer(labels, bboxes, anchors_layer,
                                               num_classes, no_annotation_label,
                                               ignore_threshold,
                                               prior_scaling, dtype)
                target_labels.append(t_labels)
                target_localizations.append(t_loc)
                target_scores.append(t_scores)
        # t_labels: the GT class assigned to each anchor of this layer.
        # t_loc: the encoded transform (regression target) for each anchor.
        # t_scores: the best IoU between each anchor and the GT boxes.
        # target_labels: a list with, per layer, the GT class of every anchor.
        # target_localizations: per layer, the encoded transforms of all anchors.
        # target_scores: per layer, the best anchor/GT IoU.
        return target_labels, target_localizations, target_scores

From the function above we can see that it in turn calls tf_ssd_bboxes_encode_layer inside a loop over the feature layers used for prediction. Let's look at that function:


   
def tf_ssd_bboxes_encode_layer(labels,
                               bboxes,
                               anchors_layer,
                               num_classes,
                               no_annotation_label,
                               ignore_threshold=0.5,
                               prior_scaling=[0.1, 0.1, 0.2, 0.2],
                               dtype=tf.float32):
    """Encode groundtruth labels and bounding boxes using SSD anchors from
    one layer.
    Arguments:
      labels: 1D Tensor(int64) containing groundtruth labels;
      bboxes: Nx4 Tensor(float) with bboxes relative coordinates;
      anchors_layer: Numpy array with layer anchors;
      matching_threshold: Threshold for positive match with groundtruth bboxes;
      prior_scaling: Scaling of encoded coordinates.
    Return:
      (target_labels, target_localizations, target_scores): Target Tensors.
    """
    # Anchors coordinates and volume.
    yref, xref, href, wref = anchors_layer
    ymin = yref - href / 2.
    xmin = xref - wref / 2.
    ymax = yref + href / 2.
    xmax = xref + wref / 2.
    vol_anchors = (xmax - xmin) * (ymax - ymin)
    # Initialize tensors...
    shape = (yref.shape[0], yref.shape[1], href.size)
    feat_labels = tf.zeros(shape, dtype=tf.int64)
    feat_scores = tf.zeros(shape, dtype=dtype)
    feat_ymin = tf.zeros(shape, dtype=dtype)
    feat_xmin = tf.zeros(shape, dtype=dtype)
    feat_ymax = tf.ones(shape, dtype=dtype)
    feat_xmax = tf.ones(shape, dtype=dtype)

    def jaccard_with_anchors(bbox):
        """Compute jaccard score between a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        # Volumes.
        inter_vol = h * w
        union_vol = vol_anchors - inter_vol \
            + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
        jaccard = tf.div(inter_vol, union_vol)
        return jaccard

    def intersection_with_anchors(bbox):
        """Compute intersection between score a box and the anchors.
        """
        int_ymin = tf.maximum(ymin, bbox[0])
        int_xmin = tf.maximum(xmin, bbox[1])
        int_ymax = tf.minimum(ymax, bbox[2])
        int_xmax = tf.minimum(xmax, bbox[3])
        h = tf.maximum(int_ymax - int_ymin, 0.)
        w = tf.maximum(int_xmax - int_xmin, 0.)
        inter_vol = h * w
        scores = tf.div(inter_vol, vol_anchors)
        return scores

    def condition(i, feat_labels, feat_scores,
                  feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Condition: check label index.
        """
        # Element-wise comparison: loop while i < number of GT labels
        # (body() increments i), i.e. iterate over every GT box.
        r = tf.less(i, tf.shape(labels))
        return r[0]

    def body(i, feat_labels, feat_scores,
             feat_ymin, feat_xmin, feat_ymax, feat_xmax):
        """Body: update feature labels, scores and bboxes.
        Follow the original SSD paper for that purpose:
          - assign values when jaccard > 0.5;
          - only update if beat the score of other bboxes.
        """
        # Jaccard score.
        label = labels[i]
        bbox = bboxes[i]
        # IoU between the i-th GT box and every anchor of this layer.
        jaccard = jaccard_with_anchors(bbox)
        # Mask: check threshold + scores + no annotations + num_classes.
        # Keep anchors whose IoU improves on the best score seen so far.
        mask = tf.greater(jaccard, feat_scores)
        # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
        mask = tf.logical_and(mask, feat_scores > -0.5)
        mask = tf.logical_and(mask, label < num_classes)
        imask = tf.cast(mask, tf.int64)
        fmask = tf.cast(mask, dtype)
        # Update values using mask.
        feat_labels = imask * label + (1 - imask) * feat_labels
        # tf.where: take jaccard where mask is True, keep feat_scores otherwise.
        feat_scores = tf.where(mask, jaccard, feat_scores)
        feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
        feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
        feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
        feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
        # Check no annotation label: ignore these anchors...
        # interscts = intersection_with_anchors(bbox)
        # mask = tf.logical_and(interscts > ignore_threshold,
        #                       label == no_annotation_label)
        # # Replace scores by -1.
        # feat_scores = tf.where(mask, -tf.cast(mask, dtype), feat_scores)
        return [i+1, feat_labels, feat_scores,
                feat_ymin, feat_xmin, feat_ymax, feat_xmax]

    # Main loop definition.
    i = 0
    [i, feat_labels, feat_scores,
     feat_ymin, feat_xmin,
     feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                           [i, feat_labels, feat_scores,
                                            feat_ymin, feat_xmin,
                                            feat_ymax, feat_xmax])
    # Transform to center / size.
    feat_cy = (feat_ymax + feat_ymin) / 2.
    feat_cx = (feat_xmax + feat_xmin) / 2.
    feat_h = feat_ymax - feat_ymin
    feat_w = feat_xmax - feat_xmin
    # Encode features.
    # prior_scaling=[0.1, 0.1, 0.2, 0.2]
    feat_cy = (feat_cy - yref) / href / prior_scaling[0]
    feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
    feat_h = tf.log(feat_h / href) / prior_scaling[2]
    feat_w = tf.log(feat_w / wref) / prior_scaling[3]
    # Use SSD ordering: x / y / w / h instead of ours.
    feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
    # feat_labels: the GT class assigned to each anchor.
    # feat_localizations: the encoded transform (regression target).
    # feat_scores: the best IoU between each anchor and the GT boxes.
    return feat_labels, feat_localizations, feat_scores

This part is a bit awkward to read because it defines helper functions inside the function and places them in the middle, so the main logic is split into a first half and a second half with helpers in between; if you don't read to the end you might think it is already over.

yref, xref, href, wref = anchors_layer
ymin = yref - href / 2.
xmin = xref - wref / 2.
ymax = yref + href / 2.
xmax = xref + wref / 2.
vol_anchors = (xmax - xmin) * (ymax - ymin)

# Initialize tensors…
shape = (yref.shape[0], yref.shape[1], href.size)
feat_labels = tf.zeros(shape, dtype=tf.int64)
feat_scores = tf.zeros(shape, dtype=dtype)

feat_ymin = tf.zeros(shape, dtype=dtype)
feat_xmin = tf.zeros(shape, dtype=dtype)
feat_ymax = tf.ones(shape, dtype=dtype)
feat_xmax = tf.ones(shape, dtype=dtype)

This is how it starts: ymin, xmin, ymax, xmax convert the (center, size) coordinates of the anchors into top-left and bottom-right corners, which makes the IoU computation convenient. Note that yref and friends are NumPy arrays holding the centers of the whole feature map, so these operations rely on NumPy broadcasting: they act on the entire layer at once, not on a single box. shape is the tensor shape, and feat_labels, feat_scores, feat_ymin and so on are allocated to store the results, with the same shape as the per-anchor coordinates. The shapes involved are illustrated below.
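To see the broadcasting, these are the array shapes for block4 of SSD-512, with a 64x64 feature map and 4 default boxes per location (a standalone sketch):

import numpy as np
yref = np.zeros((64, 64, 1))   # center y of every cell (expand_dims added the last axis)
href = np.zeros((4,))          # one height per default box
ymin = yref - href / 2.        # broadcasts: one box per cell per anchor
print(ymin.shape)              # (64, 64, 4)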

   Next, we should skip over those inner functions and look at what follows them:

# Main loop definition.
i = 0
[i, feat_labels, feat_scores,
 feat_ymin, feat_xmin,
 feat_ymax, feat_xmax] = tf.while_loop(condition, body,
                                       [i, feat_labels, feat_scores,
                                        feat_ymin, feat_xmin,
                                        feat_ymax, feat_xmax])
# Transform to center / size.
feat_cy = (feat_ymax + feat_ymin) / 2.
feat_cx = (feat_xmax + feat_xmin) / 2.
feat_h = feat_ymax - feat_ymin
feat_w = feat_xmax - feat_xmin
# Encode features.
# prior_scaling=[0.1, 0.1, 0.2, 0.2]
feat_cy = (feat_cy - yref) / href / prior_scaling[0]
feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
feat_h = tf.log(feat_h / href) / prior_scaling[2]
feat_w = tf.log(feat_w / wref) / prior_scaling[3]
# Use SSD ordering: x / y / w / h instead of ours.
feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)

tf.while_loop() executes body as long as condition holds, passing along the list of loop variables. Let's look at condition first:


def condition(i, feat_labels, feat_scores,
              feat_ymin, feat_xmin, feat_ymax, feat_xmax):
    # Element-wise comparison: loop while i < number of GT labels
    # (body() increments i), i.e. iterate over every GT box.
    r = tf.less(i, tf.shape(labels))
    return r[0]
As the comment says, tf.less is an element-wise comparison: it is true as long as i < tf.shape(labels), and the first element of the result is returned. This simply iterates over all ground-truth boxes. Now let's look at body:
def body(i, feat_labels, feat_scores,
         feat_ymin, feat_xmin, feat_ymax, feat_xmax):
    label = labels[i]
    bbox = bboxes[i]
    # IoU between the i-th GT box and all anchors of this layer.
    jaccard = jaccard_with_anchors(bbox)
    # Mask: check threshold + scores + no annotations + num_classes.
    # Keep anchors whose IoU improves on the best score seen so far.
    mask = tf.greater(jaccard, feat_scores)
    # mask = tf.logical_and(mask, tf.greater(jaccard, matching_threshold))
    mask = tf.logical_and(mask, feat_scores > -0.5)
    mask = tf.logical_and(mask, label < num_classes)
    imask = tf.cast(mask, tf.int64)
    fmask = tf.cast(mask, dtype)
    # Update values using mask.
    feat_labels = imask * label + (1 - imask) * feat_labels
    # tf.where: take jaccard where mask is True, keep feat_scores otherwise.
    feat_scores = tf.where(mask, jaccard, feat_scores)
    feat_ymin = fmask * bbox[0] + (1 - fmask) * feat_ymin
    feat_xmin = fmask * bbox[1] + (1 - fmask) * feat_xmin
    feat_ymax = fmask * bbox[2] + (1 - fmask) * feat_ymax
    feat_xmax = fmask * bbox[3] + (1 - fmask) * feat_xmax
    return [i+1, feat_labels, feat_scores,
            feat_ymin, feat_xmin, feat_ymax, feat_xmax]

jaccard_with_anchors simply returns the IoU:
def jaccard_with_anchors(bbox):
    """Compute jaccard score between a box and the anchors."""
    int_ymin = tf.maximum(ymin, bbox[0])
    int_xmin = tf.maximum(xmin, bbox[1])
    int_ymax = tf.minimum(ymax, bbox[2])
    int_xmax = tf.minimum(xmax, bbox[3])
    h = tf.maximum(int_ymax - int_ymin, 0.)
    w = tf.maximum(int_xmax - int_xmin, 0.)
    # Volumes.
    inter_vol = h * w
    union_vol = vol_anchors - inter_vol \
        + (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])
    jaccard = tf.div(inter_vol, union_vol)
    return jaccard

First compute the coordinates of the intersection, then its area, then the IoU; fairly simple. A standalone NumPy version is sketched below.
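For intuition, here is the same computation in plain NumPy for one GT box against two anchors (a standalone sketch; boxes are in [ymin, xmin, ymax, xmax] relative coordinates):

import numpy as np

anchors = np.array([[0.00, 0.00, 0.50, 0.50],     # anchor 1
                    [0.25, 0.25, 0.75, 0.75]])    # anchor 2
gt = np.array([0.2, 0.2, 0.6, 0.6])               # one ground-truth box

int_ymin = np.maximum(anchors[:, 0], gt[0])
int_xmin = np.maximum(anchors[:, 1], gt[1])
int_ymax = np.minimum(anchors[:, 2], gt[2])
int_xmax = np.minimum(anchors[:, 3], gt[3])
inter = np.maximum(int_ymax - int_ymin, 0.) * np.maximum(int_xmax - int_xmin, 0.)
vol_anchors = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
vol_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
iou = inter / (vol_anchors + vol_gt - inter)
print(iou)   # roughly [0.28, 0.43]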

This is where anchors whose IoU does not improve on the best score seen so far are filtered out:

mask = tf.greater(jaccard, feat_scores)

tf.greater compares element-wise: true where jaccard > feat_scores, false otherwise. tf.logical_and is true only where both operands are true.

feat_labels = imask * label + (1 - imask) * feat_labels

In the line above, where imask is 1 the anchor gets the GT label; otherwise the label stays 0, i.e. background. When is imask 1? imask = tf.cast(mask, tf.int64), and mask requires jaccard > feat_scores. Because this runs inside the loop over all GT objects, the matching rule is: each anchor ends up assigned to the GT box with which it has the highest IoU. The paper's strategy has two parts (match each GT to its best anchor, then also match any anchor with IoU above 0.5); this code only implements the "keep the best IoU per anchor" part here and leaves the 0.5 threshold to the loss, so a GT whose best IoU is below 0.5 is not forced to get a match. feat_scores = tf.where(mask, jaccard, feat_scores) updates feat_scores accordingly, which is exactly what makes "keep the largest IoU" happen. A tiny NumPy illustration of this rule follows.
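Iterating over GT boxes while keeping the running maximum is equivalent to a per-anchor argmax over the GT boxes. A small standalone sketch with made-up numbers (not from the repo):

import numpy as np
# Hypothetical IoU matrix: 3 anchors x 2 ground-truth boxes.
iou = np.array([[0.1, 0.6],
                [0.7, 0.2],
                [0.3, 0.4]])
labels = np.array([5, 8])                 # class of each GT box

feat_scores = np.zeros(3)
feat_labels = np.zeros(3, dtype=int)
for g in range(iou.shape[1]):             # same update rule as body()
    mask = iou[:, g] > feat_scores
    feat_labels = np.where(mask, labels[g], feat_labels)
    feat_scores = np.where(mask, iou[:, g], feat_scores)

print(feat_labels)   # [8 5 8]: each anchor keeps its best-matching GT class
print(feat_scores)   # [0.6 0.7 0.4]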

feat_ymin and the rest are updated in the same way: if this GT box has the higher IoU for an anchor, its coordinates are stored for that anchor, and the loop moves on to the next GT box. After the loop, we reach the code below:

# Transform to center / size.
feat_cy = (feat_ymax + feat_ymin) / 2.
feat_cx = (feat_xmax + feat_xmin) / 2.
feat_h = feat_ymax - feat_ymin
feat_w = feat_xmax - feat_xmin
# Encode features.
# prior_scaling=[0.1, 0.1, 0.2, 0.2]
feat_cy = (feat_cy - yref) / href / prior_scaling[0]
feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
feat_h = tf.log(feat_h / href) / prior_scaling[2]
feat_w = tf.log(feat_w / wref) / prior_scaling[3]
# Use SSD ordering: x / y / w / h instead of ours.
feat_localizations = tf.stack([feat_cx, feat_cy, feat_w, feat_h], axis=-1)
# feat_labels: the GT class assigned to each anchor.
# feat_localizations: the encoded transform (regression target).
# feat_scores: the best IoU between each anchor and the GT boxes.
return feat_labels, feat_localizations, feat_scores
feat_cy and friends convert the corner coordinates (top-left and bottom-right) back to center coordinates plus height and width; again this is broadcasting over the whole layer at once. As for prior_scaling, the paper does not mention it; it corresponds to the "variance" values (0.1, 0.1, 0.2, 0.2) used in the original Caffe SSD to rescale the regression targets. The key part is this:

feat_cy = (feat_cy - yref) / href / prior_scaling[0]
feat_cx = (feat_cx - xref) / wref / prior_scaling[1]
feat_h = tf.log(feat_h / href) / prior_scaling[2]
feat_w = tf.log(feat_w / wref) / prior_scaling[3]

This is exactly the encoding from the paper (the same form as Faster R-CNN):

    t_cx = (cx - cx_a) / w_a,    t_cy = (cy - cy_a) / h_a,
    t_w  = log(w / w_a),         t_h  = log(h / h_a),

with the code additionally dividing each term by prior_scaling. It is the same idea as Faster R-CNN: what is regressed is the transform between the ground-truth box and the anchor. Rearranging any of these equations shows that applying the predicted transform to the anchor recovers the real box. So the network does not predict boxes directly; it predicts a transform, and the localization loss compares the predicted transform with the transform between the GT box and the anchor. This is different from YOLO and the same as Faster R-CNN. The per-layer results are then returned and collected in:

target_labels.append(t_labels)
target_localizations.append(t_loc)
target_scores.append(t_scores)
# t_labels: the GT class assigned to each anchor.
# t_loc: the encoded transform (regression target) for each anchor.
# t_scores: the best IoU between each anchor and the GT boxes.
# target_labels: a list with, per layer, the GT class of every anchor.
# target_localizations: per layer, the encoded transforms of all anchors.
# target_scores: per layer, the best anchor/GT IoU.

 Next, the loss function:

4   SSD Loss Function

     Let's look at the code first:


 
def losses(self, logits, localisations,
           gclasses, glocalisations, gscores,
           match_threshold=0.5,
           negative_ratio=3.,
           alpha=1.,
           label_smoothing=0.,
           scope='ssd_losses'):
    """Define the SSD network losses.
    """
    return ssd_losses(logits, localisations,
                      gclasses, glocalisations, gscores,
                      match_threshold=match_threshold,
                      negative_ratio=negative_ratio,
                      alpha=alpha,
                      label_smoothing=label_smoothing,
                      scope=scope)

    Again this just calls another function, which is what makes the code a bit tedious to follow. A word on the arguments: logits is the per-layer output before softmax, localisations are the predicted boxes, and the g-prefixed arguments are the ground-truth targets produced by the encoding above. negative_ratio is the negative-to-positive ratio, 1:3 in favor of negatives, same as the paper. label_smoothing is set to 0, so no smoothing is applied (you may remember it from GAN losses).

Now let's look at ssd_losses:


 
def ssd_losses(logits, localisations,
               gclasses, glocalisations, gscores,
               match_threshold=0.5,
               negative_ratio=3.,
               alpha=1.,
               label_smoothing=0.,
               scope=None):
    """Loss functions for training the SSD 300 VGG network.
    This function defines the different loss components of the SSD, and
    adds them to the TF loss collection.
    Arguments:
      logits: (list of) predictions logits Tensors;
      localisations: (list of) localisations Tensors;
      gclasses: (list of) groundtruth labels Tensors;
      glocalisations: (list of) groundtruth localisations Tensors;
      gscores: (list of) groundtruth score Tensors;
    """
    with tf.name_scope(scope, 'ssd_losses'):
        l_cross_pos = []
        l_cross_neg = []
        l_loc = []
        for i in range(len(logits)):
            dtype = logits[i].dtype
            with tf.name_scope('block_%i' % i):
                # Determine weights Tensor.
                pmask = gscores[i] > match_threshold
                fpmask = tf.cast(pmask, dtype)
                n_positives = tf.reduce_sum(fpmask)
                # Select some random negative entries.
                # n_entries = np.prod(gclasses[i].get_shape().as_list())
                # r_positive = n_positives / n_entries
                # r_negative = negative_ratio * n_positives / (n_entries - n_positives)
                # Negative mask.
                no_classes = tf.cast(pmask, tf.int32)
                predictions = slim.softmax(logits[i])
                nmask = tf.logical_and(tf.logical_not(pmask),
                                       gscores[i] > -0.5)
                fnmask = tf.cast(nmask, dtype)
                nvalues = tf.where(nmask,
                                   predictions[:, :, :, :, 0],
                                   1. - fnmask)
                nvalues_flat = tf.reshape(nvalues, [-1])
                # Number of negative entries to select.
                n_neg = tf.cast(negative_ratio * n_positives, tf.int32)
                n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)
                n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)
                max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
                n_neg = tf.minimum(n_neg, max_neg_entries)
                val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
                minval = val[-1]
                # Final negative mask.
                nmask = tf.logical_and(nmask, -nvalues > minval)
                fnmask = tf.cast(nmask, dtype)
                # Add cross-entropy loss.
                with tf.name_scope('cross_entropy_pos'):
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=gclasses[i])
                    loss = tf.losses.compute_weighted_loss(loss, fpmask)
                    l_cross_pos.append(loss)
                with tf.name_scope('cross_entropy_neg'):
                    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                                          labels=no_classes)
                    loss = tf.losses.compute_weighted_loss(loss, fnmask)
                    l_cross_neg.append(loss)
                # Add localization loss: smooth L1, L2, ...
                with tf.name_scope('localization'):
                    # Weights Tensor: positive mask + random negative.
                    weights = tf.expand_dims(alpha * fpmask, axis=-1)
                    loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i])
                    loss = tf.losses.compute_weighted_loss(loss, weights)
                    l_loc.append(loss)
        # Additional total losses...
        with tf.name_scope('total'):
            total_cross_pos = tf.add_n(l_cross_pos, 'cross_entropy_pos')
            total_cross_neg = tf.add_n(l_cross_neg, 'cross_entropy_neg')
            total_cross = tf.add(total_cross_pos, total_cross_neg, 'cross_entropy')
            total_loc = tf.add_n(l_loc, 'localization')
            # Add to EXTRA LOSSES TF.collection
            tf.add_to_collection('EXTRA_LOSSES', total_cross_pos)
            tf.add_to_collection('EXTRA_LOSSES', total_cross_neg)
            tf.add_to_collection('EXTRA_LOSSES', total_cross)
            tf.add_to_collection('EXTRA_LOSSES', total_loc)
Here,

pmask = gscores[i] > match_threshold
fpmask = tf.cast(pmask, dtype)
n_positives = tf.reduce_sum(fpmask)

performs another round of selection: anchors whose best IoU with a GT box exceeds 0.5 are treated as positives. fpmask marks positives versus negatives, and n_positives is the number of positives.
no_classes = tf.cast(pmask, tf.int32)
predictions = slim.softmax(logits[i])
nmask = tf.logical_and(tf.logical_not(pmask),
                       gscores[i] > -0.5)
fnmask = tf.cast(nmask, dtype)
nvalues = tf.where(nmask,
                   predictions[:, :, :, :, 0],
                   1. - fnmask)
nvalues_flat = tf.reshape(nvalues, [-1])

no_classes casts the boolean mask to integers, so it is 1 for foreground anchors and 0 for background. predictions holds the predicted probability of every class. nmask is the negative mask: tf.logical_not(pmask) flips the positive mask, and the extra gscores[i] > -0.5 condition excludes anchors whose score was earlier marked below zero (the "ignore" case kept from the commented-out code above). nvalues takes, for negatives, the predicted probability of the background class (channel 0), and 1.0 everywhere else; it is then flattened. tf.where(cond, x, y) picks x where cond is true and y otherwise.

n_neg = tf.cast(negative_ratio * n_positives, tf.int32)
n_neg = tf.maximum(n_neg, tf.size(nvalues_flat) // 8)
n_neg = tf.maximum(n_neg, tf.shape(nvalues)[0] * 4)
max_neg_entries = 1 + tf.cast(tf.reduce_sum(fnmask), tf.int32)
n_neg = tf.minimum(n_neg, max_neg_entries)

val, idxes = tf.nn.top_k(-nvalues_flat, k=n_neg)
minval = val[-1]
# Final negative mask.
nmask = tf.logical_and(nmask, -nvalues > minval)
fnmask = tf.cast(nmask, dtype)

n_neg is the number of negatives to keep; negative_ratio is the negative-to-positive ratio, 3 by default. The two tf.maximum calls put a floor on it, so there are always some negatives even when positives are few. max_neg_entries is the total number of available negatives, and n_neg = tf.minimum(n_neg, max_neg_entries) just caps the request in case three times the positives exceeds the negatives that actually exist, so we always ask for a feasible amount. tf.nn.top_k selects the k = n_neg largest entries of -nvalues, i.e. the negatives with the smallest predicted background probability: the hardest negatives. minval is the cut-off value among the selected entries, and nmask/fnmask are rebuilt to mark exactly those selected hard negatives (they must be negatives and satisfy -nvalues > minval); fnmask is the same mask cast from bool to float (dtype is a float type). This selection is illustrated below.
Now the classification loss for the positives:

# Add cross-entropy loss.
with tf.name_scope('cross_entropy_pos'):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                          labels=gclasses[i])
    loss = tf.losses.compute_weighted_loss(loss, fpmask)
    l_cross_pos.append(loss)

This is the positive loss, a standard cross-entropy. tf.losses.compute_weighted_loss is essentially loss x fpmask (followed by a reduction); multiplying by fpmask filters out the negatives, whose entries in fpmask are 0 while the positives are 1.
with tf.name_scope('cross_entropy_neg'):
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits[i],
                                                          labels=no_classes)
    loss = tf.losses.compute_weighted_loss(loss, fnmask)
    l_cross_neg.append(loss)

This is the loss for the negatives, also a cross-entropy; here fnmask filters out everything except the selected hard negatives, and in no_classes the negatives' label is 0 (background).

 
with tf.name_scope('localization'):
    # Weights Tensor: positive mask + random negative.
    weights = tf.expand_dims(alpha * fpmask, axis=-1)
    loss = custom_layers.abs_smooth(localisations[i] - glocalisations[i])
    loss = tf.losses.compute_weighted_loss(loss, weights)
    l_loc.append(loss)

This is the box-regression loss. The paper uses the smooth L1 loss:

    smooth_L1(x) = 0.5 * x^2      if |x| < 1
                   |x| - 0.5      otherwise

Now look at the corresponding function in the code:


 
def abs_smooth(x):
    """Smoothed absolute function. Useful to compute an L1 smooth error.
    Define as:
        x^2 / 2        if abs(x) < 1
        abs(x) - 0.5   if abs(x) > 1
    We use here a differentiable definition using min(x) and abs(x). Clearly
    not optimal, but good enough for our purpose!
    """
    absx = tf.abs(x)
    minx = tf.minimum(absx, 1)
    r = 0.5 * ((absx - 1) * minx + absx)
    return r

This is exactly the smooth L1 loss above, just written in a branch-free form. The weights tensor again filters out anchors that have no assigned object, and alpha appears because the paper has an α weight too, with a default of 1. For reference, the paper's overall objective is

    L(x, c, l, g) = (1/N) * ( L_conf(x, c) + α * L_loc(x, l, g) ),

where N is the number of matched default boxes, so what the code computes matches the loss in the paper. A quick numeric check of abs_smooth against the piecewise definition is sketched below.
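The branch-free formula can be compared against the piecewise definition in NumPy (a standalone sketch):

import numpy as np

x = np.linspace(-3, 3, 1001)
absx = np.abs(x)
branch_free = 0.5 * ((absx - 1) * np.minimum(absx, 1) + absx)   # same algebra as abs_smooth
piecewise = np.where(absx < 1, 0.5 * x**2, absx - 0.5)          # smooth L1 from the paper
print(np.allclose(branch_free, piecewise))                      # True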

That covers the core of the code. The training part has quite a bit more (multi-GPU support, preprocessing, and so on), but the core ideas are the ones above; I may cover the rest another time.

