人脸检测与识别：MTCNN人脸检测

最新推荐文章于 2022-05-12 10:33:40 发布

原创

最新推荐文章于 2022-05-12 10:33:40 发布 · 4.4k 阅读

34 ·

CC 4.0 BY-SA版权

本文详细介绍了基于MTCNN的人脸检测流程，包括数据获取、Pnet数据集制作、Pnet训练、Rnet和Onet的实现。通过使用WIDER_FACE和lfw_net数据集，构建了包含四个TFRecord的Pnet数据集，并讨论了训练过程中的问题和解决方案。最后，提供了人脸检测的检测器代码并展示了测试结果。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

注：此后更新代码版本均在github上，更快更准，博客不做更新~

2019.3.20

本文代码参考：

参考代码1：https://github.com/AITTSMD/MTCNN-Tensorflow

参考代码2：https://github.com/Seanlinx/mtcnn

参考代码3：https://github.com/CongWeilin/mtcnn-caffe

参考代码4：https://github.com/kpzhang93/MTCNN_face_detection_alignment

在此对其表示衷心的感谢。

基于MTCNN的人脸检测

项目环境及配置：Window10+GTX 1060+Python3.6+Anaconda5.2.0+Spyder+Tensorflow1.9-gpu

本文是对《Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks》论文的复现，此时网上已有N篇论文解读以及各种版本的代码复现，优劣参差不齐，大家可择优选读。

1、数据获取

本文训练集采用WIDER_FACE与lfw_net，选用与论文不同的数据集是因为Celeba数据集标注有很多错误，我在参考代码时看到了另一种数据集也可以很好的使用。

WIDER_FACE的标注格式在其wider_face_split 文件夹下的readme.txt 内。标注格式如下所示：

The format of txt ground truth.
File name
Number of bounding box
x1, y1, w, h, blur, expression, illumination, invalid, occlusion, pose

第一行为图片名称，第二行为人脸框数量，第三行为标注，本文只关注前4个标注。

lfw_net的标注格式可以在其下载网站的Face detector code 的readme.txt 内。标注格式如下所示：

Each line starts with the image name
followed by the left, right, top, and bottom boundary positions of the face bounding boxes.

每一行第一个字符串为图片名称，接下来分别为左、右、上、下坐标。

注：参考代码中数据集制作时引用的数据标注格式全都是错的，而生成hard_sample样本时格式又是对的？！我不知道这几位大佬是引用的是哪个改了名了标注文本或者用的.mat格式，反正正常下载后打开的txt格式跟他们代码写的完全不一样。遂想直接参考他们代码的童鞋还是先把我的数据集制作这块看完再跑路也不迟。

2、Pnet数据集制作

本项目数据集全部使用Tensorflow的TFRecord格式，比较方便。

TFRecord在输出数据时存在着shuffle不均匀的情况，我在项目1阶段制作数据的时候已经发现，所以必须放一点positive，放一点negative。在参考大佬代码时，大佬也提出全放进一个TFRecord内训练ONet与RNet时比例不均，所以本文三个网络的数据集全部由四个TFRecord构成。

由于本文有回归任务，所以需要记录每张截取人脸图片的bounding_box和landmark，参考大佬的思想，引入txt文档记录。

本文代码地址在文章后尾，可直接参考。

首先还是引入IoU的概念。

（tool.py）：

# -*- coding: utf-8 -*-
"""
@author: friedhelm

"""
import numpy as np
import cv2
import tensorflow as tf

def IoU(box, boxes):
    """
    Compute IoU between detect box and face boxes

    Parameters:
    ----------
    box: numpy array , shape (4, ): x1, y1, x2, y2
         random produced box
    boxes: numpy array, shape (n, 4): x1, y1, w, h
         input ground truth face boxes

    Returns:
    -------
    ovr: numpy.array, shape (n, )
        IoU
    """   
    
    box_area = (box[2] - box[0] + 1) * (box[3] - box[1] + 1)
    area = boxes[:, 2]*boxes[:, 3]
    
    x_right=boxes[:, 2]+boxes[:, 0]
    y_bottom=boxes[:, 3]+boxes[:, 1]
    
    xx1 = np.maximum(box[0], boxes[:, 0])
    yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], x_right)
    yy2 = np.minimum(box[3], y_bottom)

    # compute the width and height of the bounding box
    w = np.maximum(0, xx2 - xx1 + 1)
    h = np.maximum(0, yy2 - yy1 + 1)

    inter = w * h
    ovr = inter / (box_area + area - inter)
    return ovr


def NMS(box,_overlap):
    
    if len(box) == 0:
        return []
    
    #xmin, ymin, xmax, ymax, score, cropped_img, scale
    box.sort(key=lambda x :x[4])
    box.reverse()

    pick = []
    x_min = np.array([box[i][0] for i in range(len(box))],np.float32)
    y_min = np.array([box[i][1] for i in range(len(box))],np.float32)
    x_max = np.array([box[i][2] for i in range(len(box))],np.float32)
    y_max = np.array([box[i][3] for i in range(len(box))],np.float32)

    area = (x_max-x_min)*(y_max-y_min)
    idxs = np.array(range(len(box)))

    while len(idxs) > 0:
        i = idxs[0]
        pick.append(i)

        xx1 = np.maximum(x_min[i],x_min[idxs[1:]])
        yy1 = np.maximum(y_min[i],y_min[idxs[1:]])
        xx2 = np.minimum(x_max[i],x_max[idxs[1:]])
        yy2 = np.minimum(y_max[i],y_max[idxs[1:]])

        w = np.maximum(xx2-xx1,0)
        h = np.maximum(yy2-yy1,0)

        overlap = (w*h)/(area[idxs[1:]] + area[i] - w*h)

        idxs = np.delete(idxs, np.concatenate(([0],np.where(((overlap >= _overlap) & (overlap <= 1)))[0]+1)))
    
    return [box[i] for i in pick]


def featuremap(sess,graph,img,scale,map_shape,stride,threshold):
    
    left=0
    up=0
    boundingBox=[]
    
    images=graph.get_tensor_by_name("input/image:0")
    label= graph.get_tensor_by_name("output/label:0")
    roi= graph.get_tensor_by_name("output/roi:0")
    landmark= graph.get_tensor_by_name("output/landmark:0")
    img1=np.reshape(img,(-1,img.shape[0],img.shape[1],img.shape[2]))

    a,b,c=sess.run([label,roi,landmark],feed_dict={images:img1})
    a=np.reshape(a,(-1,2)) 
    b=np.reshape(b,(-1,4)) 
    c=np.reshape(c,(-1,10)) 
    for idx,prob in enumerate(a):
            
        if prob[1]>threshold:
            biasBox=[]
            biasBox.extend([float(left*stride)/scale,float(up*stride)/scale, float(left*stride+map_shape)/scale, float(up*stride+map_shape)/scale,prob[1]])
            biasBox.extend(b[idx])
            biasBox.extend(c[idx])
            boundingBox.append(biasBox)
            
        #防止左越界与下越界
        if (left*stride+map_shape<img.shape[1]):
            left+=1
        elif (up*stride+map_shape<img.shape[0]): 
            left=0
            up+=1
        else : break
            
    return boundingBox


def flip(img,facemark):
    img=cv2.flip(img,1)
    facemark[[0,1]]=facemark[[1,0]]
    facemark[[3,4]]=facemark[[4,3]]   
    return (img,facemark)


def read_single_tfrecord(addr,_batch_size,shape):
    
    filename_queue = tf.train.string_input_producer([addr],shuffle=True)

    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue) 

    features = tf.parse_single_example(serialized_example,
                                   features={
                                   'img':tf.FixedLenFeature([],tf.string),
                                   'label':tf.FixedLenFeature([],tf.int64),                                   
                                   'roi':tf.FixedLenFeature([4],tf.float32),
                                   'landmark':tf.FixedLenFeature([10],tf.float32),
                                   })
    img=tf.decode_raw(features['img'],tf.uint8)
    label=tf.cast(features['label'],tf.int32)
    roi=tf.cast(features['roi'],tf.float32)
    landmark=tf.cast(features['landmark'],tf.float32)
    img = tf.reshape(img, [shape,shape,3])     
    min_after_dequeue = 10000
    batch_size = _batch_size
    capacity = min_after_dequeue + 10 * batch_size
    image_batch, label_batch, roi_batch, landmark_batch = tf.train.shuffle_batch([img,label,roi,landmark], 
                                                        batch_size=batch_size, 
                                                        capacity=capacity, 
                                                        min_after_dequeue=min_after_dequeue,
                                                        num_threads=7)  
    
    label_batch = tf.reshape(label_batch, [batch_size])
    roi_batch = tf.reshape(roi_batch,[batch_size,4])
    landmark_batch = tf.reshape(landmark_batch,[batch_size,10])
    
    return image_batch, label_batch, roi_batch, landmark_batch

   
def read_multi_tfrecords(addr,_batch_size,shape):
    
    pos_dir,part_dir,neg_dir,landmark_dir = addr
    pos_batch_size,part_batch_size,neg_batch_size,landmark_batch_size = _batch_size   
    
    pos_image,pos_label,pos_roi,pos_landmark = read_single_tfrecord(pos_dir, pos_batch_size, shape)
    part_image,part_label,part_roi,part_landmark = read_single_tfrecord(part_dir, part_batch_size, shape)
    neg_image,neg_label,neg_roi,neg_landmark = read_single_tfrecord(neg_dir, neg_batch_size, shape)
    landmark_image,landmark_label,landmark_roi,landmark_landmark = read_single_tfrecord(landmark_dir, landmark_batch_size, shape)

    images = tf.concat([pos_image,part_image,neg_image,landmark_image], 0, name="concat/image")
    labels = tf.concat([pos_label,part_label,neg_label,landmark_label],0,name="concat/label")
    rois = tf.concat([pos_roi,part_roi,neg_roi,landmark_roi],0,name="concat/roi")
    landmarks = tf.concat([pos_landmark,part_landmark,neg_landmark,landmark_landmark],0,name="concat/landmark")
    
    return images,labels,rois,landmarks    
    

def image_color_distort(inputs):
    inputs = tf.image.random_contrast(inputs, lower=0.5, upper=1.5)
    inputs = tf.image.random_brightness(inputs, max_delta=0.2)
    inputs = tf.image.random_hue(inputs,max_delta= 0.2)
    inputs = tf.image.random_saturation(inputs,lower = 0.5, upper= 1.5)

    return inputs

首先进行人脸样本制作，需要生成pos、part以及neg样本。

x,y,w,h为图片参数，x1,y1,w1,h1为人脸框参数

1、neg样本制作时可以随机选择size大小，其范围在[12,min（w，h）/2]之间即可；左顶点的坐标范围选在[0,w-size]之间即可，因为负样本框不可超出原图像，右底点最大坐标为min（w,h）/2+w|h-min（w,h）/2。

3、pos样本制作，即在每个人脸框周围选择IoU大于0.65的正样本框。还是需要引入偏移量，已知人脸框中心点为x1|y1+（w1|h1）/2，我们令其size可在[（x1|y1）*0.8，（w1|h1）*1.2]范围内，即为人脸框长宽的0.8~1.2倍。引入偏移量delta范围设为[（x1|y1）*-0.2，（w1|h1）*0.2]，即偏移量为人脸框长宽的-0.2~0.2倍，此时人脸框中心点坐标加上偏移量坐标再减去size/2即为正样本框的左顶点坐标。

4、part样本与pos样本制作方法一致，在每个人脸框周围选择IoU小于0.65大于0.3的样本框即可。

5、在保存回归框坐标的时候保存的是偏移量坐标，即现在的框与真实的人脸框的偏移量，所以令nx1为样本框的x坐标，须计算offset_x1 = (x1 - nx1) / float(size)，保存。

（gen_classify_regression_data.py）：

# -*- coding: utf-8 -*-
"""
@author: friedhelm

"""
from core.tool import IoU
import numpy as np
from numpy.random import randint
import cv2
import os
import time

def main():

    f1 = open(os.path.join(save_dir, 'pos_%d.txt'%(img_size)), 'w')
    f2 = open(os.path.join(save_dir, 'neg_%d.txt'%(img_size)), 'w')
    f3 = open(os.path.join(save_dir, 'par_%d.txt'%(img_size)), 'w')    
    
    with open(WIDER_spilt_dir) as filenames:
        p=0
        neg_idx=0
        pos_idx=0
        par_idx=0
        for line in filenames.readlines():
            line=line.strip().split(' ')
            if(p==0):
                pic_dir=line[0]
                p=1
                boxes=[]
            elif(p==1):
                k=int(line[0])
                p=2
            elif(p==2):
                b=[]            
                k=k-1
                if(k==0):
                    p=0                
                for i in range(4):
                    b.append(int(line[i]))
                boxes.append(b)
                # format of boxes is [x,y,w,h]
                if(p==0):
                    img=cv2.imread(os.path.join(WIDER_dir,pic_dir).replace('/','\\'))
                    h,w,c=img.shape
                    
                    #save num negative pics whose IoU less than 0.3
                    num=50
                    while(num):
                        size=randint(12,min(w,h)/2)
                        x=randint(0,w-size)
                        y=randint(0,h-size)
                        if(np.max(IoU(np.array([x,y,x+size,y+size]),np.array(boxes)))<0.3):
                            resized_img = cv2.resize(img[y:y+size,x:x+size,:], (img_size, img_size))
                            cv2.imwrite(os.path.join(negative_dir,'neg_%d.jpg'%(neg_idx)),resized_img)
                            f2.write(os.path.join(negative_dir,'neg_%d.jpg'%(neg_idx)) + ' 0\n')
                            neg_idx=neg_idx+1
                            num=num-1       

                    for box in boxes:
                        if((box[0]<0)|(box[1]<0)|(max(box[2],box[3])<20)|(min(box[2],box[3])<=5)): 
                            continue  
                        x1, y1, w1, h1 = box
                        
                        # crop images near the bounding box if IoU less than 0.3, save as negative samples
                        for i in range(10):
                            size = randint(12, min(w, h) / 2)
                            delta_x = randint(max(-size, -x1), w1)
                            delta_y = randint(max(-size, -y1), h1)
                            nx1 = int(max(0, x1 + delta_x))
                            ny1 = int(max(0, y1 + delta_y))
                            if((nx1 + size > w1)|(ny1 + size > h1)):
                                continue