My Machine Learning Part (Notes from a 2020 Project)
0. Introduction
Flashy models are only a means; solving the problem is the goal.
I worked through the National Day holiday on an urgent project, and this post is mainly a record of one real experience using machine learning to solve an actual problem.
Main parts:
(1) Environment setup: mostly handled by a colleague. I had installed Ubuntu at home myself, but the company's CentOS boxes are a different story, and time was tight, so we split the work: he focused on the environment while I built the dataset, trained the models, and stood up part of the serving side for production. Getting the environment right was a long slog of trial and error: GPU driver, CUDA, and cuDNN on CentOS, plus installing Docker, tf-server, and so on.
(2) Model training: two parts. The first is two image-scoring models, built with Keras and ResNet50; the second uses a decision tree to write rules, which involves a small trick for printing out the tree.
(3) Service deployment: two layers. The inner one is tf-server, which receives the image data; the outer one is a web service that takes an image-URL request, does the preprocessing, parses tf-server's response, and produces output according to the rules extracted from the decision tree.
1. Environment Setup: CentOS
Disclaimer: no credit-grabbing here, this is purely a record; most of this particular pile of pain was eaten by my teammate @畅 (doge).
- GPU driver and CUDA: needed for model train and predict. We mainly followed https://www.cnblogs.com/yangsichao/p/8026234.html (my teammate skipped step 4). For CUDA, grab a recent release from the official site, 10.1 or later, then install cuDNN in order, and install tensorflow_gpu via conda. One spot in the driver install may need this reference: https://blog.youkuaiyun.com/carey2017/article/details/82799850. (A quick sanity-check sketch follows after this list.)
- Conda commands, for the record: conda create -n tf python=3.7, conda activate tf, conda deactivate. For the tf-server client side, it's just pip install tensorflow-serving-api.
- Package sources, to make installs less painful: yum source setup per https://www.cnblogs.com/yangp/p/8506264.html; a pip mirror, e.g. pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple/ --upgrade tensorflow (Google for other mirrors); pip through a proxy: pip install xxx --proxy http://xxx.xxx.xxxx:xxxx.
- Docker install: used to deploy the services, mainly per https://blog.youkuaiyun.com/xykenny/article/details/90730369. Whatever error comes up, Google it and deal with it case by case. As I recall, pulling the tf-server image wasn't much trouble; the official docs or any tutorial will do, it's just slow to download. Supposedly Alibaba Cloud and others offer registry mirrors to speed it up; we didn't need that this time.
- Face detection tools, small utilities inside the models: (1) face_recognition. A bit slow; we started with it and later dropped it. Install dlib first: download the package that contains setup.py and run python setup.py install; install face_recognition-1.2.3 and face_recognition_models-master (the face model files) the same way. This tool cost us plenty of pitfalls early on; many of the install routes online simply fail, and this one has proven the most reliable so far — we've set up quite a few machines with it. (2) seetaFace6Python, from the Chinese Academy of Sciences I believe; good results and very fast: https://github.com/tensorflower/seetaFace6Python. Remember to set the environment variables, and also add their interface directory to your code: sys.path.append("/home/xx/xxx/python_tools/seetaface").
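Not in the original setup notes, but a quick sanity check I'd suggest once the stack is installed: run this inside the conda env to confirm TensorFlow actually sees the GPU (the exact call depends on the TF version).
import tensorflow as tf
print(tf.__version__)
# TF 1.x-era API; on TF 2.x, tf.config.list_physical_devices('GPU') is the cleaner check
print(tf.test.is_gpu_available())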
2. Model Training: ResNet50 + Decision Tree
Everything below runs on Python 3.7.
1. Face-cropping code. Faces are normalized to a 350 * 350 size; you need to design the resize with padding here, otherwise faces get squashed or stretched:
import sys
sys.path.append("/home/xx/xxx/python_tools/seetaface")
import numpy as np  # needed for np.uint8 / np.asarray below
from PIL import Image
import glob as gb
from seetaface.api import *
import cv2

def get_model():
    init_mask = FACE_DETECT
    seetaFace = SeetaFace(init_mask)
    seetaFace.SetProperty(DetectProperty.PROPERTY_MIN_FACE_SIZE, 80)  # ignore faces smaller than 80px
    seetaFace.SetProperty(DetectProperty.PROPERTY_THRESHOLD, 0.7)     # detection confidence threshold
    return seetaFace

seetaFace = get_model()

def get_face(img):
    # img: RGB pixel array; returns a 350*350 white-padded face crop and the face-area ratio
    img = np.uint8(img)
    img_gbr = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)  # SeetaFace expects BGR
    detect_result = seetaFace.Detect(img_gbr)
    res_img = []
    fill_color = (255, 255, 255)  # white padding
    face_ratio = 0.0
    if detect_result.size != 0:
        face = detect_result.data[0].pos
        face_img = img[face.y:face.y + face.height, face.x:face.x + face.width]
        pil_img = Image.fromarray(face_img)
        x, y = pil_img.size
        # scale the longer side to 350, keeping the aspect ratio
        pil_img_resize = pil_img.resize((int(x * 350.0 / max(x, y)), int(y * 350.0 / max(x, y))))
        x, y = pil_img_resize.size
        # paste the resized face onto a white 350*350 canvas, centered
        new_im = Image.new('RGB', (350, 350), fill_color)
        new_im.paste(pil_img_resize, (int((350 - x) / 2), int((350 - y) / 2)))
        res_img = new_im.resize((350, 350))
        face_ratio = face.height * face.width / (img.shape[0] * img.shape[1])
    return np.asarray(res_img), face_ratio  # res_img.save(output_dir + xx + '_face' + '.jpg')
    # return res_img, face_ratio
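A minimal usage sketch, assuming a local test image (the path is a placeholder):
# hypothetical check of the cropping code; /path/to/sample.jpg is a placeholder
from keras.preprocessing.image import load_img
import numpy as np

img_raw = load_img("/path/to/sample.jpg")      # PIL image, RGB
face_arr, ratio = get_face(np.asarray(img_raw))
if len(face_arr) != 0:
    print(face_arr.shape, ratio)               # expect (350, 350, 3) and the face-area ratio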
2. ETL code for building the dataset. Recording it so I can "borrow" from it later; it includes a utility for randomly splitting files between folders, usable for the train/test split (doge again). Nothing special here, except the image type keeps flipping between cv2 format and numpy array format, so write it carefully. Which reminds me: a type-declared language like Java is actually handier to debug.
import os
import random
import shutil
import numpy as np
from keras.preprocessing.image import load_img  # used below; also imported in the training script

def moveFile(file_dir, save_dir):
    path_dir = os.listdir(file_dir)
    filenumber = len(path_dir)
    rate = 0.7  # fraction of images to pick, e.g. 0.1 means 10 out of 100
    picknumber = int(filenumber * rate)  # number of images to take from the folder
    sample = random.sample(path_dir, picknumber)  # pick picknumber images at random
    # print(sample)
    for name in sample:
        shutil.move(file_dir + '/' + name, save_dir + '/' + name)

def get_train_data_face(dir_arr, label_str, img_height=350, img_width=350, channels=3):
    train_arr = []
    face_ratio = []
    base_path = "/xxx_data/"
    for dir_t in dir_arr:
        for fn in os.listdir(base_path + dir_t + "/" + label_str):
            try:
                img_raw = load_img(base_path + dir_t + "/" + label_str + "/" + fn)
                # img = img.resize((350, 350))
                img, ratio = get_face(img_raw)  # get_face from the cropping code above
                # x = img_to_array(img).reshape(img_height, img_width, channels)
                if len(img) != 0:
                    x = img.astype('float32') / 255.
                    train_arr.append(x)
                    face_ratio.append(ratio)
            except:
                pass  # skip unreadable images
    count = len(train_arr)
    x_train = np.asarray(train_arr, dtype=np.float32)
    y_train = np.asarray([float(label_str)] * count, dtype=np.float32)
    face_ratio = np.asarray(face_ratio, dtype=np.float32)
    return x_train, y_train, face_ratio

dir_arr = ["train_data"]
x_train_label_1, y_train_label_1, ratio_train_label_1 = get_train_data_face(dir_arr, "1")
x_train_label_2, y_train_label_2, ratio_train_label_2 = get_train_data_face(dir_arr, "2")
x_train_label_3, y_train_label_3, ratio_train_label_3 = get_train_data_face(dir_arr, "3")
x_train = np.concatenate((x_train_label_1, x_train_label_2, x_train_label_3), axis=0)
y_train = np.concatenate((y_train_label_1, y_train_label_2, y_train_label_3), axis=0)
ratio_train = np.concatenate((ratio_train_label_1, ratio_train_label_2, ratio_train_label_3), axis=0)
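For completeness, this is roughly how I'd call moveFile to carve out a split; directory names are placeholders. Note it moves rate = 70% of the files into save_dir, so whatever stays behind in file_dir becomes the other split.
# hypothetical split: move 70% of label "1" into the train folder, keep the rest for validation
os.makedirs("/xxx_data/train_data/1", exist_ok=True)
moveFile("/xxx_data/all_data/1", "/xxx_data/train_data/1")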
3. Model training code. Don't ask; if you ask, the answer is "someone else's wheel"…
import os
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.applications import ResNet50
from keras.preprocessing.image import array_to_img, img_to_array, load_img
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from keras.models import load_model
from keras.optimizers import Adam
import cv2

# image parameters
img_width, img_height, channels = 350, 350, 3
input_shape = (img_width, img_height, channels)

resnet = ResNet50(include_top=False, pooling='avg', input_shape=input_shape)
model = Sequential()
model.add(resnet)
model.add(Dense(1))  # single regression output: the score
model.summary()

# from keras.utils.training_utils import multi_gpu_model  # version issue, no multi-GPU training; one card was enough here

filepath = "/xxx/xxx/mse-{epoch:02d}-{loss:.4f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
reduce_learning_rate = ReduceLROnPlateau(monitor='loss',
                                         factor=0.1,
                                         patience=1,
                                         cooldown=2,
                                         min_lr=0.00003,
                                         verbose=1)
early_stop = EarlyStopping(monitor="loss", mode="min", patience=3)
callback_list = [checkpoint, reduce_learning_rate, early_stop]

model.compile(loss='mse', optimizer=Adam(lr=0.0003))
history = model.fit(x=x_train,
                    y=y_train,
                    batch_size=16,
                    epochs=30,
                    validation_data=(x_valid, y_valid),  # x_valid/y_valid built the same way as x_train
                    callbacks=callback_list)

import tensorflow as tf
best_model = load_model('/xxx/lr_model/mse-xxx.h5')  # the best checkpoint saved above
y_predict_1 = best_model.predict(x_valid_label_1)
# export the best checkpoint as a TF SavedModel so tf-server can load it
best_model.save('/home/work/xxx/model_to_save/', save_format='tf')
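Before handing the exported directory to tf-server, a quick sketch (assuming TF 2.x; the path is the placeholder above) to confirm the serving signature, which the gRPC client will need later:
import tensorflow as tf
loaded = tf.saved_model.load('/home/work/xxx/model_to_save/')
print(list(loaded.signatures.keys()))  # usually ['serving_default']; this is the signature_name used by the client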
4. Model prediction. Two things to watch here: first, make sure the input dimensions match what training used; second, Keras apparently needs no explicit train/predict mode switch, you just call predict; as I recall, with torch you have to call model.eval(), otherwise things like dropout stay in train mode and throw off the predictions.
def get_pred(img_path, model_1, model_2):
    # x_tensor = np.empty((1, 350, 350, 3), dtype=np.float32)
    img_raw = load_img(img_path)
    img = img_raw.resize((350, 350))
    x = img_to_array(img).reshape(350, 350, 3)
    x = x.astype('float32') / 255.
    x = x[np.newaxis, :]  # add the batch dimension, matching training
    face_arr, ratio = get_face(img_raw)
    val_1 = [[0.]]  # default face score when no face is found
    if len(face_arr) != 0:
        x_tensor = face_arr.astype('float32') / 255.
        x_tensor = x_tensor[np.newaxis, :]
        val_1 = model_1.predict(x_tensor)
    # x_tensor[0] = x
    val_2 = model_2.predict(x)
    return val_1[0][0], val_2[0][0], ratio  # , label
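A hypothetical call, with checkpoint names made up for illustration:
# model_1 scores the face crop, model_2 scores the whole picture; paths are placeholders
model_face = load_model('/xxx/lr_model/mse-face-xx.h5')
model_pic = load_model('/xxx/lr_model/mse-pic-xx.h5')
val_1, val_2, ratio = get_pred('/path/to/sample.jpg', model_face, model_pic)
print(val_1, val_2, ratio)  # the three values the decision tree consumes next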
5. Printing the decision tree to extract rules. Notice that prediction above returns three values; the first two are trained to fit the label, and these three scores are then fit against the label again to extract the rules. Why not train it all end to end? Mainly interpretability: each of the three values has a meaning of its own. Besides, splitting it up doesn't necessarily lose much accuracy, and in engineering practice nobody is really that sensitive to a sliver of accuracy.
from sklearn.externals.six import StringIO  # on newer sklearn use: from six import StringIO
import pydotplus
import pandas as pd
from sklearn import tree
from sklearn.tree import _tree
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

test_data_pd = pd.read_csv("test_data.csv")
x1 = test_data_pd[(test_data_pd["label"] == 1)]
x2 = test_data_pd[(test_data_pd["label"] == 2)]
x3 = test_data_pd[(test_data_pd["label"] == 3)]
x_train = pd.concat([x1, x2, x3]).iloc[:, 0:3]  # columns: val_1, val_2, ratio
x_train['sum'] = x_train['val_1'] + x_train['val_2']  # derived features for the tree
x_train['sub'] = x_train['val_1'] - x_train['val_2']
X = x_train
y = pd.concat([x1, x2, x3]).iloc[:, 3]  # the label column

# dtc = DTC(criterion='entropy', max_depth=3, min_samples_leaf=50)  # entropy-based variant

# never got this one to work
def draw_tree(model, name):
    dot_data = StringIO()
    tree.export_graphviz(model, out_file=dot_data)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf(name + ".pdf")

# this is the one that mattered: print the fitted tree as Python if/else rules
def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print('feature_name:', feature_name)
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        # print('tree_.feature:', tree_.feature)
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # print('tree_.feature[node]:', tree_.feature[node])
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))

    recurse(0, 1)

estimator = DecisionTreeClassifier(max_depth=3)
estimator.fit(X, y)
tree_to_code(estimator, ["val_1", "val_2", "ratio", "sum", "sub"])
draw_tree(estimator, 'learn')
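The printed output then gets hand-translated into the plain if/else rule used by the web service. A sketch of what that ends up looking like; the thresholds below are made up for illustration, the real ones come from tree_to_code's printout:
# illustration only: placeholder thresholds, not the trained values
def rule(val_1, val_2, ratio, sum_, sub):
    if sum_ <= 1.5:
        return 1
    else:
        if sub <= 0.3:
            return 2
        else:
            return 3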
3. Service Deployment: docker, tf-server, tornado
This was my first time playing with any of this, so these are beginner moves; go easy on me.
1. Starting tf-server. Note the mapping between host ports and docker ports; ports 8500 and 8501 appear to be baked into the tf-server image. Google the various flags as needed. Here, multiple machines each ran their own docker serving process. On startup, the service picks the latest model file under model_tf and model_tf_2. tf-server reportedly also makes hot model updates easy, though with multiple machines it didn't matter this time.
device_num=x      # which GPU to expose to the container (NVIDIA_VISIBLE_DEVICES)
rpc_port=xxxx     # host port mapped to the container's 8500 (gRPC)
rest_port=xxxx    # host port mapped to the container's 8501 (REST)
model_path_src=$(dirname "$PWD")
docker run --runtime=nvidia --restart always -p $rpc_port:8500 -p $rest_port:8501 \
--mount type=bind,source=$model_path_src/model_tf,target=/models/face_model_1 \
--mount type=bind,source=$model_path_src/model_tf_2,target=/models/pic_model_1 \
--mount type=bind,source=$model_path_src/config_file/multi_model.config,target=/models/model.config \
--mount type=bind,source=$model_path_src/config_file/monitor.config,target=/monitor/monitor.config \
-e NVIDIA_VISIBLE_DEVICES=$device_num -t tensorflow/serving:latest-gpu --model_config_file=/models/model.config --monitoring_config_file=/monitor/monitor.config &
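Once the container is up, a quick way to confirm the models loaded is tf-server's REST status endpoint; a sketch, assuming the config file names one of the models face_model_1 and rest_port is the port mapped above:
import requests
rest_port = "xxxx"  # same value as $rest_port in the script above
r = requests.get("http://localhost:%s/v1/models/face_model_1" % rest_port)
print(r.json())  # the latest version should report state AVAILABLE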
2. Web service. Mine here is just a demo: our side provided the utility functions, and a colleague did the real deployment; for the record, load balancing was done with Nginx upstream. This tornado service is just me messing around.
import json
import grpc
import face_recognition  # only needed by the dropped variant further down
import numpy as np
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import tensorflow as tf
from skimage import io
import tornado
import tornado.ioloop
import tornado.web
from keras.preprocessing.image import array_to_img, img_to_array, load_img

# step 1: fetch the image and preprocess it
def fetch_urls(url):
    # download the image from the url; returns a numpy.ndarray of uint8 pixels
    image = io.imread(url)
    return image
import sys
sys.path.append("/xxx/python_tools/seetaface")
from PIL import Image
import glob as gb
from seetaface.api import *
import cv2

def get_model():
    init_mask = FACE_DETECT
    seetaFace = SeetaFace(init_mask)
    seetaFace.SetProperty(DetectProperty.PROPERTY_MIN_FACE_SIZE, 80)
    seetaFace.SetProperty(DetectProperty.PROPERTY_THRESHOLD, 0.7)
    return seetaFace

seetaFace = get_model()

def get_face(img):
    # same cropping logic as in the training section; img is an np.array of uint8, RGB
    img_gbr = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    detect_result = seetaFace.Detect(img_gbr)
    res_img = []
    fill_color = (255, 255, 255)
    face_ratio = 0.0
    if detect_result.size != 0:
        face = detect_result.data[0].pos
        face_img = img[face.y:face.y + face.height, face.x:face.x + face.width]
        pil_img = Image.fromarray(face_img)  # PIL image
        x, y = pil_img.size
        pil_img_resize = pil_img.resize((int(x * 350.0 / max(x, y)), int(y * 350.0 / max(x, y))))
        x, y = pil_img_resize.size
        new_im = Image.new('RGB', (350, 350), fill_color)
        new_im.paste(pil_img_resize, (int((350 - x) / 2), int((350 - y) / 2)))
        res_img = new_im.resize((350, 350))
        face_ratio = face.height * face.width / (img.shape[0] * img.shape[1])
    return np.asarray(res_img), face_ratio  # res.save(output_dir + uid + '_face' + '.jpg')  # uint8
    # return res_img, face_ratio
# too slow, dropped
# def get_face(image_raw):
#     # crop the face and return it at 350*350*3; nothing if no face is found
#     face_locations = face_recognition.face_locations(image_raw)
#     if len(face_locations) != 0:
#         top, right, bottom, left = face_locations[0]
#         face_img = image_raw[top:bottom, left:right]
#         pil_img = Image.fromarray(face_img)
#         res = pil_img.resize((350, 350))  # res.save(output_dir + uid + '_face' + '.jpg')
#         return np.asarray(res)

def preprocessing(img_raw):
    # take the pixel array, crop the face, and build the tensors that go into tf-server
    pil_img = Image.fromarray(img_raw)
    img = pil_img.resize((350, 350))
    x = img_to_array(img).reshape(350, 350, 3)
    x = x.astype('float32') / 255.
    x = x[np.newaxis, :]
    face_arr = []
    ratio = 0.0
    face_arr, ratio = get_face(img_raw)
    if len(face_arr) != 0:
        x_tensor = face_arr.astype('float32') / 255.
        x_tensor = x_tensor[np.newaxis, :]
        return x, x_tensor, ratio
    else:
        return x, face_arr, ratio
# step 2: send the image tensor to tf-server and get the model output back
def inference(image_t, model_str):
    channel = grpc.insecure_channel('machine-ip:tf-server-grpc-port')  # placeholder address
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_str  # model name
    request.model_spec.signature_name = "xx"  # signature_name / inputs / outputs can be checked via: saved_model_cli show --all --dir /x/xx/model_save/00001
    request.inputs['xxx'].CopyFrom(tf.make_tensor_proto(image_t))  # image_t, e.g. x_train[0:1]
    # print(time.time())
    response = stub.Predict(request, 10.0)  # 10 secs timeout
    # print(time.time())
    return response
# step 3: get the image's classification
def postprocessing(img):
    pic, face, ratio = preprocessing(img)
    res = xx  # placeholder default
    val_face = 0.0
    val_pic = inference(pic, "xxx").outputs['xxx'].float_val[0]  # score the whole picture
    if len(face) != 0:
        val_face = inference(face, "xxx").outputs['xxx'].float_val[0]  # score the face
    # from here on, the rules extracted from the decision tree
    ...
    return res
url = "https://xxx"
url_2 = "xxx"
img_raw = fetch_urls(url_2)  # image download function
result = postprocessing(img_raw)  # postprocess into the final result
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        """Handle GET requests."""
        text = self.get_argument("imageURL")
        predict = self.predic_tf(text)
        data = {
            'imageURL': text,
            'predict': predict
        }
        self.write(json.dumps({'data': data}))

    def predic_tf(self, text):
        """Call tf-server."""
        img_raw = fetch_urls(text)  # download the image url (meant to be async; not actually async here...)
        result = postprocessing(img_raw)  # postprocess into the final result
        return result

def make_app():
    return tornado.web.Application([
        (r"/predict", MainHandler)
    ])

if __name__ == '__main__':
    app = make_app()
    app.listen(8131)  # port the tornado server listens on
    tornado.ioloop.IOLoop.current().start()

# curl ip:8131/predict?imageURL=https://xxxxxx
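And the same call as the curl above, from Python, for anyone scripting against it (ip and image url are placeholders):
import requests
resp = requests.get("http://ip:8131/predict", params={"imageURL": "https://xxxxxx"})
print(resp.json())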
4. Closing
Past launches were all ETL jobs producing data tables; whether spark called the model or we used spark pipe, it was all offline. This time was a proper first taste of serving. Still green, but I learned something. Not bad…
See you next time.