My Machine Learning Part (Notes from a 2020 Project)
0. Introduction
Flashy models are only a means; solving the problem is the goal.
I worked through the National Day holiday on an urgent project, and this post is mainly a record of one real experience using machine learning to solve an actual problem.
Main parts:
(1) Environment setup: mostly handled by a colleague. I had installed Ubuntu at home myself, but the company's CentOS boxes are a different story, and time was tight, so we split the work: he focused on the environment while I built the dataset, trained the models, and stood up part of the serving side for production. Getting the environment right was a long slog of trial and error: GPU driver, CUDA, and cuDNN on CentOS, plus installing Docker, tf-server, and so on.
(2) Model training: two parts. The first is two image-scoring models, built with Keras and ResNet50; the second uses a decision tree to write rules, which involves a small trick for printing out the tree.
(3) Service deployment: two layers. The inner one is tf-server, which receives the image data; the outer one is a web service that takes an image-URL request, does the preprocessing, parses tf-server's response, and produces output according to the rules extracted from the decision tree.
1. Environment Setup: CentOS
Disclaimer: no credit-grabbing here, this is purely a record; most of this particular pile of pain was eaten by my teammate @畅 (doge).
- GPU driver and CUDA: needed for model train and predict. We mainly followed https://www.cnblogs.com/yangsichao/p/8026234.html (my teammate skipped step 4). For CUDA, grab a recent release from the official site, 10.1 or later, then install cuDNN in order, and install tensorflow_gpu via conda. One spot in the driver install may need this reference: https://blog.youkuaiyun.com/carey2017/article/details/82799850. (A quick sanity-check sketch follows after this list.)
- Conda commands, for the record: conda create -n tf python=3.7, conda activate tf, conda deactivate. For the tf-server client side, it's just pip install tensorflow-serving-api.
- Package sources, to make installs less painful: yum source setup per https://www.cnblogs.com/yangp/p/8506264.html; a pip mirror, e.g. pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple/ --upgrade tensorflow (Google for other mirrors); pip through a proxy: pip install xxx --proxy http://xxx.xxx.xxxx:xxxx.
- Docker install: used to deploy the services, mainly per https://blog.youkuaiyun.com/xykenny/article/details/90730369. Whatever error comes up, Google it and deal with it case by case. As I recall, pulling the tf-server image wasn't much trouble; the official docs or any tutorial will do, it's just slow to download. Supposedly Alibaba Cloud and others offer registry mirrors to speed it up; we didn't need that this time.
- Face detection tools, small utilities inside the models: (1) face_recognition. A bit slow; we started with it and later dropped it. Install dlib first: download the package that contains setup.py and run python setup.py install; install face_recognition-1.2.3 and face_recognition_models-master (the face model files) the same way. This tool cost us plenty of pitfalls early on; many of the install routes online simply fail, and this one has proven the most reliable so far — we've set up quite a few machines with it. (2) seetaFace6Python, from the Chinese Academy of Sciences I believe; good results and very fast: https://github.com/tensorflower/seetaFace6Python. Remember to set the environment variables, and also add their interface directory to your code: sys.path.append("/home/xx/xxx/python_tools/seetaface").
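Not in the original setup notes, but a quick sanity check I'd suggest once the stack is installed: run this inside the conda env to confirm TensorFlow actually sees the GPU (the exact call depends on the TF version).
import tensorflow as tf
print(tf.__version__)
# TF 1.x-era API; on TF 2.x, tf.config.list_physical_devices('GPU') is the cleaner check
print(tf.test.is_gpu_available())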
2. Model Training: ResNet50 + Decision Tree
Everything below runs on Python 3.7.
1. Face-cropping code. Faces are normalized to a 350 * 350 size; you need to design the resize with padding here, otherwise faces get squashed or stretched:
import sys
sys.path.append("/home/xx/xxx/python_tools/seetaface")
import numpy as np  # needed for np.uint8 / np.asarray below
from PIL import Image
import glob as gb
from seetaface.api import *
import cv2

def get_model():
    init_mask = FACE_DETECT
    seetaFace = SeetaFace(init_mask)
    seetaFace.SetProperty(DetectProperty.PROPERTY_MIN_FACE_SIZE, 80)  # ignore faces smaller than 80px
    seetaFace.SetProperty(DetectProperty.PROPERTY_THRESHOLD, 0.7)     # detection confidence threshold
    return seetaFace

seetaFace = get_model()

def get_face(img):
    # img: RGB pixel array; returns a 350*350 white-padded face crop and the face-area ratio
    img = np.uint8(img)
    img_gbr = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)  # SeetaFace expects BGR
    detect_result = seetaFace.Detect(img_gbr)
    res_img = []
    fill_color = (255, 255, 255)  # white padding
    face_ratio = 0.0
    if detect_result.size != 0:
        face = detect_result.data[0].pos
        face_img = img[face.y:face.y + face.height, face.x:face.x + face.width]
        pil_img = Image.fromarray(face_img)
        x, y = pil_img.size
        # scale the longer side to 350, keeping the aspect ratio
        pil_img_resize = pil_img.resize((int(x * 350.0 / max(x, y)), int(y * 350.0 / max(x, y))))
        x, y = pil_img_resize.size
        # paste the resized face onto a white 350*350 canvas, centered
        new_im = Image.new('RGB', (350, 350), fill_color)
        new_im.paste(pil_img_resize, (int((350 - x) / 2), int((350 - y) / 2)))
        res_img = new_im.resize((350, 350))
        face_ratio = face.height * face.width / (img.shape[0] * img.shape[1])
    return np.asarray(res_img), face_ratio  # res_img.save(output_dir + xx + '_face' + '.jpg')
    # return res_img, face_ratio
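A minimal usage sketch, assuming a local test image (the path is a placeholder):
# hypothetical check of the cropping code; /path/to/sample.jpg is a placeholder
from keras.preprocessing.image import load_img
import numpy as np

img_raw = load_img("/path/to/sample.jpg")      # PIL image, RGB
face_arr, ratio = get_face(np.asarray(img_raw))
if len(face_arr) != 0:
    print(face_arr.shape, ratio)               # expect (350, 350, 3) and the face-area ratio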
2. ETL code for building the dataset. Recording it so I can "borrow" from it later; it includes a utility for randomly splitting files between folders, usable for the train/test split (doge again). Nothing special here, except the image type keeps flipping between cv2 format and numpy array format, so write it carefully. Which reminds me: a type-declared language like Java is actually handier to debug.
import os
import random
import shutil
import numpy as np
from keras.preprocessing.image import load_img  # used below; also imported in the training script

def moveFile(file_dir, save_dir):
    path_dir = os.listdir(file_dir)
    filenumber = len(path_dir)
    rate = 0.7  # fraction of images to pick, e.g. 0.1 means 10 out of 100
    picknumber = int(filenumber * rate)  # number of images to take from the folder
    sample = random.sample(path_dir, picknumber)  # pick picknumber images at random
    # print(sample)
    for name in sample:
        shutil.move(file_dir + '/' + name, save_dir + '/' + name)

def get_train_data_face(dir_arr, label_str, img_height=350, img_width=350, channels=3):
    train_arr = []
    face_ratio = []
    base_path = "/xxx_data/"
    for dir_t in dir_arr:
        for fn in os.listdir(base_path + dir_t + "/" + label_str):
            try:
                img_raw = load_img(base_path + dir_t + "/" + label_str + "/" + fn)
                # img = img.resize((350, 350))
                img, ratio = get_face(img_raw)  # get_face from the cropping code above
                # x = img_to_array(img).reshape(img_height, img_width, channels)
                if len(img) != 0:
                    x = img.astype('float32') / 255.
                    train_arr.append(x)
                    face_ratio.append(ratio)
            except:
                pass  # skip unreadable images
    count = len(train_arr)
    x_train = np.asarray(train_arr, dtype=np.float32)
    y_train = np.asarray([float(label_str)] * count, dtype=np.float32)
    face_ratio = np.asarray(face_ratio, dtype=np.float32)
    return x_train, y_train, face_ratio

dir_arr = ["train_data"]
x_train_label_1, y_train_label_1, ratio_train_label_1 = get_train_data_face(dir_arr, "1")
x_train_label_2, y_train_label_2, ratio_train_label_2 = get_train_data_face(dir_arr, "2")
x_train_label_3, y_train_label_3, ratio_train_label_3 = get_train_data_face(dir_arr, "3")
x_train = np.concatenate((x_train_label_1, x_train_label_2, x_train_label_3), axis=0)
y_train = np.concatenate((y_train_label_1, y_train_label_2, y_train_label_3), axis=0)
ratio_train = np.concatenate((ratio_train_label_1, ratio_train_label_2, ratio_train_label_3), axis=0)
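For completeness, this is roughly how I'd call moveFile to carve out a split; directory names are placeholders. Note it moves rate = 70% of the files into save_dir, so whatever stays behind in file_dir becomes the other split.
# hypothetical split: move 70% of label "1" into the train folder, keep the rest for validation
os.makedirs("/xxx_data/train_data/1", exist_ok=True)
moveFile("/xxx_data/all_data/1", "/xxx_data/train_data/1")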
3. Model training code. Don't ask; if you ask, the answer is "someone else's wheel"…
import os
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.applications import ResNet50
from keras.preprocessing.image import array_to_img, img_to_array, load_img
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping
from keras.models import load_model
from keras.optimizers import Adam
import cv2

# image parameters
img_width, img_height, channels = 350, 350, 3
input_shape = (img_width, img_height, channels)

resnet = ResNet50(include_top=False, pooling='avg', input_shape=input_shape)
model = Sequential()
model.add(resnet)
model.add(Dense(1))  # single regression output: the score
model.summary()

# from keras.utils.training_utils import multi_gpu_model  # version issue, no multi-GPU training; one card was enough here

filepath = "/xxx/xxx/mse-{epoch:02d}-{loss:.4f}.h5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
reduce_learning_rate = ReduceLROnPlateau(monitor='loss',
                                         factor=0.1,
                                         patience=1,
                                         cooldown=2,
                                         min_lr=0.00003,
                                         verbose=1)
early_stop = EarlyStopping(monitor="loss", mode="min", patience=3)
callback_list = [checkpoint, reduce_learning_rate, early_stop]

model.compile(loss='mse', optimizer=Adam(lr=0.0003))
history = model.fit(x=x_train,
                    y=y_train,
                    batch_size=16,
                    epochs=30,
                    validation_data=(x_valid, y_valid),  # x_valid/y_valid built the same way as x_train
                    callbacks=callback_list)

import tensorflow as tf
best_model = load_model('/xxx/lr_model/mse-xxx.h5')  # the best checkpoint saved above
y_predict_1 = best_model.predict(x_valid_label_1)
# export the best checkpoint as a TF SavedModel so tf-server can load it
best_model.save('/home/work/xxx/model_to_save/', save_format='tf')
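Before handing the exported directory to tf-server, a quick sketch (assuming TF 2.x; the path is the placeholder above) to confirm the serving signature, which the gRPC client will need later:
import tensorflow as tf
loaded = tf.saved_model.load('/home/work/xxx/model_to_save/')
print(list(loaded.signatures.keys()))  # usually ['serving_default']; this is the signature_name used by the client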
4. Model prediction. Two things to watch here: first, make sure the input dimensions match what training used; second, Keras apparently needs no explicit train/predict mode switch, you just call predict; as I recall, with torch you have to call model.eval(), otherwise things like dropout stay in train mode and throw off the predictions.
def get_pred(img_path, model_1, model_2):
    # x_tensor = np.empty((1, 350, 350, 3), dtype=np.float32)
    img_raw = load_img(img_path)
    img = img_raw.resize((350, 350))
    x = img_to_array(img).reshape(350, 350, 3)
    x = x.astype('float32') / 255.
    x = x[np.newaxis, :]  # add the batch dimension, matching training
    face_arr, ratio = get_face(img_raw)
    val_1 = [[0.]]  # default face score when no face is found
    if len(face_arr) != 0:
        x_tensor = face_arr.astype('float32') / 255.
        x_tensor = x_tensor[np.newaxis, :]
        val_1 = model_1.predict(x_tensor)
    # x_tensor[0] = x
    val_2 = model_2.predict(x)
    return val_1[0][0], val_2[0][0], ratio  # , label
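A hypothetical call, with checkpoint names made up for illustration:
# model_1 scores the face crop, model_2 scores the whole picture; paths are placeholders
model_face = load_model('/xxx/lr_model/mse-face-xx.h5')
model_pic = load_model('/xxx/lr_model/mse-pic-xx.h5')
val_1, val_2, ratio = get_pred('/path/to/sample.jpg', model_face, model_pic)
print(val_1, val_2, ratio)  # the three values the decision tree consumes next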
5. Printing the decision tree to extract rules. Notice that prediction above returns three values; the first two are trained to fit the label, and these three scores are then fit against the label again to extract the rules. Why not train it all end to end? Mainly interpretability: each of the three values has a meaning of its own. Besides, splitting it up doesn't necessarily lose much accuracy, and in engineering practice nobody is really that sensitive to a sliver of accuracy.
from sklearn.externals.six import StringIO  # on newer sklearn use: from six import StringIO
import pydotplus
import pandas as pd
from sklearn import tree
from sklearn.tree import _tree
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

test_data_pd = pd.read_csv("test_data.csv")
x1 = test_data_pd[(test_data_pd["label"] == 1)]
x2 = test_data_pd[(test_data_pd["label"] == 2)]
x3 = test_data_pd[(test_data_pd["label"] == 3)]
x_train = pd.concat([x1, x2, x3]).iloc[:, 0:3]  # columns: val_1, val_2, ratio
x_train['sum'] = x_train['val_1'] + x_train['val_2']  # derived features for the tree
x_train['sub'] = x_train['val_1'] - x_train['val_2']
X = x_train
y = pd.concat([x1, x2, x3]).iloc[:, 3]  # the label column

# dtc = DTC(criterion='entropy', max_depth=3, min_samples_leaf=50)  # entropy-based variant

# never got this one to work
def draw_tree(model, name):
    dot_data = StringIO()
    tree.export_graphviz(model, out_file=dot_data)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf(name + ".pdf")

# this is the one that mattered: print the fitted tree as Python if/else rules
def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print('feature_name:', feature_name)
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        # print('tree_.feature:', tree_.feature)
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            # print('tree_.feature[node]:', tree_.feature[node])
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))

    recurse(0, 1)

estimator = DecisionTreeClassifier(max_depth=3)
estimator.fit(X, y)
tree_to_code(estimator, ["val_1", "val_2", "ratio", "sum", "sub"])
draw_tree(estimator, 'learn')
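The printed output then gets hand-translated into the plain if/else rule used by the web service. A sketch of what that ends up looking like; the thresholds below are made up for illustration, the real ones come from tree_to_code's printout:
# illustration only: placeholder thresholds, not the trained values
def rule(val_1, val_2, ratio, sum_, sub):
    if sum_ <= 1.5:
        return 1
    else:
        if sub <= 0.3:
            return 2
        else:
            return 3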
3. Service Deployment: docker, tf-server, tornado
This was my first time playing with any of this, so these are beginner moves; go easy on me.
1. Starting tf-server. Note the mapping between host ports and docker ports; ports 8500 and 8501 appear to be baked into the tf-server image. Google the various flags as needed. Here, multiple machines each ran their own docker serving process. On startup, the service picks the latest model file under model_tf and model_tf_2. tf-server reportedly also makes hot model updates easy, though with multiple machines it didn't matter this time.
device_num=x      # which GPU to expose to the container (NVIDIA_VISIBLE_DEVICES)
rpc_port=xxxx     # host port mapped to the container's 8500 (gRPC)
rest_port=xxxx    # host port mapped to the container's 8501 (REST)
model_path_src=$(dirname "$PWD")
docker run --runtime=nvidia --restart always -p $rpc_port:8500 -p $rest_port:8501 \
--mount type=bind,source=$model_path_src/model_tf,target=/models/face_model_1 \
--mount type=bind,source=$model_path_src/model_tf_2,target=/models/pic_model_1 \
--mount type=bind,source=$model_path_src/config_file/multi_model.config,target=/models/model.config \
--mount type=bind,source=$model_path_src/config_file/monitor.config,target=/monitor/monitor.config \
-e NVIDIA_VISIBLE_DEVICES=$device_num -t tensorflow/serving:latest-gpu --model_config_file=/models/model.config --monitoring_config_file=/monitor/monitor.config &
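Once the container is up, a quick way to confirm the models loaded is tf-server's REST status endpoint; a sketch, assuming the config file names one of the models face_model_1 and rest_port is the port mapped above:
import requests
rest_port = "xxxx"  # same value as $rest_port in the script above
r = requests.get("http://localhost:%s/v1/models/face_model_1" % rest_port)
print(r.json())  # the latest version should report state AVAILABLE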
2. Web service. Mine here is just a demo: our side provided the utility functions, and a colleague did the real deployment; for the record, load balancing was done with Nginx upstream. This tornado service is just me messing around.
import json
import grpc
import face_recognition  # only needed by the dropped variant further down
import numpy as np
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import tensorflow as tf
from skimage import io
import tornado
import tornado.ioloop
import tornado.web
from keras.preprocessing.image import array_to_img, img_to_array, load_img

# step 1: fetch the image and preprocess it
def fetch_urls(url):
    # download the image from the url; returns a numpy.ndarray of uint8 pixels
    image = io.imread(url)
    return image
import sys
sys.path.append("/xxx/python_tools/seetaface")
from PIL import Image
import glob as gb
from seetaface.api import *
import cv2

def get_model():
    init_mask = FACE_DETECT
    seetaFace = SeetaFace(init_mask)
    seetaFace.SetProperty(DetectProperty.PROPERTY_MIN_FACE_SIZE, 80)
    seetaFace.SetProperty(DetectProperty.PROPERTY_THRESHOLD, 0.7)
    return seetaFace

seetaFace = get_model()

def get_face(img):
    # same cropping logic as in the training section; img is an np.array of uint8, RGB
    img_gbr = cv2.cvtColor(img, cv2.COLOR_RGB2BGR)
    detect_result = seetaFace.Detect(img_gbr)
    res_img = []
    fill_color = (255, 255, 255)
    face_ratio = 0.0
    if detect_result.size != 0:
        face = detect_result.data[0].pos
        face_img = img[face.y:face.y + face.height, face.x:face.x + face.width]
        pil_img = Image.fromarray(face_img)  # PIL image
        x, y = pil_img.size
        pil_img_resize = pil_img.resize((int(x * 350.0 / max(x, y)), int(y * 350.0 / max(x, y))))
        x, y = pil_img_resize.size
        new_im = Image.new('RGB', (350, 350), fill_color)
        new_im.paste(pil_img_resize, (int((350 - x) / 2), int((350 - y) / 2)))
        res_img = new_im.resize((350, 350))
        face_ratio = face.height * face.width / (img.shape[0] * img.shape[1])
    return np.asarray(res_img), face_ratio  # res.save(output_dir + uid + '_face' + '.jpg')  # uint8
    # return res_img, face_ratio
# too slow, dropped
# def get_face(image_raw):
#     # crop the face and return it at 350*350*3; nothing if no face is found
#     face_locations = face_recognition.face_locations(image_raw)
#     if len(face_locations) != 0:
#         top, right, bottom, left = face_locations[0]
#         face_img = image_raw[top:bottom, left:right]
#         pil_img = Image.fromarray(face_img)
#         res = pil_img.resize((350, 350))  # res.save(output_dir + uid + '_face' + '.jpg')
#         return np.asarray(res)

def preprocessing(img_raw):
    # take the pixel array, crop the face, and build the tensors that go into tf-server
    pil_img = Image.fromarray(img_raw)
    img = pil_img.resize((350, 350))
    x = img_to_array(img).reshape(350, 350, 3)
    x = x.astype('float32') / 255.
    x = x[np.newaxis, :]
    face_arr = []
    ratio = 0.0
    face_arr, ratio = get_face(img_raw)
    if len(face_arr) != 0:
        x_tensor = face_arr.astype('float32') / 255.
        x_tensor = x_tensor[np.newaxis, :]
        return x, x_tensor, ratio
    else:
        return x, face_arr, ratio
# step 2: send the image tensor to tf-server and get the model output back
def inference(image_t, model_str):
    channel = grpc.insecure_channel('machine-ip:tf-server-grpc-port')  # placeholder address
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_str  # model name
    request.model_spec.signature_name = "xx"  # signature_name / inputs / outputs can be checked via: saved_model_cli show --all --dir /x/xx/model_save/00001
    request.inputs['xxx'].CopyFrom(tf.make_tensor_proto(image_t))  # image_t, e.g. x_train[0:1]
    # print(time.time())
    response = stub.Predict(request, 10.0)  # 10 secs timeout
    # print(time.time())
    return response
# step 3: get the image's classification
def postprocessing(img):
    pic, face, ratio = preprocessing(img)
    res = xx  # placeholder default
    val_face = 0.0
    val_pic = inference(pic, "xxx").outputs['xxx'].float_val[0]  # score the whole picture
    if len(face) != 0:
        val_face = inference(face, "xxx").outputs['xxx'].float_val[0]  # score the face
    # from here on, the rules extracted from the decision tree
    ...
    return res
url = "https://xxx"
url_2 = "xxx"
img_raw = fetch_urls(url_2)  # image download function
result = postprocessing(img_raw)  # postprocess into the final result
class MainHandler(tornado.web.RequestHandler):
    def get(self):
        """Handle GET requests."""
        text = self.get_argument("imageURL")
        predict = self.predic_tf(text)
        data = {
            'imageURL': text,
            'predict': predict
        }
        self.write(json.dumps({'data': data}))

    def predic_tf(self, text):
        """Call tf-server."""
        img_raw = fetch_urls(text)  # download the image url (meant to be async; not actually async here...)
        result = postprocessing(img_raw)  # postprocess into the final result
        return result

def make_app():
    return tornado.web.Application([
        (r"/predict", MainHandler)
    ])

if __name__ == '__main__':
    app = make_app()
    app.listen(8131)  # port the tornado server listens on
    tornado.ioloop.IOLoop.current().start()

# curl ip:8131/predict?imageURL=https://xxxxxx
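And the same call as the curl above, from Python, for anyone scripting against it (ip and image url are placeholders):
import requests
resp = requests.get("http://ip:8131/predict", params={"imageURL": "https://xxxxxx"})
print(resp.json())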
4. Closing
Past launches were all ETL jobs producing data tables; whether spark called the model or we used spark pipe, it was all offline. This time was a proper first taste of serving. Still green, but I learned something. Not bad…
See you next time.