Speech keyword classification (speech_commands): https://github.com/tensorflow/docs/blob/master/site/en/r1/tutorials/sequences/audio_recognition.md#running_the_model_in_an_android_app
Implementation
1. Inspect the model structure (excerpt)
2. The front-end audio processing runs wave_data -> DecodeWav -> AudioSpectrogram -> Mfcc, so we first need to implement this pipeline without depending on TF. For this part see https://github.com/huanghao128/tensorflow-mfcc: the author implemented TF's MFCC in C++, so I took it, adapted it slightly, and could use it directly. Many thanks to him.
3. Export the weights (note the data layout: TF is NHWC, TRT is NCHW)
# When I wrote this code, only God and I knew what it did.
# Now, only God knows.
# @File : pb2wts.py
# @Time : 2020/11/26 22:58
# @Author : J.
# @desc : export the weights of a .pb graph into a .wts file
import struct

import tensorflow as tf
import torch
from tensorflow.python.framework import tensor_util
from tensorflow.python.platform import gfile

# path to your .pb file
GRAPH_PB_PATH = './model/kws.pb'
GRAPH_WTS_PATH = './model/kws.wts'

# TF1-style graph loading
with tf.Session() as sess:
    print("load graph")
    with gfile.FastGFile(GRAPH_PB_PATH, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        sess.graph.as_default()
        tf.import_graph_def(graph_def, name='')
        graph_nodes = [n for n in graph_def.node]

# every Const node holds weights (or other baked-in constants)
consts = [n for n in graph_nodes if n.op == 'Const']
weights = {}
for n in consts:
    v = n.attr['value']
    print(n.name)
    ar = tensor_util.MakeNdarray(v.tensor)
    weights[n.name] = torch.Tensor(ar)

f = open(GRAPH_WTS_PATH, 'w')
f.write("{}\n".format(len(weights.keys())))
for k, v in weights.items():
    print('key: ', k)
    print('value: ', v.shape)
    if v.ndim == 4:
        # conv kernels: TF stores HWIO (NHWC graphs), TRT wants OIHW (NCHW);
        # the three swaps permute (H, W, I, O) -> (O, I, H, W)
        v = v.transpose(3, 0).transpose(2, 1).transpose(3, 2)
        vr = v.reshape(-1).cpu().numpy()
    elif v.ndim == 2:
        # fully connected weights: TF stores (in, out), TRT wants (out, in)
        v = v.transpose(1, 0)
        vr = v.reshape(-1).cpu().numpy()
    else:
        # biases and other 1-D constants need no permutation
        vr = v.reshape(-1).cpu().numpy()
    # one line per tensor: name, element count, big-endian float32 hex values
    f.write("{} {}".format(k, len(vr)))
    for vv in vr:
        f.write(" ")
        f.write(struct.pack(">f", float(vv)).hex())
    f.write("\n")
f.close()
4. Take the audio's MFCC as the model input, then build the network layer by layer.
Reshape:
ILayer* reshape(INetworkDefinition* network, ITensor& input, Dims dims) {
    IShuffleLayer* shuffleLayer = network->addShuffle(input);
    assert(shuffleLayer);
    shuffleLayer->setReshapeDimensions(dims);
    return shuffleLayer;
}
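The conv builders below call two helpers this post does not list: paddingSize and addBatchNorm2d. Here are minimal sketches; paddingSize follows TF's "SAME" padding formula, and addBatchNorm2d folds batch norm into an IScaleLayer the way the tensorrtx samples do. The gamma/beta/moving_mean/moving_variance key names are my assumption about how the graph exports its BN constants; match them to the keys actually present in your .wts file.
// Total "SAME" padding along one axis:
// pad = max((ceil(in / stride) - 1) * stride + kernel - in, 0)
int paddingSize(int inputSize, int kernelSize, int stride) {
    int outSize = (inputSize + stride - 1) / stride;  // ceil(inputSize / stride)
    int pad = (outSize - 1) * stride + kernelSize - inputSize;
    return pad > 0 ? pad : 0;
}

// Fold batch norm into a per-channel scale/shift:
// scale = gamma / sqrt(var + eps), shift = beta - mean * scale.
// The malloc'd buffers must stay alive until the engine is built.
IScaleLayer* addBatchNorm2d(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,
        std::string lname, float eps) {
    // key names below are an assumption; check your exported graph
    const float* gamma = (const float*)weightMap[lname + "/batch_norm/gamma"].values;
    const float* beta  = (const float*)weightMap[lname + "/batch_norm/beta"].values;
    const float* mean  = (const float*)weightMap[lname + "/batch_norm/moving_mean"].values;
    const float* var   = (const float*)weightMap[lname + "/batch_norm/moving_variance"].values;
    int len = weightMap[lname + "/batch_norm/gamma"].count;
    float* scval = (float*)malloc(sizeof(float) * len);
    float* shval = (float*)malloc(sizeof(float) * len);
    float* pval  = (float*)malloc(sizeof(float) * len);
    for (int i = 0; i < len; i++) {
        scval[i] = gamma[i] / sqrtf(var[i] + eps);
        shval[i] = beta[i] - mean[i] * scval[i];
        pval[i]  = 1.0f;
    }
    Weights scale{ DataType::kFLOAT, scval, len };
    Weights shift{ DataType::kFLOAT, shval, len };
    Weights power{ DataType::kFLOAT, pval, len };
    IScaleLayer* bn = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);
    assert(bn);
    return bn;
}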
Conv + BN + ReLU:
ILayer* convBnReLU(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,
        std::string lname, int nbOutputMaps, Dims kernelSize, Dims strideSize, int groupSize) {
    Dims d = input.getDimensions();  // implicit batch: d.d[0]=C, d.d[1]=H, d.d[2]=W
    IConvolutionLayer* conv = network->addConvolutionNd(input, nbOutputMaps,
            DimsHW{ kernelSize.d[0], kernelSize.d[1] },
            weightMap[lname + "/weights"], weightMap[lname + "/biases"]);
    assert(conv);
    conv->setStrideNd(DimsHW{ strideSize.d[0], strideSize.d[1] });
    // TF "SAME" padding can be asymmetric: split the total padding so the
    // extra pixel goes to the post (bottom/right) side, as TF does
    int padSizeL = paddingSize(d.d[1], kernelSize.d[0], strideSize.d[0]);
    int padSizeT = paddingSize(d.d[2], kernelSize.d[1], strideSize.d[1]);
    int postPaddingL = ceil(padSizeL / 2.0);
    int postPaddingT = ceil(padSizeT / 2.0);
    int prePaddingL = padSizeL - postPaddingL;
    int prePaddingT = padSizeT - postPaddingT;
    if (prePaddingL > 0 || prePaddingT > 0)
        conv->setPrePadding(DimsHW{ prePaddingL, prePaddingT });
    if (postPaddingL > 0 || postPaddingT > 0)
        conv->setPostPadding(DimsHW{ postPaddingL, postPaddingT });
    conv->setNbGroups(groupSize);
    IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname, 1e-3);
    IActivationLayer* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);
    assert(relu);
    return relu;
}
Depthwise separable conv (depthwiseConvolutionNd):
ILayer* depthwiseConvolutionNd(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,
        std::string lname, int nbOutputMaps, Dims kernelSize, Dims strideSize) {
    Dims d = input.getDimensions();
    int size = d.d[0];  // channel count; one group per channel makes the conv depthwise
    IConvolutionLayer* conv = network->addConvolutionNd(input, size,
            DimsHW{ kernelSize.d[0], kernelSize.d[1] },
            weightMap[lname + "/dw_conv/depthwise_weights"], weightMap[lname + "/dw_conv/biases"]);
    assert(conv);
    conv->setStrideNd(DimsHW{ strideSize.d[0], strideSize.d[1] });
    // same asymmetric "SAME" padding split as in convBnReLU
    int padSizeL = paddingSize(d.d[1], kernelSize.d[0], strideSize.d[0]);
    int padSizeT = paddingSize(d.d[2], kernelSize.d[1], strideSize.d[1]);
    int postPaddingL = ceil(padSizeL / 2.0);
    int postPaddingT = ceil(padSizeT / 2.0);
    int prePaddingL = padSizeL - postPaddingL;
    int prePaddingT = padSizeT - postPaddingT;
    if (prePaddingL > 0 || prePaddingT > 0)
        conv->setPrePadding(DimsHW{ prePaddingL, prePaddingT });
    if (postPaddingL > 0 || postPaddingT > 0)
        conv->setPostPadding(DimsHW{ postPaddingL, postPaddingT });
    conv->setNbGroups(size);
    IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + "/dw_conv", 1e-3);
    IActivationLayer* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);
    assert(relu);
    // pointwise half of the depthwise-separable block: a 1x1 standard conv
    Dims kernelDim = DimsHW{ 1, 1 };
    Dims strideDim = DimsHW{ 1, 1 };
    return convBnReLU(network, weightMap, *relu->getOutput(0), lname + "/pw_conv", nbOutputMaps, kernelDim, strideDim, 1);
}
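With these builders, the DS-CNN body is just one convBnReLU stem followed by a stack of depthwise-separable blocks. A hedged sketch of how the calls chain; INPUT_BLOB_NAME, the 49x10 MFCC shape, the "DS-CNN/conv_1"/"DS-CNN/conv_ds_N" names and the filter/kernel/stride numbers are all placeholders here, so read the real values off your own graph (step 1) and the keys in your .wts file:
// placeholder shapes and names; adjust to your model
ITensor* data = network->addInput(INPUT_BLOB_NAME, DataType::kFLOAT, Dims3{ 1, 49, 10 });
ILayer* x = convBnReLU(network, weightMap, *data, "DS-CNN/conv_1", 64, DimsHW{ 10, 4 }, DimsHW{ 2, 2 }, 1);
x = depthwiseConvolutionNd(network, weightMap, *x->getOutput(0), "DS-CNN/conv_ds_1", 64, DimsHW{ 3, 3 }, DimsHW{ 1, 1 });
x = depthwiseConvolutionNd(network, weightMap, *x->getOutput(0), "DS-CNN/conv_ds_2", 64, DimsHW{ 3, 3 }, DimsHW{ 1, 1 });
// ... remaining conv_ds blocks, then the average pool, fc1 and softmax below ...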
FC:
IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool->getOutput(0), OUTPUT_SIZE, weightMap["DS-CNN/fc1/weights"], weightMap["DS-CNN/fc1/biases"]);
assert(fc1);
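The pool tensor above comes from a global average-pooling layer the post does not show. A plausible form, continuing the sketch above (an assumption: x is the last depthwise block):
// global average pool over the whole remaining HxW feature map, so the
// fully connected layer sees one value per channel
Dims pd = x->getOutput(0)->getDimensions();
IPoolingLayer* pool = network->addPoolingNd(*x->getOutput(0), PoolingType::kAVERAGE, DimsHW{ pd.d[1], pd.d[2] });
assert(pool);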
Softmax:
ISoftMaxLayer *softmax = network->addSoftMax(*fc1->getOutput(0));
assert(softmax);
5. Once the network is built, use the generated wts file to build the engine file, and finally run inference. See https://github.com/wang-xinyu/tensorrtx, whose author has implemented many common network architectures.
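For reference, the usual tensorrtx-style build-and-serialize flow looks like this; a minimal sketch assuming TensorRT 7-era APIs, with gLogger and OUTPUT_BLOB_NAME being the usual tensorrtx conventions rather than anything from this post:
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetworkV2(0U);  // implicit batch
// ... addInput and build the DS-CNN layers as shown above ...
softmax->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*softmax->getOutput(0));

builder->setMaxBatchSize(1);
IBuilderConfig* config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(16 * (1 << 20));  // 16 MB is plenty for this small model
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
assert(engine);

// serialize to disk; deserialize later with IRuntime for inference
IHostMemory* modelStream = engine->serialize();
std::ofstream p("./model/kws.engine", std::ios::binary);
p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());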
The adapted TF_MFCC, implementing the full DecodeWav -> AudioSpectrogram -> Mfcc pipeline: https://download.youkuaiyun.com/download/haiyangyunbao813/13669393
Final notes
1) Before building the network, it helps to sketch the architecture on paper first and then build it step by step; when the network has many layers, this keeps things from getting confusing.
2) When exporting the weights, mind the layout conversion. TF uses NHWC while TRT uses NCHW, so the data must be transposed, otherwise every inference result will be wrong.
3) If the inference results disagree with TF's, mark some intermediate layer as the final output and compare it against the corresponding TF node; checking layer by layer like this pinpoints exactly which node went wrong (see the sketch after this list).
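For tip 3, the probe is nothing more than an extra marked output; a minimal sketch, where bn stands for whichever intermediate layer you want to inspect:
// Temporarily expose an intermediate tensor as a network output so its
// activations can be copied back to host and diffed against the matching
// TF node (e.g. via sess.run on that node).
ITensor* probe = bn->getOutput(0);
probe->setName("debug_out");
network->markOutput(*probe);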
References
1. https://github.com/huanghao128/tensorflow-mfcc : a C++ implementation of TF's MFCC that runs without TensorFlow.
2. https://github.com/wang-xinyu/tensorrtx : TensorRT implementations of many common network models.