Speech keyword classification (speech_commands): https://github.com/tensorflow/docs/blob/master/site/en/r1/tutorials/sequences/audio_recognition.md#running_the_model_in_an_android_app
Implementation
1. Inspect the model structure (excerpt)
2. The front-end audio processing runs wave_data -> DecodeWav -> AudioSpectrogram -> Mfcc, so we first need to implement this pipeline without depending on TF. For this part see https://github.com/huanghao128/tensorflow-mfcc: the author implemented TF's MFCC in C++, so I took it, adapted it slightly, and could use it directly. Many thanks to him.
3. Export the weights (note the data layout: TF is NHWC, TRT is NCHW)
# When I wrote this code, only God and I knew what it did.
# Now, only God knows.
# @File : pb2wts.py
# @Time : 2020/11/26 22:58
# @Author : J.
# @desc : export the weights of a .pb graph into a .wts file
import struct

import tensorflow as tf
import torch
from tensorflow.python.framework import tensor_util
from tensorflow.python.platform import gfile

# path to your .pb file
GRAPH_PB_PATH = './model/kws.pb'
GRAPH_WTS_PATH = './model/kws.wts'

# TF1-style graph loading
with tf.Session() as sess:
    print("load graph")
    with gfile.FastGFile(GRAPH_PB_PATH, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
        sess.graph.as_default()
        tf.import_graph_def(graph_def, name='')
        graph_nodes = [n for n in graph_def.node]

# every Const node holds weights (or other baked-in constants)
consts = [n for n in graph_nodes if n.op == 'Const']
weights = {}
for n in consts:
    v = n.attr['value']
    print(n.name)
    ar = tensor_util.MakeNdarray(v.tensor)
    weights[n.name] = torch.Tensor(ar)

f = open(GRAPH_WTS_PATH, 'w')
f.write("{}\n".format(len(weights.keys())))
for k, v in weights.items():
    print('key: ', k)
    print('value: ', v.shape)
    if v.ndim == 4:
        # conv kernels: TF stores HWIO (NHWC graphs), TRT wants OIHW (NCHW);
        # the three swaps permute (H, W, I, O) -> (O, I, H, W)
        v = v.transpose(3, 0).transpose(2, 1).transpose(3, 2)
        vr = v.reshape(-1).cpu().numpy()
    elif v.ndim == 2:
        # fully connected weights: TF stores (in, out), TRT wants (out, in)
        v = v.transpose(1, 0)
        vr = v.reshape(-1).cpu().numpy()
    else:
        # biases and other 1-D constants need no permutation
        vr = v.reshape(-1).cpu().numpy()
    # one line per tensor: name, element count, big-endian float32 hex values
    f.write("{} {}".format(k, len(vr)))
    for vv in vr:
        f.write(" ")
        f.write(struct.pack(">f", float(vv)).hex())
    f.write("\n")
f.close()
4. Take the audio's MFCC as the model input, then build the network layer by layer.
Reshape:
ILayer* reshape(INetworkDefinition* network, ITensor& input, Dims dims) {
    IShuffleLayer* shuffleLayer = network->addShuffle(input);
    assert(shuffleLayer);
    shuffleLayer->setReshapeDimensions(dims);
    return shuffleLayer;
}
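The conv builders below call two helpers this post does not list: paddingSize and addBatchNorm2d. Here are minimal sketches; paddingSize follows TF's "SAME" padding formula, and addBatchNorm2d folds batch norm into an IScaleLayer the way the tensorrtx samples do. The gamma/beta/moving_mean/moving_variance key names are my assumption about how the graph exports its BN constants; match them to the keys actually present in your .wts file.
// Total "SAME" padding along one axis:
// pad = max((ceil(in / stride) - 1) * stride + kernel - in, 0)
int paddingSize(int inputSize, int kernelSize, int stride) {
    int outSize = (inputSize + stride - 1) / stride;  // ceil(inputSize / stride)
    int pad = (outSize - 1) * stride + kernelSize - inputSize;
    return pad > 0 ? pad : 0;
}

// Fold batch norm into a per-channel scale/shift:
// scale = gamma / sqrt(var + eps), shift = beta - mean * scale.
// The malloc'd buffers must stay alive until the engine is built.
IScaleLayer* addBatchNorm2d(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,
        std::string lname, float eps) {
    // key names below are an assumption; check your exported graph
    const float* gamma = (const float*)weightMap[lname + "/batch_norm/gamma"].values;
    const float* beta  = (const float*)weightMap[lname + "/batch_norm/beta"].values;
    const float* mean  = (const float*)weightMap[lname + "/batch_norm/moving_mean"].values;
    const float* var   = (const float*)weightMap[lname + "/batch_norm/moving_variance"].values;
    int len = weightMap[lname + "/batch_norm/gamma"].count;
    float* scval = (float*)malloc(sizeof(float) * len);
    float* shval = (float*)malloc(sizeof(float) * len);
    float* pval  = (float*)malloc(sizeof(float) * len);
    for (int i = 0; i < len; i++) {
        scval[i] = gamma[i] / sqrtf(var[i] + eps);
        shval[i] = beta[i] - mean[i] * scval[i];
        pval[i]  = 1.0f;
    }
    Weights scale{ DataType::kFLOAT, scval, len };
    Weights shift{ DataType::kFLOAT, shval, len };
    Weights power{ DataType::kFLOAT, pval, len };
    IScaleLayer* bn = network->addScale(input, ScaleMode::kCHANNEL, shift, scale, power);
    assert(bn);
    return bn;
}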
Conv + BN + ReLU:
ILayer* convBnReLU(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,
        std::string lname, int nbOutputMaps, Dims kernelSize, Dims strideSize, int groupSize) {
    Dims d = input.getDimensions();  // implicit batch: d.d[0]=C, d.d[1]=H, d.d[2]=W
    IConvolutionLayer* conv = network->addConvolutionNd(input, nbOutputMaps,
            DimsHW{ kernelSize.d[0], kernelSize.d[1] },
            weightMap[lname + "/weights"], weightMap[lname + "/biases"]);
    assert(conv);
    conv->setStrideNd(DimsHW{ strideSize.d[0], strideSize.d[1] });
    // TF "SAME" padding can be asymmetric: split the total padding so the
    // extra pixel goes to the post (bottom/right) side, as TF does
    int padSizeL = paddingSize(d.d[1], kernelSize.d[0], strideSize.d[0]);
    int padSizeT = paddingSize(d.d[2], kernelSize.d[1], strideSize.d[1]);
    int postPaddingL = ceil(padSizeL / 2.0);
    int postPaddingT = ceil(padSizeT / 2.0);
    int prePaddingL = padSizeL - postPaddingL;
    int prePaddingT = padSizeT - postPaddingT;
    if (prePaddingL > 0 || prePaddingT > 0)
        conv->setPrePadding(DimsHW{ prePaddingL, prePaddingT });
    if (postPaddingL > 0 || postPaddingT > 0)
        conv->setPostPadding(DimsHW{ postPaddingL, postPaddingT });
    conv->setNbGroups(groupSize);
    IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname, 1e-3);
    IActivationLayer* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);
    assert(relu);
    return relu;
}
Depthwise separable conv (depthwiseConvolutionNd):
ILayer* depthwiseConvolutionNd(INetworkDefinition* network, std::map<std::string, Weights>& weightMap, ITensor& input,
        std::string lname, int nbOutputMaps, Dims kernelSize, Dims strideSize) {
    Dims d = input.getDimensions();
    int size = d.d[0];  // channel count; one group per channel makes the conv depthwise
    IConvolutionLayer* conv = network->addConvolutionNd(input, size,
            DimsHW{ kernelSize.d[0], kernelSize.d[1] },
            weightMap[lname + "/dw_conv/depthwise_weights"], weightMap[lname + "/dw_conv/biases"]);
    assert(conv);
    conv->setStrideNd(DimsHW{ strideSize.d[0], strideSize.d[1] });
    // same asymmetric "SAME" padding split as in convBnReLU
    int padSizeL = paddingSize(d.d[1], kernelSize.d[0], strideSize.d[0]);
    int padSizeT = paddingSize(d.d[2], kernelSize.d[1], strideSize.d[1]);
    int postPaddingL = ceil(padSizeL / 2.0);
    int postPaddingT = ceil(padSizeT / 2.0);
    int prePaddingL = padSizeL - postPaddingL;
    int prePaddingT = padSizeT - postPaddingT;
    if (prePaddingL > 0 || prePaddingT > 0)
        conv->setPrePadding(DimsHW{ prePaddingL, prePaddingT });
    if (postPaddingL > 0 || postPaddingT > 0)
        conv->setPostPadding(DimsHW{ postPaddingL, postPaddingT });
    conv->setNbGroups(size);
    IScaleLayer* bn = addBatchNorm2d(network, weightMap, *conv->getOutput(0), lname + "/dw_conv", 1e-3);
    IActivationLayer* relu = network->addActivation(*bn->getOutput(0), ActivationType::kRELU);
    assert(relu);
    // pointwise half of the depthwise-separable block: a 1x1 standard conv
    Dims kernelDim = DimsHW{ 1, 1 };
    Dims strideDim = DimsHW{ 1, 1 };
    return convBnReLU(network, weightMap, *relu->getOutput(0), lname + "/pw_conv", nbOutputMaps, kernelDim, strideDim, 1);
}
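With these builders, the DS-CNN body is just one convBnReLU stem followed by a stack of depthwise-separable blocks. A hedged sketch of how the calls chain; INPUT_BLOB_NAME, the 49x10 MFCC shape, the "DS-CNN/conv_1"/"DS-CNN/conv_ds_N" names and the filter/kernel/stride numbers are all placeholders here, so read the real values off your own graph (step 1) and the keys in your .wts file:
// placeholder shapes and names; adjust to your model
ITensor* data = network->addInput(INPUT_BLOB_NAME, DataType::kFLOAT, Dims3{ 1, 49, 10 });
ILayer* x = convBnReLU(network, weightMap, *data, "DS-CNN/conv_1", 64, DimsHW{ 10, 4 }, DimsHW{ 2, 2 }, 1);
x = depthwiseConvolutionNd(network, weightMap, *x->getOutput(0), "DS-CNN/conv_ds_1", 64, DimsHW{ 3, 3 }, DimsHW{ 1, 1 });
x = depthwiseConvolutionNd(network, weightMap, *x->getOutput(0), "DS-CNN/conv_ds_2", 64, DimsHW{ 3, 3 }, DimsHW{ 1, 1 });
// ... remaining conv_ds blocks, then the average pool, fc1 and softmax below ...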
FC:
IFullyConnectedLayer* fc1 = network->addFullyConnected(*pool->getOutput(0), OUTPUT_SIZE, weightMap["DS-CNN/fc1/weights"], weightMap["DS-CNN/fc1/biases"]);
assert(fc1);
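The pool tensor above comes from a global average-pooling layer the post does not show. A plausible form, continuing the sketch above (an assumption: x is the last depthwise block):
// global average pool over the whole remaining HxW feature map, so the
// fully connected layer sees one value per channel
Dims pd = x->getOutput(0)->getDimensions();
IPoolingLayer* pool = network->addPoolingNd(*x->getOutput(0), PoolingType::kAVERAGE, DimsHW{ pd.d[1], pd.d[2] });
assert(pool);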
Softmax:
ISoftMaxLayer *softmax = network->addSoftMax(*fc1->getOutput(0));
assert(softmax);
5. Once the network is built, use the generated wts file to build the engine file, and finally run inference. See https://github.com/wang-xinyu/tensorrtx, whose author has implemented many common network architectures.
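For reference, the usual tensorrtx-style build-and-serialize flow looks like this; a minimal sketch assuming TensorRT 7-era APIs, with gLogger and OUTPUT_BLOB_NAME being the usual tensorrtx conventions rather than anything from this post:
IBuilder* builder = createInferBuilder(gLogger);
INetworkDefinition* network = builder->createNetworkV2(0U);  // implicit batch
// ... addInput and build the DS-CNN layers as shown above ...
softmax->getOutput(0)->setName(OUTPUT_BLOB_NAME);
network->markOutput(*softmax->getOutput(0));

builder->setMaxBatchSize(1);
IBuilderConfig* config = builder->createBuilderConfig();
config->setMaxWorkspaceSize(16 * (1 << 20));  // 16 MB is plenty for this small model
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
assert(engine);

// serialize to disk; deserialize later with IRuntime for inference
IHostMemory* modelStream = engine->serialize();
std::ofstream p("./model/kws.engine", std::ios::binary);
p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());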
The adapted TF_MFCC, implementing the full DecodeWav -> AudioSpectrogram -> Mfcc pipeline: https://download.youkuaiyun.com/download/haiyangyunbao813/13669393
Final notes
1) Before building the network, it helps to sketch the architecture on paper first and then build it step by step; when the network has many layers, this keeps things from getting confusing.
2) When exporting the weights, mind the layout conversion. TF uses NHWC while TRT uses NCHW, so the data must be transposed, otherwise every inference result will be wrong.
3) If the inference results disagree with TF's, mark some intermediate layer as the final output and compare it against the corresponding TF node; checking layer by layer like this pinpoints exactly which node went wrong (see the sketch after this list).
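For tip 3, the probe is nothing more than an extra marked output; a minimal sketch, where bn stands for whichever intermediate layer you want to inspect:
// Temporarily expose an intermediate tensor as a network output so its
// activations can be copied back to host and diffed against the matching
// TF node (e.g. via sess.run on that node).
ITensor* probe = bn->getOutput(0);
probe->setName("debug_out");
network->markOutput(*probe);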
References
1. https://github.com/huanghao128/tensorflow-mfcc : a C++ implementation of TF's MFCC that runs without TensorFlow.
2. https://github.com/wang-xinyu/tensorrtx : TensorRT implementations of many common network models.