mxnet speech_recognition: notes from the trenches

mxnet ships a speech_recognition example based on Baidu's DeepSpeech2, and I tried to train a speech recognition model with it:

https://github.com/apache/incubator-mxnet/tree/master/example/speech_recognition

This example requires building your own mxnet with WarpCTC support.

I first tried XUbuntu 19.10, where compiling warp-ctc failed with:

/usr/local/cuda/include/crt/host_config.h:138:2: error: #error -- unsupported GNU version! gcc versions later than 8 are not supported!
My machine does have a GPU. The error says gcc versions later than 8 are not supported, but XUbuntu 19.10 ships gcc 9.2.1.

So I fell back to XUbuntu 18.04, where building mxnet failed with:

CMake 3.13 or higher is required.  You are running version 3.10.2
Now cmake was too old, so I switched operating systems once more.

XUbuntu 19.04 ships cmake 3.13.4 and gcc 8.3.0, which finally satisfies both of these demanding requirements.

Build and install the warp-ctc project:

git clone https://github.com/baidu-research/warp-ctc.git
cd warp-ctc
mkdir build
cd build
cmake ../
make
sudo make install

Build mxnet:

https://mxnet.apache.org/get_started/ubuntu_setup

sudo apt install ninja-build ccache libopenblas-dev libopencv-dev
git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
cd mxnet

Edit the file make/config.mk and uncomment the following two lines:

WARPCTC_PATH = $(HOME)/warp-ctc
MXNET_PLUGINS += plugin/warpctc/warpctc.mk

To add CUDA support, also set:

USE_CUDA = 1
USE_CUDNN = 1

Start the build:

make -j8

For a GPU-enabled build, a reference command:

make -j8 USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1

Install it into the system:

cd python
pip install --user -e .

Install the companion packages:

pip install mxboard
pip install soundfile

First download two audio files and give training a try:

https://github.com/samsungsds-rnd/deepspeech.mxnet/tree/master/Libri_sample

Download that entire directory into example/speech_recognition/, then create a file Libri_sample.json with the following contents:

{"duration": 2.9450625, "text": "and sharing her house which was near by", "key": "./Libri_sample/3830-12531-0030.wav"}
{"duration": 3.94, "text": "we were able to impart the information that we wanted", "key": "./Libri_sample/3830-12529-0005.wav"}
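Each manifest line is a standalone JSON object giving the clip duration in seconds, the transcript, and the wav path. A minimal sketch of how such a line can be generated with the stdlib wave module (the helper name here is my own, not from the example):

```python
import json
import wave

def manifest_line(wav_path, text):
    """Build one manifest line: duration in seconds, transcript, and wav path."""
    with wave.open(wav_path, "rb") as w:
        duration = w.getnframes() / float(w.getframerate())
    return json.dumps({"duration": duration, "text": text, "key": wav_path})
```

The soundfile package installed earlier could compute the duration just as well; wave is used here only to keep the sketch dependency-free.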

Run the following commands:

mkdir checkpoints
mkdir log
python main.py --configfile default.cfg

If you get the error libwarpctc.so: cannot open shared object file: No such file or directory,

then run:

export LD_LIBRARY_PATH=/usr/local/lib

If mxnet was built without GPU support above, training fails with MXNetError: Compile with USE_CUDA=1 to enable GPU usage. To train on the CPU instead, edit default.cfg and set:

context = cpu0

Once training finishes you can run prediction. Copy default.cfg to predict.cfg and change

mode = train

to

mode = predict
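This copy-and-edit step can be scripted; a minimal sketch using a plain text substitution (the section layout of default.cfg is not assumed, which is why configparser is avoided):

```python
from pathlib import Path

def make_predict_cfg(src, dst):
    """Copy a config file, switching mode = train to mode = predict."""
    text = Path(src).read_text()
    Path(dst).write_text(text.replace("mode = train", "mode = predict"))
```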
Then run:

python main.py --configfile predict.cfg

I trained again on my Hasee laptop (with an RTX 2060); the prediction output was:

[    INFO][2020/01/13 16:54:22.581] label: we were able to impart the information that we wanted
[    INFO][2020/01/13 16:54:22.581] pred : we were able to impart the information that we wanted , cer: 0.000000 (distance: 0/ label length: 53)
[    INFO][2020/01/13 16:54:22.582] label: and sharing her house which was near by
[    INFO][2020/01/13 16:54:22.582] pred : and sharing her house which was near by , cer: 0.000000 (distance: 0/ label length: 39)

Completely correct. So exciting, it really can predict!

Now let's play for real: download more training data, and prepare about 250 GB of disk space:

http://www.openslr.org/resources/12/dev-clean.tar.gz
#http://www.openslr.org/resources/12/dev-other.tar.gz
http://www.openslr.org/resources/12/test-clean.tar.gz
#http://www.openslr.org/resources/12/test-other.tar.gz
http://www.openslr.org/resources/12/train-clean-100.tar.gz
#http://www.openslr.org/resources/12/train-clean-360.tar.gz
#http://www.openslr.org/resources/12/train-other-500.tar.gz

Then extract them all:

tar xvf dev-clean.tar.gz
#tar xvf dev-other.tar.gz
tar xvf test-clean.tar.gz
#tar xvf test-other.tar.gz
tar xvf train-clean-100.tar.gz
#tar xvf train-clean-360.tar.gz
#tar xvf train-other-500.tar.gz

Then transcode the flac files to wav by running example/speech_recognition/flac_to_wav.sh:

./flac_to_wav.sh

This walks the subdirectories and transcodes every flac file to wav, saving each next to the original.
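The same traversal can be sketched in Python, delegating the actual transcode to an external tool (the ffmpeg invocation here is an assumption; the repo's shell script may use a different converter or flags):

```python
import subprocess
from pathlib import Path

def flac_to_wav(root, run=subprocess.run):
    """Convert every .flac under root to a .wav next to it; return (src, dst) pairs."""
    pairs = []
    for flac in sorted(Path(root).rglob("*.flac")):
        wav = flac.with_suffix(".wav")
        # the ffmpeg call is an assumption; the repo script may use another tool
        run(["ffmpeg", "-y", "-i", str(flac), str(wav)], check=True)
        pairs.append((str(flac), str(wav)))
    return pairs
```

Injecting the run callable keeps the traversal logic testable without actually invoking a converter.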

Then clone the project https://github.com/baidu-research/ba-dls-deepspeech.git; like it or not, we still need its create_desc_json.py script:

python create_desc_json.py ~/LibriSpeech/train-clean-100 train_corpus.json
python create_desc_json.py ~/LibriSpeech/dev-clean validation_corpus.json
python create_desc_json.py ~/LibriSpeech/test-clean test_corpus.json
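create_desc_json.py essentially walks LibriSpeech's *.trans.txt files, where each line reads "utterance-id TRANSCRIPT", and pairs the transcripts with the converted wavs. A sketch of just the parsing step (the exact handling in the real script may differ):

```python
def parse_trans_file(path):
    """Yield (utterance_id, lowercased transcript) pairs from a LibriSpeech *.trans.txt."""
    with open(path) as f:
        for line in f:
            # each line: '3830-12531-0030 AND SHARING HER HOUSE ...'
            utt_id, _, text = line.strip().partition(" ")
            yield utt_id, text.lower()
```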

Then tweak deepspeech.cfg slightly:

train_json = train_corpus.json
test_json = test_corpus.json
val_json = validation_corpus.json

Start training:

mkdir checkpoints
mkdir log
python main.py --configfile deepspeech.cfg

In my tests this training is extremely memory hungry; my Hasee's 16 GB of DDR4 couldn't cope, and the process was simply "Killed" soon after starting, so I had to add 64 GB of swap:

sudo dd if=/dev/zero of=swapfile bs=64M count=1024
sudo mkswap swapfile
sudo swapon swapfile

Swap usage peaked at roughly 46 GB; adding the 16 GB of physical RAM, you probably need 64 GB or more of total memory to run this.
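The dd command above sizes the swap file as bs × count; a quick sanity check of that arithmetic:

```python
bs = 64 * 1024 ** 2              # bs=64M, in bytes
count = 1024                     # count=1024 blocks
swap_gib = bs * count / 1024 ** 3
print(swap_gib)                  # 64.0 GiB of swap
```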

Turn this config file into a predict.cfg the same way as above to try out the recognition quality.
