Abstract
This paper presents a far-field text-dependent speaker verification database named HI-MIA. We aim to meet the data requirements of far-field microphone-array-based speaker verification, since most publicly available databases are single-channel, close-talking, and text-independent. The database contains recordings of 340 people in rooms designed for the far-field scenario. Recordings are captured by multiple microphone arrays placed at different directions and distances from the speaker, as well as by a high-fidelity close-talking microphone. In addition, we propose a set of end-to-end neural network based baseline systems that adopt single-channel data for training. Moreover, we propose a test-background-aware enrollment augmentation strategy to further enhance performance. Results show that the fused systems achieve 3.29% EER on the far-field enrollment with far-field testing task and 4.02% EER on the close-talking enrollment with far-field testing task.
Index Terms: open-source database, text-dependent, multi-channel, far-field, speaker verification
Introduction
The goal of speaker verification is to verify whether a test utterance is indeed spoken by the claimed target speaker. Recently, many open and free speech databases with thousands of speakers have become publicly available. Most of these databases (e.g., AISHELL-2, LibriSpeech, VoxCeleb1&2) are recorded in close-talking conditions without significant noise. However, this recording condition does not match the far-field scenarios of real-world smart home or Internet of Things applications. Speaker verification under noisy and reverberant conditions remains a challenging topic: system performance degrades significantly in the far-field condition, where speech is recorded at an unknown direction and distance (usually between 1 m and 10 m). The same problem arises in speech recognition. Although simulation toolkits can convert close-talking speech into simulated far-field speech, a significant channel mismatch remains compared with real recordings. Moreover, the goals of front-end processing differ between speaker verification and speech recognition. Therefore, it is essential to develop an open, publicly available far-field multi-channel speaker verification database.
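To illustrate how small the simulation step mentioned above is, the following is a minimal sketch that converts close-talking speech into simulated far-field speech by convolving it with a room impulse response (RIR) and adding noise at a target SNR. The function name, arguments, and SNR value are illustrative and assume a pre-generated RIR from a measurement or a simulation toolkit; this is not part of any released toolkit:

    # Minimal sketch: simulate far-field speech from close-talking audio by
    # convolving with a room impulse response (RIR) and adding noise at a
    # target SNR. All names here are illustrative.
    import numpy as np
    from scipy.signal import fftconvolve

    def simulate_far_field(clean, rir, noise, snr_db=10.0):
        """clean, rir, noise: 1-D float arrays at the same sample rate."""
        reverbed = fftconvolve(clean, rir)[: len(clean)]   # reverberate
        noise = np.resize(noise, len(reverbed))            # tile/trim noise
        speech_pow = np.mean(reverbed ** 2) + 1e-12
        noise_pow = np.mean(noise ** 2) + 1e-12
        # Scale the noise so the speech-to-noise power ratio equals snr_db.
        scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10.0)))
        return reverbed + scale * noise

Even with realistic RIRs, such simulated data still differs from real microphone-array recordings in array geometry, device response, and noise characteristics, which is the channel mismatch noted above.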
Various approaches based on single-channel microphones or multi-channel microphone arrays have been proposed to reduce the impact of reverberation and environmental noise. These approaches address the problem at different levels of the text-independent automatic speaker verification (ASV) pipeline. At the signal level, linear-prediction inverse modulation transfer function and weighted prediction error (WPE) methods are used for dereverberation. Deep neural network (DNN) based denoising for single-channel speech enhancement and beamforming for multi-channel speech enhancement have been explored for ASV in complex environments. At the feature level, sub-band Hilbert envelope based features, warped minimum variance distortionless response (MVDR) cepstral coefficients, power-normalized cepstral coefficients (PNCC), and DNN bottleneck features have been applied to ASV systems to suppress the adverse effects of reverberation and noise. At the model level, reverberation matching with multi-condition training has achieved good performance.
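As a concrete illustration of the multi-channel signal-level methods, below is a minimal sketch of delay-and-sum beamforming. It estimates integer sample delays by plain cross-correlation against a reference channel; practical front ends usually use GCC-PHAT and fractional delays, so this is a simplified illustration rather than the method used in the cited work:

    # Minimal delay-and-sum beamformer sketch (integer delays; wrap-around
    # edge effects of np.roll are ignored for brevity).
    import numpy as np

    def delay_and_sum(multichannel, ref=0):
        """multichannel: (num_channels, num_samples) float array."""
        num_ch, n = multichannel.shape
        out = np.zeros(n)
        for ch in range(num_ch):
            # Lag that best aligns this channel with the reference channel.
            corr = np.correlate(multichannel[ch], multichannel[ref], mode="full")
            delay = int(np.argmax(corr)) - (n - 1)
            # Undo the delay so all channels add coherently.
            out += np.roll(multichannel[ch], -delay)
        return out / num_ch

Aligning the channels before averaging reinforces speech arriving from the speaker's direction while averaging down uncorrelated noise.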
Deep learning has greatly advanced speaker verification. Systems have improved significantly in moving from the traditional i-vector approach to the DNN-based x-vector approach, and CNN-based networks have recently also performed well on the speaker verification task. However, both traditional and deep learning approaches are data-driven and require large amounts of training data. The lack of real-world microphone-array far-field data limits the development and application of far-field speaker verification in different scenarios.
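For reference, the sketch below outlines an x-vector-style embedding network: frame-level TDNN layers (dilated 1-D convolutions) followed by statistics pooling and segment-level layers. The layer sizes follow the common recipe but are illustrative, not the exact configuration of any published system:

    # x-vector-style embedding network sketch (PyTorch); sizes are illustrative.
    import torch
    import torch.nn as nn

    class XVector(nn.Module):
        def __init__(self, feat_dim=30, emb_dim=512, num_speakers=1000):
            super().__init__()
            # Frame-level TDNN layers as dilated 1-D convolutions.
            self.frame = nn.Sequential(
                nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
                nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
                nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
            )
            # Segment-level layers after statistics pooling (mean + std -> 3000).
            self.embedding = nn.Linear(3000, emb_dim)
            self.classifier = nn.Linear(emb_dim, num_speakers)

        def forward(self, x):
            # x: (batch, feat_dim, num_frames) acoustic features.
            h = self.frame(x)
            stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
            emb = self.embedding(stats)   # the speaker embedding (x-vector)
            return self.classifier(emb), emb

At test time the classifier head is discarded and embeddings are compared with a backend such as cosine similarity or PLDA.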
In this paper, we introduce HI-MIA, a database of wake-up word recordings in the smart home scenario. The database covers 340 speakers and a wide range of channels, from close-talking microphones to multiple far-field microphone arrays. It can be used for far-field wake-up word recognition, far-field speaker verification, and speech enhancement. In addition, we provide a set of baseline systems trained on the far-field speaker verification data in a transfer learning manner. With a model pre-trained on large-scale simulated far-field data, the systems perform well on both the far-field enrollment with far-field testing task and the close-talking enrollment with far-field testing task. With enrollment data augmentation, the close-talking enrollment performance is further improved.
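To make the enrollment augmentation idea concrete, the sketch below corrupts each close-talking enrollment utterance with reverberation and noise resembling the test background, then averages the resulting embeddings into a single enrollment model. Here `simulate_far_field` is the simulation sketch given earlier and `embed` stands in for any trained embedding extractor; both are placeholders rather than the released HI-MIA baseline code:

    # Test-background-aware enrollment augmentation sketch (placeholders noted).
    import numpy as np

    def augmented_enrollment(utts, rirs, noises, embed, num_aug=3):
        """utts: clean enrollment waveforms; rirs/noises: pools of arrays."""
        embs = [embed(u) for u in utts]          # clean-utterance embeddings
        rng = np.random.default_rng(0)
        for u in utts:
            for _ in range(num_aug):
                rir = rirs[rng.integers(len(rirs))]
                noise = noises[rng.integers(len(noises))]
                aug = simulate_far_field(u, rir, noise,
                                         snr_db=rng.uniform(5, 15))
                embs.append(embed(aug))          # augmented-copy embeddings
        embs = [e / (np.linalg.norm(e) + 1e-12) for e in embs]
        return np.mean(embs, axis=0)             # enrollment speaker model

Matching the enrollment embeddings to the acoustic conditions of the test utterances reduces the enrollment-test mismatch that otherwise hurts close-talking enrollment.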
The AISHELL-WakeUp-1 Database
The AISHELL-WakeUp-1 database contains 1,561.12 hours of speech data, comprising 3,936,003 wake-up word utterances.
• Database language: Chinese and English
• Recording area: China
• Wake-up words for recording: “Hi, Mia” in English and “你好,米雅” in Chinese
• Speakers: 254 participants
• Environment: Real home environment
• Device setup: devices placed at 7 positions were used for recording:
1) Six 16-channel circular PDM microphone array boards (16 kHz, 16 bit) for far-field recording;
2) One high-fidelity microphone for close-talking recording (44.1 kHz, 16 bit).
The AISHELL-WakeUp-1 database was transcribed by professional speech annotators and verified through a strict quality assurance process, reaching 100% word-level transcription accuracy. It can be used for research on speaker verification, wake-up word recognition, and related tasks.
The database is released together with a Kaldi recipe for speech recognition and speaker verification evaluation.