AISHELL-ASR0009-OS1: An Open-Source Mandarin Speech Corpus

This article introduces AISHELL-1, an open-source Mandarin speech recognition dataset released by AISHELL Technology. The dataset contains 178 hours of Mandarin recordings from 400 speakers across different regions of China and is intended to advance Mandarin speech recognition research. The data are of high quality, with transcription accuracy above 95%, and can be used to build and evaluate speech recognition systems. AISHELL-1 is by far the largest open-source Mandarin speech recognition corpus; it is open to both academia and industry and helps bridge the gap between research and production.


ABSTRACT

An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including audio capturing devices and environments, is presented in detail. The preparation of related resources, including transcriptions and the lexicon, is described. The corpus is released together with a Kaldi recipe. Experimental results imply that the quality of the audio recordings and transcriptions is promising.

Index Terms— Speech Recognition, Mandarin Corpus, Open-Source Data

INTRODUCTION

Automatic Speech Recognition (ASR) has been an active research topic for several decades. Most state-of-the-art ASR systems benefit from powerful statistical models, such as Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Deep Neural Networks (DNN). These statistical frameworks often require a large amount of high-quality data. Luckily, with the wide adoption of smartphones and the emerging market of various smart devices, real user data are generated worldwide every day, so collecting data has become easier than ever before. By combining sufficient amounts of real data with supervised training, the statistical approach has achieved great success across the speech industry.

However, for legal and commercial reasons, most companies are unwilling to share their data with the public: large industrial datasets are often inaccessible to the academic community, which leads to a divergence between research and industry. On one hand, researchers are interested in fundamental problems such as designing new model structures or combating over-fitting under limited data. Such innovations and tricks from academic papers sometimes prove ineffective when the dataset gets much larger; different scales of data lead to different stories. On the other hand, industrial developers are more concerned with building products and infrastructure that can quickly accumulate real user data, then feeding the collected data into simple algorithms such as logistic regression and deep learning.

In the ASR community, the OpenSLR project was established to alleviate this problem. For English ASR, industrial-sized datasets such as TED-LIUM and LibriSpeech offer open platforms on which both researchers and industrial developers can experiment and compare system performance. Unfortunately, for Chinese ASR the only open-source corpus has been THCHS30, released by Tsinghua University, containing 50 speakers and around 30 hours of Mandarin speech. Generally speaking, Mandarin ASR systems trained on a small dataset like THCHS30 are not expected to perform well. In this paper, we present the AISHELL-1 corpus. To the authors' knowledge, AISHELL-1 is by far the largest open-source Mandarin ASR corpus.

This Open Source Mandarin Speech Corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is a part of AISHELL-ASR0009, whose utterances cover 11 domains, including smart home, autonomous driving, and industrial production. All recording took place in a quiet indoor environment, using three devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). The high-fidelity audio was re-sampled to 16 kHz to build AISHELL-ASR0009-OS1. 400 speakers from different accent regions of China were invited to participate in the recording. Through professional speech annotation and strict quality inspection, the manual transcription accuracy is above 95%. The corpus is divided into training, development, and test sets. (The database is free for academic research; commercial use without permission is not allowed.)
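The 44.1 kHz high-fidelity recordings were re-sampled to 16 kHz for this release. A minimal sketch of such a conversion, not the corpus's actual pipeline, using `scipy.signal.resample_poly` (the ratio 16000/44100 reduces to 160/441):

```python
import numpy as np
from scipy.signal import resample_poly

SRC_RATE = 44100  # high-fidelity microphone rate
DST_RATE = 16000  # release rate of AISHELL-ASR0009-OS1

def downsample(audio: np.ndarray) -> np.ndarray:
    """Polyphase resampling from 44.1 kHz to 16 kHz (ratio 160/441)."""
    return resample_poly(audio, up=160, down=441)

# One second of a 440 Hz test tone at the source rate.
t = np.arange(SRC_RATE) / SRC_RATE
tone = np.sin(2 * np.pi * 440 * t)
resampled = downsample(tone)
print(len(resampled))  # one second of output at 16 kHz -> 16000 samples
```

Polyphase resampling applies an anti-aliasing filter as part of the rational-ratio conversion, which matters here because the 44.1 kHz recordings contain frequencies above the 8 kHz Nyquist limit of the 16 kHz target.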

178 Hours

400 speakers in the recording

Speech & Speaker Recognition experiments

Released with a Kaldi recipe

AISHELL — focused on innovation in AI big data and technology. Founded in 2017, Beijing AISHELL Technology Co., Ltd. (北京希尔贝壳科技有限公司) is an innovative company specializing in AI big-data and technology services. It produces scenario-specific speech data for smart-home, in-car, robotics, and other voice products, and delivers end-to-end solutions. Built on its machine-learning platform, it has established leading core technology in speech-data evaluation, transcription assistance, data analytics, and intelligent voice customer service. http://www.aishelltech.com/kysjcp

### Using SpeechBrain's `aishell_prepare.py` on Windows to Process AISHELL-1

#### Step 1: Prepare the environment

1. **Install dependencies**:
   ```bash
   pip install speechbrain torchaudio
   ```
2. **Download the dataset**:
   - Manually download `data_aishell.tgz` (about 15 GB) from [OpenSLR](http://www.openslr.org/33/)
   - Create the directory structure:
     ```bash
     mkdir D:\AISHELL-1
     # place the downloaded archive at D:\AISHELL-1\data_aishell.tgz
     ```

#### Step 2: Get and run the script

1. **Get the script from SpeechBrain**:
   - Download [`prepare.py`](https://github.com/speechbrain/speechbrain/blob/develop/recipes/AISHELL-1/ASR/transformer/prepare.py) from the official repository
   - Save it as `D:\AISHELL-1\prepare.py`
2. **Run the script**:
   ```bash
   python prepare.py --data_folder D:\AISHELL-1 --output_folder D:\AISHELL-1\processed
   ```
   - **Arguments**:
     - `--data_folder`: path to the raw dataset
     - `--output_folder`: path for the processed output
     - `--splits`: optionally select the splits to prepare (e.g. `--splits train,dev,test`)

#### Step 3: What the script does

1. **Extracts the data**:
   ```python
   # illustrative snippet
   with tarfile.open(tar_file) as tar:
       tar.extractall(path=data_folder)
   ```
2. **Fixes text encoding**: converts the transcript files to UTF-8
3. **Generates CSV files**: creates three key mapping files:
   - `train.csv`: training-set mapping
   - `dev.csv`: development-set mapping
   - `test.csv`: test-set mapping
   - **Example CSV format**:
     ```
     ID, duration, wav, transcription
     BAC009S0002W0122, 4.5, D:\AISHELL-1\wav\S0002\BAC009S0002W0122.wav, 他 也 是 一 位 经 济 学 家
     ```

#### Step 4: Verify the output

Check the output directory:
```bash
dir D:\AISHELL-1\processed
```
It should contain:
- `wav/`: the audio files
- `train.csv`, `dev.csv`, `test.csv`: split mapping files
- `transcript.txt`: the full transcript

#### Windows-specific notes

1. **Paths**: use backslashes `\` or doubled backslashes `\\`, e.g. `D:\\AISHELL-1\\processed`
2. **Permissions**: run PowerShell/CMD as administrator if needed
3. **Long paths**: enable long-path support in the registry (for paths longer than 260 characters)
4. **Automating the download**: add download logic to the script:
   ```python
   # added to prepare.py
   import urllib.request
   urllib.request.urlretrieve(
       "http://www.openslr.org/resources/33/data_aishell.tgz",
       "data_aishell.tgz",
   )
   ```

#### Common errors

| Error | Fix |
|---|---|
| `FileNotFoundError` | Check the path spelling and directory layout |
| Encoding errors | Add `# -*- coding: utf-8 -*-` at the top of the script |
| Permission denied | Temporarily exclude the folder from antivirus monitoring |
| Extraction failure | Extract manually with 7-Zip, then run the script |

> **Key point**: SpeechBrain's data loaders read the generated CSV files directly during training; the format is compatible with its built-in `DynamicItemDataset`.
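The speaker is encoded in each AISHELL-1 utterance ID (e.g. `BAC009S0002W0122` belongs to speaker `S0002`). A minimal sketch of reading a generated CSV and grouping utterances by speaker, assuming the four-column layout shown above (the columns actually emitted by the SpeechBrain script may differ, so check the header of your file):

```python
import csv
import io
from collections import defaultdict

# Sample rows in the assumed four-column layout; forward slashes are
# used in paths for portability of the example.
SAMPLE = """ID,duration,wav,transcription
BAC009S0002W0122,4.5,D:/AISHELL-1/wav/S0002/BAC009S0002W0122.wav,他 也 是 一 位 经 济 学 家
BAC009S0003W0100,3.2,D:/AISHELL-1/wav/S0003/BAC009S0003W0100.wav,某 个 示 例 句 子
"""

def group_by_speaker(csv_text: str) -> dict:
    """Map speaker ID -> list of (utterance ID, duration in seconds)."""
    speakers = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Utterance IDs decompose as BAC009 + S0002 + W0122;
        # characters 6..10 are the speaker ID.
        speaker = row["ID"][6:11]
        speakers[speaker].append((row["ID"], float(row["duration"])))
    return dict(speakers)

by_speaker = group_by_speaker(SAMPLE)
print(sorted(by_speaker))  # ['S0002', 'S0003']
```

Grouping by speaker this way is handy for sanity checks (e.g. confirming that no speaker leaks across the train/dev/test splits) before training.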