AISHELL-ASR0009-OS1: An Open-Source Mandarin Speech Corpus

This article introduces AISHELL-1, an open-source Mandarin speech recognition dataset released by AISHELL Technology. The dataset contains 178 hours of Mandarin recordings from 400 speakers across different regions of China and is intended to advance Mandarin speech recognition research. The data are of high quality, with transcription accuracy above 95%, and can be used to build and evaluate speech recognition systems. AISHELL-1 is by far the largest open-source Mandarin speech recognition corpus; it is open to both academia and industry and helps bridge the gap between research and production.


ABSTRACT

An open-source Mandarin speech corpus called AISHELL-1 is released. It is by far the largest corpus suitable for conducting speech recognition research and building speech recognition systems for Mandarin. The recording procedure, including audio capturing devices and environments, is presented in detail. The preparation of related resources, including transcriptions and the lexicon, is described. The corpus is released together with a Kaldi recipe. Experimental results imply that the quality of the audio recordings and transcriptions is promising.

Index Terms— Speech Recognition, Mandarin Corpus, Open-Source Data

INTRODUCTION

Automatic Speech Recognition (ASR) has been an active research topic for several decades. Most state-of-the-art ASR systems benefit from powerful statistical models, such as Gaussian Mixture Models (GMM), Hidden Markov Models (HMM), and Deep Neural Networks (DNN). These statistical frameworks often require a large amount of high-quality data. Luckily, with the wide adoption of smartphones and the emerging market of various smart devices, real user data are generated worldwide every day, so collecting data has become easier than ever before. By combining sufficient amounts of real data with supervised training, the statistical approach has achieved great success across the speech industry.

However, for legal and commercial reasons, most companies are unwilling to share their data with the public: large industrial datasets are often inaccessible to the academic community, which leads to a divergence between research and industry. On one hand, researchers are interested in fundamental problems such as designing new model structures or combating over-fitting under limited data. Such innovations and tricks from academic papers sometimes prove ineffective when the dataset gets much larger; different scales of data lead to different stories. On the other hand, industrial developers are more concerned with building products and infrastructure that can quickly accumulate real user data, then feeding the collected data into simple algorithms such as logistic regression and deep learning.

In the ASR community, the OpenSLR project was established to alleviate this problem. For English ASR, industrial-sized datasets such as TED-LIUM and LibriSpeech offer open platforms on which both researchers and industrial developers can experiment and compare system performance. Unfortunately, for Chinese ASR the only open-source corpus has been THCHS30, released by Tsinghua University, containing 50 speakers and around 30 hours of Mandarin speech. Generally speaking, Mandarin ASR systems trained on a small dataset like THCHS30 are not expected to perform well. In this paper, we present the AISHELL-1 corpus. To the authors' knowledge, AISHELL-1 is by far the largest open-source Mandarin ASR corpus.

This Open Source Mandarin Speech Corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is a part of AISHELL-ASR0009, whose utterances cover 11 domains, including smart home, autonomous driving, and industrial production. All recording took place in a quiet indoor environment, using three devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android phone (16 kHz, 16-bit), and an iOS phone (16 kHz, 16-bit). The high-fidelity audio was re-sampled to 16 kHz to build AISHELL-ASR0009-OS1. 400 speakers from different accent regions of China were invited to participate in the recording. Through professional speech annotation and strict quality inspection, the manual transcription accuracy is above 95%. The corpus is divided into training, development, and test sets. (The database is free for academic research; commercial use without permission is not allowed.)
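The 44.1 kHz high-fidelity recordings were re-sampled to 16 kHz for this release. A minimal sketch of such a conversion, not the corpus's actual pipeline, using `scipy.signal.resample_poly` (the ratio 16000/44100 reduces to 160/441):

```python
import numpy as np
from scipy.signal import resample_poly

SRC_RATE = 44100  # high-fidelity microphone rate
DST_RATE = 16000  # release rate of AISHELL-ASR0009-OS1

def downsample(audio: np.ndarray) -> np.ndarray:
    """Polyphase resampling from 44.1 kHz to 16 kHz (ratio 160/441)."""
    return resample_poly(audio, up=160, down=441)

# One second of a 440 Hz test tone at the source rate.
t = np.arange(SRC_RATE) / SRC_RATE
tone = np.sin(2 * np.pi * 440 * t)
resampled = downsample(tone)
print(len(resampled))  # one second of output at 16 kHz -> 16000 samples
```

Polyphase resampling applies an anti-aliasing filter as part of the rational-ratio conversion, which matters here because the 44.1 kHz recordings contain frequencies above the 8 kHz Nyquist limit of the 16 kHz target.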

178 Hours

400 speakers in the recording

Speech & Speaker Recognition experiments

Released with a Kaldi recipe

AISHELL — focused on innovation in AI big data and technology. Founded in 2017, Beijing AISHELL Technology Co., Ltd. (北京希尔贝壳科技有限公司) is an innovative company specializing in AI big-data and technology services. It produces scenario-specific speech data for smart-home, in-car, robotics, and other voice products, and delivers end-to-end solutions. Built on its machine-learning platform, it has established leading core technology in speech-data evaluation, transcription assistance, data analytics, and intelligent voice customer service. http://www.aishelltech.com/kysjcp

### Using SpeechBrain's `aishell_prepare.py` on Windows to Process AISHELL-1

#### Step 1: Prepare the environment

1. **Install dependencies**:
   ```bash
   pip install speechbrain torchaudio
   ```
2. **Download the dataset**:
   - Manually download `data_aishell.tgz` (about 15 GB) from [OpenSLR](http://www.openslr.org/33/)
   - Create the directory structure:
     ```bash
     mkdir D:\AISHELL-1
     # place the downloaded archive at D:\AISHELL-1\data_aishell.tgz
     ```

#### Step 2: Get and run the script

1. **Get the script from SpeechBrain**:
   - Download [`prepare.py`](https://github.com/speechbrain/speechbrain/blob/develop/recipes/AISHELL-1/ASR/transformer/prepare.py) from the official repository
   - Save it as `D:\AISHELL-1\prepare.py`
2. **Run the script**:
   ```bash
   python prepare.py --data_folder D:\AISHELL-1 --output_folder D:\AISHELL-1\processed
   ```
   - **Arguments**:
     - `--data_folder`: path to the raw dataset
     - `--output_folder`: path for the processed output
     - `--splits`: optionally select the splits to prepare (e.g. `--splits train,dev,test`)

#### Step 3: What the script does

1. **Extracts the data**:
   ```python
   # illustrative snippet
   with tarfile.open(tar_file) as tar:
       tar.extractall(path=data_folder)
   ```
2. **Fixes text encoding**: converts the transcript files to UTF-8
3. **Generates CSV files**: creates three key mapping files:
   - `train.csv`: training-set mapping
   - `dev.csv`: development-set mapping
   - `test.csv`: test-set mapping
   - **Example CSV format**:
     ```
     ID, duration, wav, transcription
     BAC009S0002W0122, 4.5, D:\AISHELL-1\wav\S0002\BAC009S0002W0122.wav, 他 也 是 一 位 经 济 学 家
     ```

#### Step 4: Verify the output

Check the output directory:
```bash
dir D:\AISHELL-1\processed
```
It should contain:
- `wav/`: the audio files
- `train.csv`, `dev.csv`, `test.csv`: split mapping files
- `transcript.txt`: the full transcript

#### Windows-specific notes

1. **Paths**: use backslashes `\` or doubled backslashes `\\`, e.g. `D:\\AISHELL-1\\processed`
2. **Permissions**: run PowerShell/CMD as administrator if needed
3. **Long paths**: enable long-path support in the registry (for paths longer than 260 characters)
4. **Automating the download**: add download logic to the script:
   ```python
   # added to prepare.py
   import urllib.request
   urllib.request.urlretrieve(
       "http://www.openslr.org/resources/33/data_aishell.tgz",
       "data_aishell.tgz",
   )
   ```

#### Common errors

| Error | Fix |
|---|---|
| `FileNotFoundError` | Check the path spelling and directory layout |
| Encoding errors | Add `# -*- coding: utf-8 -*-` at the top of the script |
| Permission denied | Temporarily exclude the folder from antivirus monitoring |
| Extraction failure | Extract manually with 7-Zip, then run the script |

> **Key point**: SpeechBrain's data loaders read the generated CSV files directly during training; the format is compatible with its built-in `DynamicItemDataset`.
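The speaker is encoded in each AISHELL-1 utterance ID (e.g. `BAC009S0002W0122` belongs to speaker `S0002`). A minimal sketch of reading a generated CSV and grouping utterances by speaker, assuming the four-column layout shown above (the columns actually emitted by the SpeechBrain script may differ, so check the header of your file):

```python
import csv
import io
from collections import defaultdict

# Sample rows in the assumed four-column layout; forward slashes are
# used in paths for portability of the example.
SAMPLE = """ID,duration,wav,transcription
BAC009S0002W0122,4.5,D:/AISHELL-1/wav/S0002/BAC009S0002W0122.wav,他 也 是 一 位 经 济 学 家
BAC009S0003W0100,3.2,D:/AISHELL-1/wav/S0003/BAC009S0003W0100.wav,某 个 示 例 句 子
"""

def group_by_speaker(csv_text: str) -> dict:
    """Map speaker ID -> list of (utterance ID, duration in seconds)."""
    speakers = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Utterance IDs decompose as BAC009 + S0002 + W0122;
        # characters 6..10 are the speaker ID.
        speaker = row["ID"][6:11]
        speakers[speaker].append((row["ID"], float(row["duration"])))
    return dict(speakers)

by_speaker = group_by_speaker(SAMPLE)
print(sorted(by_speaker))  # ['S0002', 'S0003']
```

Grouping by speaker this way is handy for sanity checks (e.g. confirming that no speaker leaks across the train/dev/test splits) before training.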