wav2vec2 Pitfall Notes, Part 5: How to Build a transformers Dataset
Abstract
This post records the main steps of building a transformers dataset, using the thchs30 Chinese ASR dataset as an example and mimicking the LibriSpeech format, so the result can be used to fine-tune a wav2vec2 model. It addresses two core questions:
- How do you define a custom dataset for transformers?
- How do you use a local dataset?
The post follows the official dataset-addition guide and records the pitfalls along the way; I hope it is of some use to you.
Note:
For privacy, ** is used in place of the username; if ** gets in your way, substitute your own path. ~ is not used because the transformers code does not expand ~.
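If you still want to write paths with ~ in your own scripts, you can expand it yourself before handing the path to the library. A minimal sketch in plain Python (the corpus path here is a hypothetical example, nothing from the original post):

import os

# ~ is not expanded automatically by the library code, so expand it explicitly
corpus_root = os.path.expanduser("~/corpora/data_thchs30")
print(corpus_root)  # e.g. /home/<user>/corpora/data_thchs30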
1. Building SLR18 thchs30
thchs30 is a classic Chinese ASR dataset, available for download on OpenSLR.
Step 1.1 Create the project and generate the code scaffolding
First, the procedure for building your own dataset: following the official guide, clone the code and create the code structure.
#clone the code
git clone https://github.com/<your Github handle>/datasets
cd datasets
git remote add upstream https://github.com/huggingface/datasets.git
#add a new dataset package
mkdir ./datasets/slr18
#add the README / dataset card
cp ./templates/README.md ./datasets/slr18/README.md
#generate the dataset script file
cp ./templates/new_dataset_script.py ./datasets/slr18/slr18.py
#use this dataset: see the loading sketch below
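Once the package exists (and the script from Step 1.2 is in place), the local script can be loaded directly with datasets.load_dataset. A minimal sketch, assuming you run it from the root of the cloned datasets repo; the "thch30" configuration name and the "id"/"file" fields are the ones defined in slr18.py below:

import datasets

# Point load_dataset at the local dataset script and pick a configuration
ds = datasets.load_dataset("./datasets/slr18/slr18.py", "thch30", split="train")
print(ds[0])  # expect fields such as "id" and "file"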
Step 1.2 Write the dataset script
After a fair amount of analysis, the finished dataset script looks like this:
# coding=utf-8
# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#datasets/slr18/slr18.py
"""
THCHS-30 dataset interface defined by bostenai
Dataset URL: http://openslr.org/18/
Identifier: SLR18
Summary: A Free Chinese Speech Corpus Released by CSLT@Tsinghua University
Category: Speech
License: Apache License v.2.0
@misc{THCHS30_2015,
title={THCHS-30 : A Free Chinese Speech Corpus},
author={Dong Wang, Xuewei Zhang, Zhiyong Zhang},
year={2015},
url={http://arxiv.org/abs/1512.01882}
}
"""
from __future__ import absolute_import, division, print_function
import os
import fnmatch
from multiprocessing import Pool
from functools import partial
import datasets
_CITATION = """\
@InProceedings{huggingface:dataset,
title = {THCHS-30},
author={bostenai, Inc.
},
year={2021}
}
"""
# From http://openslr.org/18/ plus my own understanding
_DESCRIPTION = """\
THCHS30 is an open Chinese speech database published by Center for Speech and Language Technology (CSLT) at Tsinghua University. The original recording was conducted in 2002 by Dong Wang, supervised by Prof. Xiaoyan Zhu, at the Key State Lab of Intelligence and System, Department of Computer Science, Tsinghua University, and the original name was 'TCMSD', standing for 'Tsinghua Continuous Mandarin Speech Database'. The publication after 13 years has been initiated by Dr. Dong Wang and was supported by Prof. Xiaoyan Zhu. We hope to provide a toy database for new researchers in the field of speech recognition. Therefore, the database is totally free to academic users. You can cite the data using the following BibTeX entry:
@misc{THCHS30_2015,
title={THCHS-30 : A Free Chinese Speech Corpus},
author={Dong Wang, Xuewei Zhang, Zhiyong Zhang},
year={2015},
url={http://arxiv.org/abs/1512.01882}
}
This dataset wrapper also supports a local data source: set the environment variable SLR18_Corpus to the root directory of the extracted dataset, e.g.
export SLR18_Corpus=/path/to/slr18
The structure of the extracted SLR18 directory should be:
data_thchs30/
    data/
        *.wav
        *.trn
    train/
        *.wav
        *.trn
    dev/
    test/
    lm_phone/
    lm_word/
"""
# These URLs point to openslr; if you need faster downloads, point them at a mirror of your choice
_HOMEPAGE = "http://openslr.org/18/"
# license copied from thchs30
_LICENSE = "Apache License v.2.0"
# There are three configurations: text returns the Chinese transcript / full pinyin / pinyin with initials and finals separated
_URLs = {
'thch30': "https://www.openslr.org/resources/18/data_thchs30.tgz",
'pinyin1': "https://www.openslr.org/resources/18/data_thchs30.tgz",
'pinyin2': "https://www.openslr.org/resources/18/data_thchs30.tgz",
}
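# Note (illustrative assumption, not shown in this part of the script): the local
# data source described in _DESCRIPTION is expected to be picked up from the
# SLR18_Corpus environment variable later on, roughly like
#     local_root = os.environ.get("SLR18_Corpus")
# and only if it is unset would the archive from _URLs be downloaded via the
# dl_manager in _split_generators.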
# Main dataset class
class Slr18Dataset(datasets.GeneratorBasedBuilder):
    """thchs30 dataset wrapper built by BostenAI"""

    VERSION = datasets.Version("1.1.0")
    BUILDER_CONFIGS = [
        datasets.BuilderConfig(name="thch30", version=VERSION, description="basic thchs30 data: speech data and transcripts, text is the Chinese transcript"),
        datasets.BuilderConfig(name="pinyin1", version=VERSION, description="basic thchs30 data, text is the full pinyin, with the tone marker placed at the end"),
        datasets.BuilderConfig(name="pinyin2", version=VERSION, description="basic thchs30 data, text is pinyin with initials and finals separated, tone marker placed at the end"),
    ]
    DEFAULT_CONFIG_NAME = "thch30"  # It's not mandatory to have a default configuration. Just use one if it makes sense.

    def _info(self):
        # This method specifies the datasets.DatasetInfo object, which contains the information and typings for the dataset
        if self.config.name == "thch30":  # This is the name of the configuration selected in BUILDER_CONFIGS above
            features = datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "file": datasets.Value("string")