Introduction
UltraEval-Audio is a handy toolkit for evaluating large audio/speech models. The official tutorial on custom datasets and evaluation is rather brief, so I worked through the process myself and wrote up this more detailed walkthrough.
Workflow
Evaluating on your own dataset takes three steps:
- Build the evaluation task: decide what the dataset is for. Is it measuring the accuracy of speech recognition or of emotion recognition? Built-in tasks include automatic speech recognition, speech/audio question answering, emotion recognition, and more; see registry/eval_task for the full list.
- Build the dataset: organize the source files into a structured form.
- Run the evaluation: execute main.py.
Building the evaluation task
Here I want an automatic speech recognition task. registry/eval_task already ships one, registry/eval_task/asr.yaml, so I use it as my starting point.
asr: # the original eval task config in `registry/eval_task/asr.yaml`
  class: audio_evals.base.EvalTaskCfg
  args:
    dataset: KeSpeech
    prompt: asr
    model: qwen-audio
    post_process: ['json_content']
    evaluator: wer
    agg: wer
asr-xuechao: # the eval task config I added
  class: audio_evals.base.EvalTaskCfg
  args:
    dataset: xuechao # default dataset to evaluate on
    prompt: asr-xuechao # name of the prompt config (yaml) to use
    model: qwen-audio-chat-offline # default model to evaluate
    post_process: ['json_content'] # post-processing step
    evaluator: wer # per-sample metric
    agg: wer # dataset-level aggregate (average) metric
registry/eval_task/asr.yaml already defines configs for the built-in speech recognition tasks; here I add a new entry named asr-xuechao. Following the workflow above, it specifies:
- dataset: the dataset
- prompt: the prompt
- model: the model
- post_process: the post-processing step
- evaluator: the per-sample metric
- agg: the dataset-level aggregate (average) metric
With the task config in place, we next need to write the configs it names: the dataset, prompt, model, post-processing, evaluator, and aggregator. All of them are YAML files.
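As the snippets below show, every registry entry shares one shape: a class path plus constructor args. Conceptually, the framework resolves each name to a class and instantiates it. A minimal sketch of that idea (my own illustration of the pattern, not UltraEval-Audio's actual loading code):

import importlib
import yaml

def build(registry_file: str, name: str):
    """Hypothetical helper: look up `name` in a registry yaml, import the
    class it points to, and instantiate it with the configured args."""
    with open(registry_file, encoding="utf-8") as f:
        cfg = yaml.safe_load(f)[name]
    module_name, cls_name = cfg["class"].rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**cfg.get("args", {}))

# e.g. evaluator = build("registry/evaluator/common.yaml", "wer")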
- Dataset config: registry/dataset/xuechao.yaml
xuechao: # the name the dataset is referenced by
  class: audio_evals.dataset.dataset.JsonlFile # class that loads the dataset; the default is fine
  args:
    default_task: asr-xuechao # the eval task; use the asr-xuechao entry created above
    f_name: data/xuechao/xuechao.jsonl # path to the custom dataset's jsonl file
    ref_col: Transcript # field in the jsonl that holds the reference answer
- Prompt config: registry/prompt/asr-xuechao.yaml
asr-xuechao:
  class: audio_evals.prompt.base.Prompt
  args:
    template:
      - role: user
        contents:
          - type: audio
            value: "{{WavPath}}"
          - type: text
            value: "listen the audio, output the audio content with format {\"content\": \"\"}"
- Model config: use the built-in qwen-audio-chat-offline entry defined in registry/model/offline.yaml
qwen-audio-chat-offline:
  class: audio_evals.models.offline_model.OfflineModel
  args:
    is_chat: True
    path: Qwen/Qwen-Audio-Chat
    sample_params:
      do_sample: false
      max_new_tokens: 256
      min_new_tokens: 1
      length_penalty: 1.0
      num_return_sequences: 1
      repetition_penalty: 1.0
      use_cache: True
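This entry runs the model locally. Under the hood it amounts to the standard Qwen-Audio-Chat usage from the model card, with the sample_params presumably forwarded as generation arguments; roughly (a sketch, not the wrapper's actual code):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-Audio-Chat", device_map="cuda", trust_remote_code=True).eval()

# interleave audio and text, mirroring the prompt template above
query = tokenizer.from_list_format([
    {"audio": "data/xuechao/wavs/1.wav"},
    {"text": 'listen the audio, output the audio content with format {"content": ""}'},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)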
- Post-processing: use the built-in json_content entry defined in registry/process/base.yaml
json_content:
  class: audio_evals.process.base.ContentExtract
  args: {}
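The prompt asks the model to answer as {"content": "..."}, and json_content is meant to pull the transcription out of that JSON envelope. A rough sketch of such an extractor (the real ContentExtract may differ, e.g. in how it handles answers with no parsable JSON):

import json
import re

def extract_content(answer: str) -> str:
    """Return the "content" field of a JSON object embedded in the answer;
    fall back to the raw answer when no parsable JSON is found."""
    match = re.search(r"\{.*\}", answer, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))["content"]
        except (json.JSONDecodeError, KeyError):
            pass
    return answer

print(extract_content('{"content": "hello world"}'))  # -> hello world
print(extract_content("no json here"))                # -> no json here

As the per-sample logs below show, Qwen-Audio-Chat did not actually follow the JSON format in this run, so the post-processed output is identical to the raw inference output.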
- Evaluator: use the built-in wer entry from registry/evaluator/common.yaml
wer:
  class: audio_evals.evaluator.wer.WER
  args:
    ignore_case: true
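WER is the word-level edit distance between prediction and reference divided by the reference length. A minimal self-contained sketch with the ignore_case behavior (the built-in evaluator appears to also strip punctuation, judging from the scores below):

def wer(pred: str, ref: str, ignore_case: bool = True) -> float:
    """Word error rate in %: (substitutions + deletions + insertions) / reference words."""
    if ignore_case:
        pred, ref = pred.lower(), ref.lower()
    p, r = pred.split(), ref.split()
    # standard Levenshtein DP over words
    d = [[0] * (len(p) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(p) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(p) + 1):
            cost = 0 if r[i - 1] == p[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(r)][len(p)] / len(r)

print(wer("ok when did your son start feeling unwell",
          "When did your son start feeling unwell"))  # 1 insertion / 7 words ≈ 14.29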
- Aggregator: use the built-in wer entry from registry/agg/naive.yaml
wer:
  class: audio_evals.agg.base.WER
  args: {}
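For WER, the usual corpus-level aggregate is the micro-average: total edit operations divided by total reference words, rather than the mean of per-sample WERs. The final results below are consistent with that convention; a sketch under that assumption:

def corpus_wer(samples):
    """samples: iterable of (edit_ops, ref_word_count) pairs; returns WER in %."""
    total_ops = sum(ops for ops, _ in samples)
    total_words = sum(n for _, n in samples)
    return 100.0 * total_ops / total_words

# the three samples from the run below: 6 errors each, references of 7/31/23 words
print(corpus_wer([(6, 7), (6, 31), (6, 23)]))  # -> 29.508...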
That completes the evaluation task configuration. Next, we build the dataset itself.
Building the dataset
I put three sample records in data/xuechao/xuechao.jsonl:
{"WavPath": "data/xuechao/wavs/1.wav", "Transcript": "When did your son start feeling unwell?"}
{"WavPath": "data/xuechao/wavs/2.wav", "Transcript": "He had a fever and chills since yesterday but his right arm started hurting more to day a few days ago he hurt his index finger while playing in the garden."}
{"WavPath": "data/xuechao/wavs/3.wav", "Transcript": "I see his temperature is quite high and i notice redness and swelling along his arm has the wound on his finger worsened."}
Here, WavPath is the path to the audio file, and Transcript is the reference answer against which wer is computed.
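Before running, it is worth sanity-checking the jsonl: every record needs the fields that the prompt template and ref_col refer to, and every audio path must exist. A small hypothetical helper (check_jsonl is my own, not part of the toolkit):

import json
import os

def check_jsonl(path, audio_col="WavPath", ref_col="Transcript"):
    """Verify each record has the required fields and a readable audio file."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):
            record = json.loads(line)
            assert audio_col in record and ref_col in record, f"line {line_no}: missing field"
            assert os.path.isfile(record[audio_col]), f"line {line_no}: audio file not found"
    print("jsonl looks good")

check_jsonl("data/xuechao/xuechao.jsonl")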
Running the evaluation
python main.py --dataset xuechao --model qwen-audio-chat-offline
The results are saved under the res/qwen-audio-chat-offline/xuechao/ directory.
- Dataset-level result (the overall score):
{'wer(%)': 29.508196721311474, 'fail_rate(%d)': 0.0}
- Per-sample results (three records in total; each logs the rendered prompt, the inference output, the post-processed output, and the per-sample evaluation). A sanity check on these numbers follows the listing.
{"type": "prompt", "id": 0, "data": {"content": [{"role": "user", "contents": [{"type": "audio", "value": "data/xuechao/wavs/1.wav"}, {"type": "text", "value": "listen the audio, output the audio content with format {\"content\": \"\"}"}]}]}}
{"type": "inference", "id": 0, "data": {"content": "OK. This is the audio content: \"When did your son start feeling unwell?\"."}}
{"type": "post_process", "id": 0, "data": {"content": "OK. This is the audio content: \"When did your son start feeling unwell?\"."}}
{"type": "eval", "id": 0, "data": {"pred": "OK. This is the audio content: \"When did your son start feeling unwell?\".", "ref": "When did your son start feeling unwell?", "wer%": 85.71428571428571}}
{"type": "prompt", "id": 1, "data": {"content": [{"role": "user", "contents": [{"type": "audio", "value": "data/xuechao/wavs/2.wav"}, {"type": "text", "value": "listen the audio, output the audio content with format {\"content\": \"\"}"}]}]}}
{"type": "inference", "id": 1, "data": {"content": "OK. This is the audio content: \"he had a fever and chills since yesterday but his right arm started hurting more to day a few days ago he hurt his index finger while playing in the garden\"."}}
{"type": "post_process", "id": 1, "data": {"content": "OK. This is the audio content: \"he had a fever and chills since yesterday but his right arm started hurting more to day a few days ago he hurt his index finger while playing in the garden\"."}}
{"type": "eval", "id": 1, "data": {"pred": "OK. This is the audio content: \"he had a fever and chills since yesterday but his right arm started hurting more to day a few days ago he hurt his index finger while playing in the garden\".", "ref": "He had a fever and chills since yesterday but his right arm started hurting more to day a few days ago he hurt his index finger while playing in the garden.", "wer%": 19.35483870967742}}
{"type": "prompt", "id": 2, "data": {"content": [{"role": "user", "contents": [{"type": "audio", "value": "data/xuechao/wavs/3.wav"}, {"type": "text", "value": "listen the audio, output the audio content with format {\"content\": \"\"}"}]}]}}
{"type": "inference", "id": 2, "data": {"content": "OK. This is the audio content: \"i see his temperature is quite high and i notice redness and swelling along his arm has the wound on his finger worsened\"."}}
{"type": "post_process", "id": 2, "data": {"content": "OK. This is the audio content: \"i see his temperature is quite high and i notice redness and swelling along his arm has the wound on his finger worsened\"."}}
{"type": "eval", "id": 2, "data": {"pred": "OK. This is the audio content: \"i see his temperature is quite high and i notice redness and swelling along his arm has the wound on his finger worsened\".", "ref": "I see his temperature is quite high and i notice redness and swelling along his arm has the wound on his finger worsened.", "wer%": 26.08695652173913}}
References
https://github.com/OpenBMB/UltraEval-Audio/blob/main/docs%2Fhow%20add%20a%20dataset.md