Whisper-Medusa Tutorial - CSDN Blog

1. Project Overview

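Whisper-Medusa is an extension of the Whisper model that aims to make automatic speech recognition (ASR) faster and more efficient by adding a multi-head decoding mechanism. Whisper is an advanced encoder-decoder model for speech transcription and translation. Whisper-Medusa optimizes Whisper by predicting multiple tokens in each decoding iteration, which significantly increases speed while keeping accuracy high.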
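
The idea of predicting several tokens per iteration can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: `base_next`, `head_guesses`, and `decode` are hypothetical names, the "model" is a trivial counting rule, and the accept-the-longest-matching-prefix check is a simplified reading of Medusa-style decoding, not the code in the whisper-medusa repository.

```python
from typing import List, Tuple


def base_next(prefix: List[int]) -> int:
    """Hypothetical stand-in for the base decoder: next token = last token + 1."""
    return prefix[-1] + 1


def head_guesses(tokens: List[int], first: int, num_heads: int) -> List[int]:
    """Hypothetical extra heads: head k guesses the token k + 1 positions after `first`.

    Head index 2 is deliberately wrong so the sketch shows a rejected guess.
    """
    guesses = []
    context = tokens + [first]
    for k in range(num_heads):
        true_token = base_next(context)
        guesses.append(true_token if k != 2 else true_token + 1)
        context.append(true_token)
    return guesses


def decode(prompt: List[int], max_new_tokens: int, num_heads: int) -> Tuple[List[int], int]:
    """Return the generated tokens and the number of decoding iterations used."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        steps += 1
        first = base_next(tokens)  # ordinary next-token prediction
        accepted = [first]
        # Keep the longest run of head guesses that the base decoder agrees with.
        # A real model would check all guesses in one forward pass; this toy
        # version checks them one by one for clarity.
        for guess in head_guesses(tokens, first, num_heads):
            if guess != base_next(tokens + accepted):
                break
            accepted.append(guess)
        tokens.extend(accepted)
    new_tokens = tokens[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, steps


plain_tokens, plain_steps = decode([0], max_new_tokens=10, num_heads=0)
medusa_tokens, medusa_steps = decode([0], max_new_tokens=10, num_heads=4)
print(plain_steps, medusa_steps)
```

Both calls produce the same ten tokens, but the multi-head variant needs fewer iterations because each accepted guess saves one decoding step. In a real model the saving depends on how often the extra heads are right.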

2. Quick Start

Before you begin, make sure Python 3.11 and the required dependencies are installed.

Environment setup

conda create -n whisper-medusa python=3.11 -y
conda activate whisper-medusa
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2
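
After installing, it can help to confirm that the pinned versions actually landed in the active environment. The following is a minimal sketch using only the standard library; `EXPECTED`, `installed_versions`, and `report` are hypothetical names introduced here, and the version strings are copied from the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional

# Versions pinned by the pip command above.
EXPECTED = {"torch": "2.2.2", "torchvision": "0.17.2", "torchaudio": "2.2.2"}


def installed_versions(packages: List[str]) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def report(expected: Dict[str, str]) -> List[str]:
    """Build one status line per expected package."""
    lines = []
    for name, actual in installed_versions(list(expected)).items():
        wanted = expected[name]
        if actual is None:
            status = "missing"
        elif actual.split("+")[0] == wanted:
            # Ignore local build suffixes such as "+cu121".
            status = "ok"
        else:
            status = f"mismatch (found {actual})"
        lines.append(f"{name}=={wanted}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(report(EXPECTED)))
```

Run it inside the activated `whisper-medusa` environment; any line that does not end in `ok` points at a package to reinstall.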

Clone and install the project

git clone https://github.com/aiola-lab/whisper-medusa.git
cd whisper-medusa
pip install -e .

Run inference with the model

import torch
import torchaudio
from whisper_medusa import WhisperMedusaModel
from transformers import WhisperProcessor

model_name = "aiola/whisper-medusa-linear-libri"
model = WhisperMedusaModel.from_pretrained(model_name)
processor = WhisperProcessor.from_pretrained(model_name)

path_to_audio = "path/to/audio.wav"
SAMPLING_RATE = 16000
language = "en"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

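# Load the audio; torchaudio returns a (channels, samples) tensor and the sample rate.
input_speech, sr = torchaudio.load(path_to_audio)
# Downmix multi-channel audio to mono by averaging the channels.
if input_speech.shape[0] > 1:
    input_speech = input_speech.mean(dim=0, keepdim=True)
# Resample to 16 kHz if the file uses a different rate.
if sr != SAMPLING_RATE:
    input_speech = torchaudio.transforms.Resample(sr, SAMPLING_RATE)(input_speech)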

# Turn the waveform into model input features, then put features and model on the same device.
input_features = processor(input_speech.squeeze(), return_tensors="pt", sampling_rate=SAMPLING_RATE).input_features
input_features = input_features.to(device)
model = model.to(device)

# Generate token IDs and decode the first sequence back to text.
model_output = model.generate(input_features, language=language)
predict_ids = model_output[0]
pred = processor.decode(predict_ids, skip_special_tokens=True)

print(pred)
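
Since the point of Whisper-Medusa is speed, it may be worth measuring how long generation takes on your own audio. Below is a minimal timing sketch using only the standard library; `time_call` and `fake_generate` are hypothetical helpers introduced here, and the stand-in workload exists only so the snippet runs on its own.

```python
import time
from statistics import mean
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any, repeats: int = 3, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn `repeats` times; return the last result and the mean wall time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, mean(durations)


def fake_generate(n: int) -> int:
    """Stand-in workload (an assumption) so the sketch is self-contained."""
    return sum(range(n))


output, seconds = time_call(fake_generate, 100_000)
print(f"mean latency: {seconds:.4f}s")
```

With the script above, the call would look like `time_call(model.generate, input_features, language=language)`. On a GPU, CUDA work is launched asynchronously, so a fair measurement would also call `torch.cuda.synchronize()` before each clock reading. The first run usually includes warm-up cost and is best discarded.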