Whisper Notes

wen_zhufeng

Last modified: 2024-12-06 13:34:26

Tags: whisper, notes

First published: 2024-12-06 11:13:58

Article link: https://blog.youkuaiyun.com/qq_42019881/article/details/144284656

1. Introduction to Whisper

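Whisper is a general-purpose automatic speech recognition (ASR) model developed and open-sourced by OpenAI. It was trained on 680,000 hours of multilingual (98 languages) and multitask supervised data, and supports multilingual speech recognition, speech translation, and language identification. Its architecture follows a simple end-to-end approach: an encoder-decoder Transformer converts audio into a text sequence, and special tokens indicate which task to perform.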

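According to OpenAI, Whisper approaches human-level robustness and accuracy on English speech recognition and copes well with diverse accents, background noise, and technical language. By releasing the model and inference code, OpenAI hopes developers will use Whisper as a foundation for building useful applications and for further research on speech processing.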

2. Available Whisper Models

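There are six model sizes, four of which also come in English-only versions, offering trade-offs between speed and accuracy. The table below lists the available models with their approximate memory requirements and inference speed relative to the large model. The relative speeds were measured by transcribing English speech on an A100; actual speed may vary significantly depending on many factors, including the language, the speaking speed, and the available hardware.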

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~10x
base	74 M	`base.en`	`base`	~1 GB	~7x
small	244 M	`small.en`	`small`	~2 GB	~4x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x
turbo	809 M	N/A	`turbo`	~6 GB	~8x
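
The trade-off in the table can be turned into a small selection helper. The following is only a sketch: the VRAM and parameter figures are copied from the table, while the function name pick_model and the rule "take the model with the most parameters that still fits" are assumptions made for illustration, not part of Whisper itself.

```python
from typing import Optional

# (name, parameters in millions, approx. VRAM in GB, has English-only variant)
# Values copied from the table above.
MODELS = [
    ("tiny", 39, 1, True),
    ("base", 74, 1, True),
    ("small", 244, 2, True),
    ("medium", 769, 5, True),
    ("large", 1550, 10, False),
    ("turbo", 809, 6, False),
]


def pick_model(vram_gb: float, english_only: bool = False) -> Optional[str]:
    """Return the model with the most parameters that fits in vram_gb.

    Hypothetical helper: returns None when nothing fits.
    """
    candidates = [
        (params, name)
        for name, params, vram, has_en in MODELS
        if vram <= vram_gb and (has_en or not english_only)
    ]
    if not candidates:
        return None
    _, name = max(candidates)  # largest parameter count wins
    return f"{name}.en" if english_only else name
```

With 6 GB available this heuristic picks turbo; restricted to English-only models on a 10 GB card it picks medium.en, since large and turbo have no English-only variant.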

The download URLs for the different model sizes are as follows:

    "tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
    "tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt",
    "base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt",
    "base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt",
    "small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt",
    "small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
    "medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt",
    "medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt",
    "large-v1": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt",
    "large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
    "large-v3": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
    "large": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
    "large-v3-turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
    "turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",

How to download: for example, to get the turbo checkpoint, paste the link https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt into a browser and the download will start.
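
After a manual download it is worth checking that the file is intact. In the URLs listed above, the long hexadecimal path segment before the file name appears to be the SHA-256 digest of the checkpoint. The sketch below relies on that assumption; the helper names expected_sha256, file_sha256 and is_valid_checkpoint are made up for this example.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def expected_sha256(url: str) -> str:
    """Return the second-to-last path segment of a checkpoint URL.

    Assumption: that segment is the SHA-256 digest of the file.
    """
    return urlparse(url).path.split("/")[-2]


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_checkpoint(path: str, url: str) -> bool:
    """True if the file exists and its digest matches the one in the URL."""
    return Path(path).is_file() and file_sha256(path) == expected_sha256(url)
```

For example, is_valid_checkpoint("large-v3-turbo.pt", url) with the turbo URL above should return True for a complete download and False for a truncated one.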

3. Basic Usage of Whisper

3.1 Simple transcription

import whisper

model = whisper.load_model("turbo")
result = model.transcribe("audio.mp3")
print(result["text"])

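In the example above, the line model = whisper.load_model("turbo") first resolves the name turbo to the download URL of the turbo checkpoint and then downloads the model file to ~/.cache/whisper (a file that is already there is reused). This approach therefore requires that your environment, that is, your server, can reach the external network.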
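
To see where a checkpoint would end up, or to check whether it is already cached, the target path can be computed ahead of time. This is a sketch under the assumption that the loader stores each file under its URL basename inside the cache directory and honours the XDG_CACHE_HOME environment variable; the helper names are hypothetical.

```python
import os
from typing import Optional


def default_download_root() -> str:
    """Assumed default cache directory: $XDG_CACHE_HOME/whisper or ~/.cache/whisper."""
    cache_home = os.getenv(
        "XDG_CACHE_HOME", os.path.join(os.path.expanduser("~"), ".cache")
    )
    return os.path.join(cache_home, "whisper")


def cached_checkpoint_path(url: str, download_root: Optional[str] = None) -> str:
    """Path where the checkpoint behind `url` is expected to be stored."""
    root = download_root or default_download_root()
    return os.path.join(root, os.path.basename(url))
```

If the default location is not suitable, whisper.load_model also takes a download_root argument, for example whisper.load_model("turbo", download_root="/data/models"), where /data/models is a placeholder directory.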

3.2 Detailed transcription code

import whisper

model = whisper.load_model("turbo")  # turbo balances hardware requirements and accuracy

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)  # probs is a dict mapping each language code to its probability
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
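
Because pad_or_trim keeps a single 30-second window, the code above only decodes the first 30 seconds of the file. For longer recordings model.transcribe is the simpler route. If you want to stay with the low-level calls, one naive option is to cut the waveform into fixed 30-second windows and decode each one. The sketch below only computes the window boundaries; it assumes 16 kHz audio (the rate whisper.load_audio resamples to), and window_bounds is a hypothetical helper, not a Whisper function.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000   # samples per second (assumed, matches whisper.load_audio)
CHUNK_SECONDS = 30     # length of one decoding window
N_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480000 samples per window


def window_bounds(total_samples: int, window: int = N_SAMPLES) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of consecutive fixed-size windows.

    The last window may be shorter; pad_or_trim would pad it to full length.
    """
    return [
        (start, min(start + window, total_samples))
        for start in range(0, total_samples, window)
    ]
```

Each slice audio[start:end] could then go through whisper.pad_or_trim, whisper.log_mel_spectrogram and whisper.decode as shown above. Fixed windows can cut a word in half at a boundary, so treat this as an illustration rather than a replacement for model.transcribe.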

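Note: loading the model by name as shown above only works if your environment can reach the network. If it cannot, download the Whisper checkpoint first and then load it from disk, as follows: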

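Suppose you have downloaded the turbo checkpoint file large-v3-turbo.pt into the directory ckpts/turbo. whisper.load_model accepts either a model name or the path to a checkpoint file (not a directory), so the simple transcription example becomes:

import whisper

# pass the path to the .pt file itself, not to the directory that contains it
model = whisper.load_model("ckpts/turbo/large-v3-turbo.pt")
result = model.transcribe("audio.mp3")
print(result["text"])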

4. whisper-large-v3

Whisper large-v3 has the same architecture as the earlier large and large-v2 models, except for the following minor differences:

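The spectrogram input uses 128 Mel frequency bins instead of 80
A new language token for Cantonese

The whisper-large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixed dataset.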

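whisper-large-v3 shows improved performance across a wide range of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2. It can be used with the pipeline class to transcribe audio of arbitrary length. The pretrained whisper-large-v3 model can be downloaded from a Hugging Face mirror site; then replace model_id in the code below with the local path of the downloaded model files.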

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
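
The model_id substitution mentioned earlier can be made automatic. This is a minimal sketch: resolve_model_id is a hypothetical helper, and it assumes that a usable local copy is a directory containing config.json, which is what from_pretrained expects of a local model directory.

```python
import os


def resolve_model_id(local_dir: str, hub_id: str = "openai/whisper-large-v3") -> str:
    """Prefer a local model directory when it looks usable, else the Hub id.

    Assumption: a directory with a config.json file is a downloaded model.
    """
    if os.path.isfile(os.path.join(local_dir, "config.json")):
        return local_dir
    return hub_id
```

For instance, model_id = resolve_model_id("ckpts/whisper-large-v3") would load from disk on an offline server and fall back to the Hub name elsewhere; the directory name is only a placeholder.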

To transcribe a local audio file, simply pass the path to the file when calling the pipeline:

result = pipe("audio.mp3")

Multiple audio files can be transcribed in parallel by passing them as a list and setting the batch_size parameter:

result = pipe(["audio_1.mp3", "audio_2.mp3"], batch_size=2)
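
When a list of files is passed, the pipeline is expected to return one result per input, in the same order. Under that assumption, the helper below (pair_transcripts, a name invented for this sketch) maps each file to its transcript and fails loudly if the counts do not match.

```python
from typing import Dict, Sequence


def pair_transcripts(paths: Sequence[str], results: Sequence[dict]) -> Dict[str, str]:
    """Map each input path to the text of its result.

    Assumes results[i] belongs to paths[i] and has a "text" key.
    """
    if len(paths) != len(results):
        raise ValueError(
            f"got {len(results)} results for {len(paths)} input files"
        )
    return {path: res["text"].strip() for path, res in zip(paths, results)}
```

With the call above this would be used as pair_transcripts(["audio_1.mp3", "audio_2.mp3"], result), giving a dictionary keyed by file name.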