Chinese Speaker Recognition with Pyannote

alpha-soso

Last modified 2024-02-06 18:26:00

Category: Speech Recognition. Tags: speech recognition, artificial intelligence

First published 2024-01-31 18:08:35

Article link: https://blog.youkuaiyun.com/mr_lio/article/details/135959266

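Voiceprint recognition, also known as speaker recognition, is a biometric technique. It mainly consists of two parts: speech segmentation and voiceprint similarity comparison. The sound signal is converted into an electrical signal, which is then identified using AI techniques. Different tasks and applications call for different voiceprint recognition techniques, for example meeting minutes, live text coverage, and criminal investigation.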

Environment setup

pyannote-audio is an open-source Python toolkit for speaker diarization. Its GitHub repository is https://github.com/pyannote/pyannote-audio

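I use pyannote-audio 3.1.0 + wespeaker-voxceleb-resnet34-LM + segmentation-3.0

The project is under active development, so pick the models and environment that match your own situation.

The dependency requirements are as follows: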

asteroid-filterbanks >=0.4
einops >=0.6.0
huggingface_hub >= 0.13.0
lightning >= 2.0.1
omegaconf >=2.1,<3.0
pyannote.core >= 5.0.0
pyannote.database >= 5.0.1
pyannote.metrics >= 3.2
pyannote.pipeline >= 3.0.1
pytorch_metric_learning >= 2.1.0
rich >= 12.0.0
semver >= 3.0.0
soundfile >= 0.12.1
speechbrain >= 0.5.14
tensorboardX >= 2.6
torch >= 2.0.0
torch_audiomentations >= 0.11.0
torchaudio >= 2.0.0
torchmetrics >= 0.11.0

I use Python 3.8. To set up the environment I actually ran only the one command below.

# The other dependencies are downloaded automatically
pip install pyannote.audio==3.1.0

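Audio data preparation

The speech we need to analyse day to day usually comes from video, and speech models place requirements on their input, for example a 16000 Hz sample rate and a single (mono) channel.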
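
A quick way to confirm that a file meets these requirements is to read its header. The sketch below uses only Python's standard `wave` module; the helper name `check_wav` is my own, and it assumes the file is an uncompressed PCM WAV.

```python
import wave

def check_wav(path, expected_rate=16000, expected_channels=1):
    """Return (ok, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    # The file is usable only if both properties match what the model expects
    ok = rate == expected_rate and channels == expected_channels
    return ok, rate, channels
```

Calling `check_wav("some_file.wav")` before running inference makes a wrong sample rate or a stereo file show up early instead of as a confusing model result.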

The video is Peppa Pig, season 1, episode 9, downloaded from YouTube.

Use ffmpeg to extract and process the audio from the video file.

# Extract the audio track (first 300 seconds) at a 16000 Hz sample rate, mono
ffmpeg -y -i demo.mp4 -vn -ss 00:00:00 -t 300 -ar 16000 -ac 1 -f wav audio_all.wav

# Extract a 5-second clip of Peppa's voice
ffmpeg -y -i demo.mp4 -vn -ss 00:00:02 -t 5 -ar 16000 -ac 1 -f wav audio_peppa_1.wav
# Extract a second clip of Peppa's voice, 3 seconds long
ffmpeg -y -i demo.mp4 -vn -ss 00:00:07 -t 3 -ar 16000 -ac 1 -f wav audio_peppa_2.wav

# Extract a 3-second clip of the narrator's voice
ffmpeg -y -i demo.mp4 -vn -ss 00:00:23 -t 3 -ar 16000 -ac 1 -f wav audio_aside.wav
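
When there are many clips to cut, typing each command by hand invites mistakes in file names and durations. Here is a minimal sketch that builds the same ffmpeg argument list from Python. The helper names `build_extract_cmd` and `extract_clip` are hypothetical, and `extract_clip` assumes `ffmpeg` is on the PATH.

```python
import subprocess

def build_extract_cmd(src, dst, start, duration, rate=16000, channels=1):
    """Build the ffmpeg argument list for cutting a WAV clip from a video."""
    return [
        "ffmpeg", "-y",          # overwrite the output without asking
        "-i", src,               # input video
        "-vn",                   # drop the video stream
        "-ss", start,            # start offset, e.g. "00:00:02"
        "-t", str(duration),     # clip length in seconds
        "-ar", str(rate),        # sample rate
        "-ac", str(channels),    # number of channels
        "-f", "wav",
        dst,
    ]

def extract_clip(src, dst, start, duration):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_extract_cmd(src, dst, start, duration), check=True)
```

For example, `extract_clip("demo.mp4", "audio_peppa_1.wav", "00:00:02", 5)` should be equivalent to the first Peppa command above.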

If ffmpeg is not installed on the server, it can be installed with the following commands.

# ubuntu
sudo apt install ffmpeg
# centos
sudo yum install ffmpeg

Embedding

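Each audio file is vectorised, that is, converted into a single D-dimensional embedding vector. The distance between two embeddings then describes how similar the two recordings (voiceprints) are, and this is what identifies the speaker.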

from pyannote.audio import Model
# Load the pretrained speaker-embedding model
model = Model.from_pretrained("pyannote/wespeaker-voxceleb-resnet34-LM")

from pyannote.audio import Inference
# Convert each audio file, taken as a whole, into one embedding array
inference = Inference(model, window="whole")
embedding_1 = inference("audio_peppa_1.wav")
embedding_2 = inference("audio_peppa_2.wav")
embedding_3 = inference("audio_aside.wav")

import numpy as np
from scipy.spatial.distance import cdist
# cdist needs 2-D inputs; np.atleast_2d leaves an array that is already (1 x D) unchanged
e1, e2, e3 = (np.atleast_2d(e) for e in (embedding_1, embedding_2, embedding_3))
# Cosine distance: the smaller the value, the more similar the voices
distance_sim = cdist(e1, e2, metric="cosine")[0, 0]
distance_not = cdist(e1, e3, metric="cosine")[0, 0]

print("Distance between audio_peppa_1 and audio_peppa_2: " + str(distance_sim))
print("Distance between audio_peppa_1 and audio_aside: " + str(distance_not))

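Comparing several recordings this way shows clearly that different voices have a much larger voiceprint distance.
pyannote has a built-in way to convert distance into similarity, so a threshold (0.25 by default) can be used to decide directly whether two recordings come from the same speaker.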
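
To make the distance and the threshold decision concrete, here is a hand-rolled sketch in plain Python. It is not pyannote's built-in conversion. The function names are my own, and the 0.25 used in the usage note is simply the figure quoted above; a suitable value for a given model and dataset is an assumption that needs tuning.

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def same_speaker(a, b, threshold):
    """Treat two embeddings as the same speaker when the distance is below the threshold."""
    return cosine_distance(a, b) < threshold
```

With real embeddings you would pass flat vectors, for example `same_speaker(embedding_1.ravel(), embedding_2.ravel(), 0.25)`.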

Segmentation

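For the voiceprint comparison above, we listened to the audio, split it by hand into several wav files, and then compared the files with one another.

In real applications a single recording contains multi-person dialogue, people talking at the same time, and stretches of silence, so a second model is needed to split a long recording into multiple segments.
[Figure: a multi-speaker conversation scenario]
Here segmentation-3.0 is used to process the audio. Note that this model only works with pyannote.audio 3.0 and later.

Detecting the active and overlapped speech regions in an audio file requires setting minimum durations for speech and non-speech regions.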

from pyannote.audio import Model
# Load the segmentation model from a local checkpoint
model = Model.from_pretrained("/home/pyannote-audio/tools/segmentation/pytorch_model.bin")

# Voice activity detection
from pyannote.audio.pipelines import VoiceActivityDetection
pipeline = VoiceActivityDetection(segmentation=model)

# Overlapped speech detection
# from pyannote.audio.pipelines import OverlappedSpeechDetection
# pipeline = OverlappedSpeechDetection(segmentation=model)

# Set the hyper-parameters
HYPER_PARAMETERS = {
  # remove speech regions shorter than that many seconds.
  "min_duration_on": 0.0,
  # fill non-speech regions shorter than that many seconds.
  "min_duration_off": 0.0
}
pipeline.instantiate(HYPER_PARAMETERS)
# Voice activity detection
vad = pipeline("audio_all.wav")
# Overlapped speech detection
# osd = pipeline("audio_all.wav")

[Figure: voice activity detection result]
[Figure: overlapped speech detection result]
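
The effect of the two duration parameters is easier to see on a toy example. The sketch below is only an illustration on plain `(start, end)` tuples, not pyannote's own implementation; the function name `postprocess` is hypothetical. It fills short non-speech gaps first and then drops short speech regions.

```python
def postprocess(regions, min_duration_on=0.0, min_duration_off=0.0):
    """regions: list of (start, end) speech regions in seconds, sorted by start."""
    # Step 1: fill non-speech gaps shorter than min_duration_off
    # by merging each region into the previous one.
    merged = []
    for start, end in regions:
        if merged and start - merged[-1][1] < min_duration_off:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Step 2: remove speech regions shorter than min_duration_on.
    return [(s, e) for s, e in merged if e - s >= min_duration_on]
```

With both parameters left at 0.0, as in the settings above, non-overlapping regions come back unchanged. Raising `min_duration_off` joins speech separated by brief pauses, and raising `min_duration_on` discards very short bursts.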

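Note that here I no longer pull the model from Hugging Face when calling it. Instead I download it manually into a local directory and load it from there. This is easier to manage, and it also works for anyone who cannot access Hugging Face directly.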

Diarization

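Combining the two approaches above finally gives us voiceprint recognition, i.e. speaker recognition. This part uses both of the models above at the same time. A code example follows.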

from pyannote.audio import Pipeline
# Load the pipeline from the config file
pipeline = Pipeline.from_pretrained("config.yaml")

# Run on the GPU when one is available
import torch
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

# Speaker diarization
diarization = pipeline("audio_all.wav")

# Print the results
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
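
For something like meeting minutes, a per-speaker summary is often more useful than the raw list of turns. Here is a small sketch that totals the speaking time per speaker. The helper name `speaking_time` is my own, and it works on plain `(start, end, speaker)` tuples so it does not depend on pyannote.

```python
from collections import defaultdict

def speaking_time(turns):
    """turns: iterable of (start, end, speaker) tuples, times in seconds."""
    totals = defaultdict(float)
    for start, end, speaker in turns:
        # Add the duration of this turn to the speaker's running total
        totals[speaker] += end - start
    return dict(totals)
```

The tuples can be collected from the loop above, for example `[(turn.start, turn.end, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)]`.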

The config file simply holds the information about the two models above. In detail:

version: 3.1.0

pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: AgglomerativeClustering
    embedding: /home/pyannote-audio/tools/embedding/pytorch_model.bin
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    segmentation: /home/pyannote-audio/tools/segmentation/pytorch_model.bin
    segmentation_batch_size: 32

params:
  clustering:
    method: centroid
    min_cluster_size: 12
    threshold: 0.7045654963945799
  segmentation:
    min_duration_off: 0.0