Reliable Visualization for Deep Speaker Recognition - Speech Interpretability

This post looks at three visualization methods (Grad-CAM++, Score-CAM, and Layer-CAM) applied to speaker recognition and finds that Layer-CAM is the best at separating the target speaker from interfering speakers, especially in the multi-speaker experiments. The Layer-CAM saliency map produced at layer S2 is judged the most discriminative, and aggregating maps across layers improves performance further.


MOTIVATION OF READING: interpretability of speech tasks

Link: http://arxiv.org/abs/2204.03852

Code: http://project.cslt.org/


1. Overview

Motivation of the work:

It is unclear whether any of the visualization tools remain reliable when applied to speaker recognition, which makes the conclusions drawn from visualization not fully convincing.

Three CAM algorithms will be investigated: Grad-CAM++, Score-CAM and Layer-CAM. The main idea of these algorithms is to generate a saliency map by combining the activation maps (channels) of a convolutional layer.

2. Methodology

A class activation map (CAM) is a saliency map that shows the important regions used by the CNN to identify a particular class.
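All three algorithms compared below fit the same template, restated here for reference since the notes only name the methods. For a class $c$, the saliency map is a weighted combination of the activation maps of one convolutional layer:

$$M^c = \mathrm{ReLU}\!\Big(\sum_k w_k^c \, A^k\Big),$$

where $A^k$ is the $k$-th channel of the chosen layer and $w_k^c$ is a class-specific channel weight. The methods differ mainly in how $w_k^c$ (or an element-wise generalization of it) is computed.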

2.1 Grad-CAM and Grad-CAM++

Grad-CAM
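The notes do not restate the formula; as a reminder, standard Grad-CAM weights each channel by the spatially averaged gradient of the class score $y^c$:

$$w_k^c = \frac{1}{Z}\sum_{i,j} \frac{\partial y^c}{\partial A^k_{ij}}, \qquad M^c = \mathrm{ReLU}\!\Big(\sum_k w_k^c A^k\Big),$$

with $Z$ the number of spatial positions in the activation map.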

Grad-CAM++
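Grad-CAM++ replaces the uniform spatial average with location-dependent coefficients derived from higher-order gradients; the standard formulation (not spelled out in the original notes) is

$$w_k^c = \sum_{i,j} \alpha^{kc}_{ij}\,\mathrm{ReLU}\!\left(\frac{\partial y^c}{\partial A^k_{ij}}\right), \qquad
\alpha^{kc}_{ij} = \frac{\frac{\partial^2 y^c}{(\partial A^k_{ij})^2}}{2\,\frac{\partial^2 y^c}{(\partial A^k_{ij})^2} + \sum_{a,b} A^k_{ab}\,\frac{\partial^3 y^c}{(\partial A^k_{ij})^3}}.$$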

2.2 Score-CAM
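Score-CAM is gradient-free: each channel's weight is the increase in the class score obtained when the input is masked by that channel's upsampled and normalized activation map. A standard way to write it, with $X_b$ a baseline input and $\circ$ element-wise multiplication:

$$H^k = s\big(\mathrm{Up}(A^k)\big), \qquad
w_k^c = \operatorname{softmax}_k\!\big(f_c(X \circ H^k) - f_c(X_b)\big), \qquad
M^c = \mathrm{ReLU}\!\Big(\sum_k w_k^c A^k\Big),$$

where $\mathrm{Up}$ upsamples the map to the input size and $s(\cdot)$ rescales it to $[0,1]$.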

2.3 Layer-CAM
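Layer-CAM drops the channel-level weights entirely and weights each activation element by its own positive gradient, which is what allows it to produce meaningful maps at shallow layers:

$$M^c = \mathrm{ReLU}\!\Big(\sum_k w^{kc} \circ A^k\Big), \qquad
w^{kc}_{ij} = \mathrm{ReLU}\!\left(\frac{\partial y^c}{\partial A^k_{ij}}\right).$$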

3. Experiment

Speaker model

3.1 Single-speaker experiment

Grad-CAM++ and Score-CAM tend to regard all speech segments as important, while Layer-CAM produces more selective and localized patterns.

It shows that the three CAM algorithms indeed find salient regions. For example, in the insertion experiment, the curves of the CAM algorithms are clearly much higher than that of random masking, indicating that the regions exposed earlier by the CAMs are indeed more important than random regions.
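A minimal sketch of an insertion-style evaluation like the one described above, assuming the saliency map and the spectrogram share the same time-frequency shape; `score_fn` (spectrogram -> target-speaker score), the bin-level granularity, and the step count are illustrative assumptions rather than details taken from the paper:

```python
import numpy as np

def insertion_curve(spec, saliency, score_fn, n_steps=50):
    """Progressively reveal the most salient time-frequency bins of `spec`
    (starting from an all-zero input) and record the model score at each step."""
    order = np.argsort(saliency.ravel())[::-1]       # most salient bins first
    revealed = np.zeros_like(spec)
    scores = []
    step = max(1, order.size // n_steps)
    for start in range(0, order.size, step):
        idx = np.unravel_index(order[start:start + step], spec.shape)
        revealed[idx] = spec[idx]                    # insert the next chunk of bins
        scores.append(score_fn(revealed))
    return np.array(scores)

def insertion_auc(scores):
    """Area under the insertion curve; higher means earlier-revealed bins matter more."""
    return np.trapz(scores, dx=1.0 / (len(scores) - 1))
```

Comparing this AUC against the AUC obtained from a randomly ordered reveal reproduces the CAM-versus-random-masking comparison described above.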

3.2 Multi-speaker experiment

In the multi-speaker experiment, we concatenate an utterance of the target speaker with one or two utterances of other interfering speakers, and draw the saliency map.

A denotes the target speaker while B denotes the interfering speaker.

Layer-CAM shows surprisingly good performance: it can accurately locate the segments of the target speaker and mask non-target speakers almost perfectly. In comparison, Grad-CAM++ and Score-CAM are very weak at detecting and masking out the non-target speakers.

It can be seen that Layer-CAM achieves much better AUCs than the other two CAMs.

3.3 Localization and recognition

Since Layer-CAM can localize target speakers, we can use it as a tool to perform localization and recognition.

First identify where the target speaker resides, then perform speaker recognition on the located segments only; the assumption is that this works better than using the entire utterance.
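A minimal sketch of this localize-then-recognize idea, assuming a time-frequency Layer-CAM map and a cosine-scoring speaker verification backend; `embed_fn`, the frequency-averaged frame score, and the 0.5 threshold are illustrative assumptions rather than the paper's exact recipe:

```python
import numpy as np

def select_target_frames(saliency, threshold=0.5):
    """Collapse a time-frequency saliency map into a per-frame score and keep
    frames whose min-max normalized score exceeds `threshold`."""
    frame_score = saliency.mean(axis=0)                  # average over frequency bins
    lo, hi = frame_score.min(), frame_score.max()
    frame_score = (frame_score - lo) / (hi - lo + 1e-8)  # normalize to [0, 1]
    return frame_score > threshold                       # boolean mask over frames

def localize_then_recognize(spec, saliency, embed_fn, enroll_emb, threshold=0.5):
    """Embed only the located frames and score them against the enrollment
    embedding with cosine similarity (backend choice is an assumption)."""
    mask = select_target_frames(saliency, threshold)
    test_emb = embed_fn(spec[:, mask])                   # embed the kept frames only
    return float(np.dot(test_emb, enroll_emb) /
                 (np.linalg.norm(test_emb) * np.linalg.norm(enroll_emb)))
```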

OBSERVATION:

1. Layer-CAM, in contrast to the other two CAMs, delivers a remarkable performance improvement, and this holds for the saliency maps at all layers.

2. Although saliency maps at all layers produced by Layer-CAM are informative, the one from S2 seems the most discriminative. One possibility is that the saliency map of S2 is more conservative and retains more regions when compared to the ones obtained from higher layers.

3. We find that for Layer-CAM, aggregating saliency maps from different layers can further improve performance (a minimal aggregation sketch follows below).
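A minimal sketch of cross-layer aggregation, assuming each layer's map has already been resized to a common time-frequency shape; the min-max normalization and element-wise maximum are only one plausible fusion choice, since the paper's exact rule is not restated in these notes:

```python
import numpy as np

def normalize(m):
    """Min-max normalize a saliency map to [0, 1]."""
    return (m - m.min()) / (m.max() - m.min() + 1e-8)

def aggregate_layer_cams(maps):
    """Fuse saliency maps from several layers (already at a common shape)
    by taking the element-wise maximum of the normalized maps; an element-wise
    mean is an equally plausible alternative."""
    return np.maximum.reduce([normalize(m) for m in maps])
```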
