Paper Reading: A Benchmark for Interpretability Methods in Deep Neural Networks

The study finds that many feature importance estimation methods for deep learning models are no more accurate than a random assignment. ROAR (RemOve And Retrain) evaluates their accuracy by retraining the model after the important features have been removed. Experiments show that only ensemble-based methods such as VarGrad and SmoothGrad-Squared outperform both the random assignment and the base methods across multiple datasets. However, not all ensemble methods improve performance, Classic SmoothGrad being one example. The work underscores the importance of retraining the model when evaluating feature importance.


https://proceedings.neurips.cc/paper/2019/hash/fe4b8556000d0f0cae99daa5c5c5a410-Abstract.html
A Benchmark for Interpretability Methods in Deep Neural Networks
Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim

0 Abstract

  • Proposes an empirical measure of the approximate accuracy of feature importance estimates in deep neural networks.
  • Results on several large-scale image classification datasets show that many popular interpretability methods produce feature importance estimates that are no better than a random assignment of importance.
  • Only certain ensemble-based methods, VarGrad and SmoothGrad-Squared, outperform such a random assignment.
  • How the ensembling is done also matters: some ensemble methods do no better than the underlying base method while incurring a higher computational burden.

1 Introduction

A very interesting problem in machine learning is assessing how much each input feature influences a model's predictions. Knowing which features matter helps improve models, build trust in them, and avoid undesirable behavior. However, feature importance has no ground truth, and we do not know which of the many methods for estimating it should be preferred. We therefore need a framework to empirically validate the relative merits and reliability of these methods.

A common strategy is to remove informative features from the input and observe how the prediction performance of the trained classifier degrades. However, this changes the sample distribution without retraining the model, so it is unclear whether the degradation is caused by the shift in data distribution or because the removed features really were informative.

The authors ran an experiment training a ResNet-50 on ImageNet and found that, after removing 90% of the pixel information, a retrained model still achieves 63.53% accuracy, whereas accuracy on the clean data is only 76.68%. This suggests that, without retraining, a drop in model performance is very likely caused by the shift in data distribution rather than by the loss of informative features.

Therefore, in this work, the authors retrain the model after the features deemed important have been removed, and evaluate each interpretability method by how much the retrained model's accuracy degrades. They call this approach ROAR (RemOve And Retrain):

  • For each feature importance estimator, remove the pixels the estimator ranks as most important, replacing them with a fixed uninformative value (see the sketch after this list).
  • Note that this modification is applied to both the training and test sets, so that the training and test data come from similar distributions.
  • Then retrain a model on the modified datasets and measure its test accuracy.
  • If removing the pixels an estimator deems important causes a significant drop in the retrained model's accuracy, that feature importance estimator is considered relatively accurate.
  • The authors also compare each method against a random assignment of feature importance and a Sobel edge filter.
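To make the loop concrete, here is a minimal, runnable sketch of the ROAR procedure on a tiny synthetic problem. It uses scikit-learn's `LogisticRegression` and absolute coefficient magnitudes as a stand-in importance estimator, purely for illustration; the paper itself uses ResNet-50 and gradient-based estimators:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Tiny stand-in for an image dataset: 64 "pixels", 8 of them informative.
n, d = 2000, 64
x = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:8] = 3.0
y = (x @ w_true + rng.normal(size=n) > 0).astype(int)
x_tr, x_te, y_tr, y_te = x[:1500], x[1500:], y[:1500], y[1500:]

def accuracy(model, xs, ys):
    return (model.predict(xs) == ys).mean()

# 1) Train on clean data and compute an importance estimate once.
base = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
importance = np.abs(base.coef_).ravel()   # stand-in estimator
order = np.argsort(importance)[::-1]      # most important first

for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    drop = order[:int(t * d)]             # top-t fraction to remove
    # 2) Replace removed features with a fixed value in BOTH splits.
    xtr, xte = x_tr.copy(), x_te.copy()
    mean = x_tr.mean(axis=0)
    xtr[:, drop] = mean[drop]
    xte[:, drop] = mean[drop]
    # 3) Retrain from scratch on the modified data and evaluate.
    retrained = LogisticRegression(max_iter=1000).fit(xtr, y_tr)
    print(f"t={t:.0%}  retrained accuracy={accuracy(retrained, xte, y_te):.3f}")
```

A good estimator should make the printed accuracy fall quickly as t grows; a poor one should leave it close to the random baseline.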

The authors apply ROAR to three datasets, ImageNet, Food 101, and Birdsnap, and find:

  • Training performance is quite robust to removing input features. For example, after randomly replacing 90% of all ImageNet input features, we can still train a model that achieves 63.53% ± 0.13 accuracy (average across 5 independent runs). This implies that a small subset of features is sufficient for the actual decision making. Our observation is consistent across datasets.
  • The base methods we evaluate are no better than, or on par with, a random estimate at finding the core set of informative features. However, we show that SmoothGrad-Squared (an unpublished variant of Classic SmoothGrad) and VarGrad, methods which ensemble a set of estimates produced by basic methods, far outperform both the underlying method and a random guess. These results are consistent across datasets and methods.
  • Not all ensemble estimators improve performance. Classic SmoothGrad is worse than a single estimate despite being more computationally intensive.

2 Related Work

Explanation methods for interpreting neural networks:

  • distill or constrain a model into a functional form that is considered more interpretable
  • explore the role of neurons or activations in hidden layers of the network
  • use high-level concepts to explain prediction results
  • input feature importance estimators

To the best of our knowledge, unlike prior modification-based evaluation measures, our benchmark requires retraining the model from random initialization on the modified dataset rather than re-scoring the modified image at inference time. Without this step, we argue that one cannot determine whether the model's degradation in performance is due to artifacts introduced by the value used to replace the pixels that are removed, or due to the approximate accuracy of the estimator.

3 ROAR: Remove And Retrain

To evaluate different feature importance estimators with ROAR, we rank the input features of each sample by estimated importance; writing a sample's importance estimate as $e$, the sorted estimate is $\{e^o_i\}_{i=1}^N$. We then pick a modification fraction $t \in \{0, 10, \ldots, 100\}$ (in percent) and replace the top $t\%$ of pixels, in importance order, with the per-channel mean pixel value of the original images. Since retraining introduces some variation between models, each configuration is trained 5 times independently to reduce the variance (a masking sketch follows this paragraph).
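A minimal numpy sketch of the replacement step, assuming images in (N, H, W, C) layout and one importance score per pixel; reading the text above, the replacement value here is each image's per-channel mean:

```python
import numpy as np

def mask_top_fraction(images, importance, t):
    """Replace the top-t fraction of pixels, ranked by `importance`,
    with each image's per-channel mean value.

    images:     float array, shape (N, H, W, C)
    importance: per-pixel scores, shape (N, H, W); higher = more important
    t:          fraction of pixels to remove, in [0, 1]
    """
    out = images.copy()
    n, h, w, _ = images.shape
    k = int(round(t * h * w))                # pixels to replace per image
    if k == 0:
        return out
    flat = importance.reshape(n, -1)
    top = np.argsort(flat, axis=1)[:, -k:]   # k most important pixels
    channel_mean = images.mean(axis=(1, 2))  # (N, C) per-channel means
    for i in range(n):
        rows, cols = np.unravel_index(top[i], (h, w))
        out[i, rows, cols, :] = channel_mean[i]
    return out
```

The same function is applied to both the train and test splits, so that the retrained model sees the same modified distribution it will be evaluated on.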

Retraining has limitations: although the model architecture stays the same, the feature importance estimates are computed on the original model, while the estimator is evaluated on the retrained model. To understand why ROAR is still meaningful, consider the following two scenarios:

  1. We remove input dimensions and the accuracy drops. In this case, it is very likely that the removed inputs were informative to the original model. ROAR thus gives a good indication that the importance estimate is of high quality.
  2. We remove inputs and the accuracy does not drop. This can be explained as either:
    (a) It could be caused by removal of an input that was uninformative to the model. This includes the case where the input might have been informative but not in a way that is useful to the model, for example, when a linear model is used and the relation between the feature and the output is non-linear. Since in such a case the information was not used by the model, it does not show in ROAR, and we can assume ROAR behaves as intended.
    (b) There might be redundancy in the inputs. The same information could be represented in another feature. This behavior can be detected with ROAR, as we will show in our toy data experiment.

The authors also compare ROAR against the non-retraining approach on a synthetic dataset:

  • Generate a synthetic dataset: $x = \frac{\vec{a}z}{10} + \vec{d}\eta + \frac{\epsilon}{10}$, $y = (z > 0)$, where $\vec{a}$ and $\vec{d}$ are both 16-dimensional vectors and only the first four entries of $\vec{a}$ are non-zero, guaranteeing that there are exactly four informative features (a generation sketch follows this list).
  • Compare three rankings: the ground truth importance ranking, a random importance ranking, and the inverted ground truth importance ranking.
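A sketch of how such data could be generated with numpy. The specific distributions of $z$, $\eta$, and $\epsilon$ are our assumptions; the note above only fixes the functional form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 16

a = np.zeros(d)
a[:4] = rng.normal(size=4)     # only the first 4 entries are non-zero
dvec = rng.normal(size=d)      # distractor direction (the d-vector)

z = rng.normal(size=(n, 1))    # latent signal that defines the label
eta = rng.normal(size=(n, 1))  # latent distractor
eps = rng.normal(size=(n, d))  # per-feature noise

x = (a * z) / 10 + dvec * eta + eps / 10
y = (z > 0).astype(int).ravel()  # label depends only on z
```

Because the label depends only on $z$, the ground truth ranking puts the first four features on top, and ROAR should separate the three rankings cleanly.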

4 Large scale experiments

4.1 Estimators under consideration

4.1.1 Base estimators

Base estimators are estimators that compute a single estimate of importance (as opposed to ensemble methods); a sketch of the simplest one, GRAD, follows the list.

  • Gradients or Sensitivity heatmaps (GRAD)
  • Guided Backprop (GB)
  • Integrated Gradients (IG)
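A hedged PyTorch sketch of vanilla gradients (GRAD). How gradients are reduced across color channels is an assumption, since conventions differ between implementations:

```python
import torch

def grad_saliency(model, x, target):
    """Vanilla gradient (GRAD) saliency: gradient of the target-class
    logit with respect to the input, reduced over channels.

    x:      input batch, shape (N, C, H, W)
    target: class indices, shape (N,)
    returns per-pixel importance, shape (N, H, W)
    """
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)
    # Summing the target logits lets one backward pass cover the batch.
    logits.gather(1, target.view(-1, 1)).sum().backward()
    return x.grad.abs().amax(dim=1)  # max |gradient| across channels
```

Guided Backprop and Integrated Gradients expose the same interface but change how the gradient signal is computed.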

4.1.2 Ensembling methods

In addition to the base approaches, we also evaluate three ensembling methods for feature importance. For all the ensemble approaches that we describe below (SG, SG-SQ, Var), we average over a set of 15 estimates, as suggested in the original SmoothGrad publication. A sketch of all three follows the list.

  • Classic SmoothGrad (SG)
  • SmoothGrad-Squared (SG-SQ)
  • VarGrad (Var)
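A sketch of the three ensembles wrapped around a gradient estimator, following the SmoothGrad recipe of adding Gaussian noise to the input. The noise scale `sigma` and the per-pixel reductions are assumptions, not the authors' exact settings:

```python
import torch

def ensemble_saliency(model, x, target, mode="sg_sq",
                      n_samples=15, sigma=0.15):
    """SmoothGrad-family ensembles over noisy copies of the input.

    mode: "sg"    -> mean of gradients         (Classic SmoothGrad)
          "sg_sq" -> mean of squared gradients (SmoothGrad-Squared)
          "var"   -> variance of gradients     (VarGrad)
    """
    grads = []
    for _ in range(n_samples):
        noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
        logits = model(noisy)
        logits.gather(1, target.view(-1, 1)).sum().backward()
        grads.append(noisy.grad)
    g = torch.stack(grads)         # (n_samples, N, C, H, W)
    if mode == "sg":
        return g.mean(dim=0)
    if mode == "sg_sq":
        return (g ** 2).mean(dim=0)
    return g.var(dim=0)            # mode == "var"
```

Note that the three modes differ only in the final reduction, which is exactly why the paper's finding is striking: squaring or taking the variance changes the ROAR outcome dramatically, while plain averaging (SG) does not help.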

4.1.3 Control Variants

As a control, we compare each estimator to two rankings (a random assignment of importance and a Sobel edge filter) that do not depend at all on the model parameters. These controls represent a lower bound in performance that we would expect all interpretability methods to outperform. A sketch of both controls follows the list.

  • Random: A random estimator $g^R$ assigns a random binary importance $e_i \in \{0, 1\}$. This amounts to a binary vector $e \sim \mathrm{Bernoulli}(1-t)$, where $(1-t)$ is the probability that $e_i = 1$. The formulation of $g^R$ depends on neither the model parameters nor the input image (beyond the number of pixels in the image).
  • Sobel Edge Filter: convolves a hard-coded, separable, integer filter over an image to produce a mask of derivatives that emphasizes the edges in an image. A Sobel mask treated as a ranking $e$ will assign a high score to areas of the image with a high gradient (likely edges).
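Both controls are easy to write down. The sketch below uses `scipy.ndimage.sobel` for the edge filter and i.i.d. uniform scores for the random control; taking the top $t\%$ of uniform scores selects a uniformly random pixel subset, matching the Bernoulli formulation in effect:

```python
import numpy as np
from scipy import ndimage

def random_ranking(shape, seed=0):
    """Random control: scores are independent of the model and image."""
    return np.random.default_rng(seed).random(shape)

def sobel_ranking(image):
    """Sobel control: high scores along edges. `image` is (H, W, C)."""
    gray = image.mean(axis=-1)        # average over color channels
    gx = ndimage.sobel(gray, axis=0)  # derivative along rows
    gy = ndimage.sobel(gray, axis=1)  # derivative along columns
    return np.hypot(gx, gy)           # gradient magnitude
```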

4.2 Experimental setup

  • use a ResNet-50 model for both generating the feature importance estimates and subsequent training on the modified inputs.
  • evaluate ROAR on three open source image datasets: ImageNet, Birdsnap and Food 101.
  • For each dataset and estimator, we generate new train and test sets that each correspond to a different fraction of feature modification t = [0, 10, 30, 50, 70, 90].
  • We evaluate 18 estimators in total (this includes the base estimators, a set of ensemble approaches wrapped around each base and finally a set of squared estimates).
  • In total, we generate 540 large-scale modified image datasets in order to consider all experiment variants (180 new test/train for each original dataset).
  • We independently train 5 ResNet-50 models from random initialization on each of these modified datasets and report test accuracy as the average of these 5 runs.

4.3 Experimental results

4.3.1 Evaluating the random ranking

  • Goal: to answer the question: is the estimate of importance more accurate than a random guess?
  • results:
    • model performance is remarkably robust to random modification: after replacing a large portion of all inputs with a constant value, the model not only trains but still retains most of the original predictive power
    • on ImageNet, when only 10% of all features are retained, the trained model still attains 63.53% accuracy (relative to unmodified baseline of 76.68%).
      • suggests a case where many inputs are likely redundant.
      • provides additional support for the need to re-train

4.3.2 Evaluating Base Estimators

  • the left inset of Fig. 4 shows that these estimators consistently perform worse than the random assignment of feature importance across all datasets and for all thresholds t = [0.1, 0.3, 0.5, 0.7, 0.9].
  • our estimators fall further behind the accuracy of a random guess as a larger fraction t of inputs is modified. The gap is widest when t = 0.9.
  • base estimators also do not compare favorably to the performance of a Sobel edge filter (SOBEL).
  • Base estimators perform within a very narrow range
  • Comparing the performance of the base estimators under ROAR retraining vs. without retraining: the base estimators appear to be working when we do not retrain, but they are clearly not better than the random baseline when evaluated using ROAR. This provides additional support for the need to re-train.

4.3.3 Evaluating Ensemble Approaches

  • Classic SmoothGrad: despite averaging over many samples, removing the features SG ranks as important still degrades test-set accuracy less than removing a random subset, i.e., SG performs worse than a random guess. In addition, for GRAD and IG, SmoothGrad performs worse than a single estimate.
  • SmoothGrad-Squared and VarGrad:both VarGrad (VAR) and SmoothGrad-Squared (SG-SQ) far outperform the two control variants. However, the overall ranking of estimator performance differs by dataset, so the choice of the best underlying estimator may vary by task.

5 Conclusion and Future Work

  • propose ROAR to evaluate the quality of input feature importance estimators.
  • the commonly used base estimators, Gradients, Integrated Gradients, and Guided BackProp, are worse than or on par with a random assignment of importance.
  • certain ensemble approaches such as SmoothGrad are far more computationally intensive but do not improve upon a single estimate (and in some cases are worse)
  • VarGrad and SmoothGrad-Squared strongly improve the quality of these methods and far outperform a random guess
  • While we venture some initial consideration of why certain ensemble methods far outperform other estimators, the divergence in performance between the ensemble estimators is an important direction for future research.