Paper Reading: A Benchmark for Interpretability Methods in Deep Neural Networks

The study finds that many feature importance estimation methods for deep learning models are no more accurate than a random assignment. ROAR (RemOve And Retrain) evaluates their accuracy by retraining the model after the important features have been removed. Experiments show that only ensemble-based methods such as VarGrad and SmoothGrad-Squared outperform both the random assignment and the base methods across multiple datasets. However, not all ensemble methods improve performance, Classic SmoothGrad being one example. The work underscores the importance of retraining the model when evaluating feature importance.


https://proceedings.neurips.cc/paper/2019/hash/fe4b8556000d0f0cae99daa5c5c5a410-Abstract.html
A Benchmark for Interpretability Methods in Deep Neural Networks
Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim

0 Abstract

  • Proposes an empirical measure of the approximate accuracy of feature importance estimates in deep neural networks.
  • Results on several large-scale image classification datasets show that many popular interpretability methods produce feature importance estimates that are no better than a random assignment of importance.
  • Only certain ensemble-based methods, VarGrad and SmoothGrad-Squared, outperform such a random assignment.
  • How the ensembling is done also matters: some ensemble methods do no better than the underlying base method while incurring a higher computational burden.

1 Introduction

A very interesting problem in machine learning is assessing how much each input feature influences a model's predictions. Knowing which features matter helps improve models, build trust in them, and avoid undesirable behavior. However, feature importance has no ground truth, and we do not know which of the many methods for estimating it should be preferred. We therefore need a framework to empirically validate the relative merits and reliability of these methods.

A common strategy is to remove informative features from the input and observe how the prediction performance of the trained classifier degrades. However, this changes the sample distribution without retraining the model, so it is unclear whether the degradation is caused by the shift in data distribution or because the removed features really were informative.

The authors ran an experiment training a ResNet-50 on ImageNet and found that, after removing 90% of the pixel information, a retrained model still achieves 63.53% accuracy, whereas accuracy on the clean data is only 76.68%. This suggests that, without retraining, a drop in model performance is very likely caused by the shift in data distribution rather than by the loss of informative features.

Therefore, in this work, the authors retrain the model after the features deemed important have been removed, and evaluate each interpretability method by how much the retrained model's accuracy degrades. They call this approach ROAR (RemOve And Retrain):

  • For each feature importance estimator, remove the pixels the estimator ranks as most important, replacing them with a fixed uninformative value (see the sketch after this list).
  • Note that this modification is applied to both the training and test sets, so that the training and test data come from similar distributions.
  • Then retrain a model on the modified datasets and measure its test accuracy.
  • If removing the pixels an estimator deems important causes a significant drop in the retrained model's accuracy, that feature importance estimator is considered relatively accurate.
  • The authors also compare each method against a random assignment of feature importance and a Sobel edge filter.
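To make the loop concrete, here is a minimal, runnable sketch of the ROAR procedure on a tiny synthetic problem. It uses scikit-learn's `LogisticRegression` and absolute coefficient magnitudes as a stand-in importance estimator, purely for illustration; the paper itself uses ResNet-50 and gradient-based estimators:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Tiny stand-in for an image dataset: 64 "pixels", 8 of them informative.
n, d = 2000, 64
x = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:8] = 3.0
y = (x @ w_true + rng.normal(size=n) > 0).astype(int)
x_tr, x_te, y_tr, y_te = x[:1500], x[1500:], y[:1500], y[1500:]

def accuracy(model, xs, ys):
    return (model.predict(xs) == ys).mean()

# 1) Train on clean data and compute an importance estimate once.
base = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
importance = np.abs(base.coef_).ravel()   # stand-in estimator
order = np.argsort(importance)[::-1]      # most important first

for t in (0.1, 0.3, 0.5, 0.7, 0.9):
    drop = order[:int(t * d)]             # top-t fraction to remove
    # 2) Replace removed features with a fixed value in BOTH splits.
    xtr, xte = x_tr.copy(), x_te.copy()
    mean = x_tr.mean(axis=0)
    xtr[:, drop] = mean[drop]
    xte[:, drop] = mean[drop]
    # 3) Retrain from scratch on the modified data and evaluate.
    retrained = LogisticRegression(max_iter=1000).fit(xtr, y_tr)
    print(f"t={t:.0%}  retrained accuracy={accuracy(retrained, xte, y_te):.3f}")
```

A good estimator should make the printed accuracy fall quickly as t grows; a poor one should leave it close to the random baseline.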

The authors apply ROAR to three datasets, ImageNet, Food 101, and Birdsnap, and find:

  • Training performance is quite robust to removing input features. For example, after randomly replacing 90% of all ImageNet input features, we can still train a model that achieves 63.53% ± 0.13 accuracy (average across 5 independent runs). This implies that a small subset of features is sufficient for the actual decision making. Our observation is consistent across datasets.
  • The base methods we evaluate are no better than, or on par with, a random estimate at finding the core set of informative features. However, we show that SmoothGrad-Squared (an unpublished variant of Classic SmoothGrad) and VarGrad, methods which ensemble a set of estimates produced by basic methods, far outperform both the underlying method and a random guess. These results are consistent across datasets and methods.
  • Not all ensemble estimators improve performance. Classic SmoothGrad is worse than a single estimate despite being more computationally intensive.

2 Related Work

Explanation methods for interpreting neural networks:

  • distill or constrain a model into a functional form that is considered more interpretable
  • explore the role of neurons or activations in hidden layers of the network
  • use high-level concepts to explain prediction results
  • input feature importance estimators

To the best of our knowledge, unlike prior modification-based evaluation measures, our benchmark requires retraining the model from random initialization on the modified dataset rather than re-scoring the modified image at inference time. Without this step, we argue that one cannot determine whether the model's degradation in performance is due to artifacts introduced by the value used to replace the pixels that are removed, or due to the approximate accuracy of the estimator.

3 ROAR: Remove And Retrain

To evaluate different feature importance estimators with ROAR, we rank the input features of each sample by estimated importance; writing a sample's importance estimate as $e$, the sorted estimate is $\{e^o_i\}_{i=1}^N$. We then pick a modification fraction $t \in \{0, 10, \ldots, 100\}$ (in percent) and replace the top $t\%$ of pixels, in importance order, with the per-channel mean pixel value of the original images. Since retraining introduces some variation between models, each configuration is trained 5 times independently to reduce the variance (a masking sketch follows this paragraph).
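A minimal numpy sketch of the replacement step, assuming images in (N, H, W, C) layout and one importance score per pixel; reading the text above, the replacement value here is each image's per-channel mean:

```python
import numpy as np

def mask_top_fraction(images, importance, t):
    """Replace the top-t fraction of pixels, ranked by `importance`,
    with each image's per-channel mean value.

    images:     float array, shape (N, H, W, C)
    importance: per-pixel scores, shape (N, H, W); higher = more important
    t:          fraction of pixels to remove, in [0, 1]
    """
    out = images.copy()
    n, h, w, _ = images.shape
    k = int(round(t * h * w))                # pixels to replace per image
    if k == 0:
        return out
    flat = importance.reshape(n, -1)
    top = np.argsort(flat, axis=1)[:, -k:]   # k most important pixels
    channel_mean = images.mean(axis=(1, 2))  # (N, C) per-channel means
    for i in range(n):
        rows, cols = np.unravel_index(top[i], (h, w))
        out[i, rows, cols, :] = channel_mean[i]
    return out
```

The same function is applied to both the train and test splits, so that the retrained model sees the same modified distribution it will be evaluated on.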

Retraining has limitations: although the model architecture stays the same, the feature importance estimates are computed on the original model, while the estimator is evaluated on the retrained model. To understand why ROAR is still meaningful, consider the following two scenarios:

  1. We remove input dimensions and the accuracy drops. In this case, it is very likely that the removed inputs were informative to the original model. ROAR thus gives a good indication that the importance estimate is of high quality.
  2. We remove inputs and the accuracy does not drop. This can be explained as either:
    (a) It could be caused by removal of an input that was uninformative to the model. This includes the case where the input might have been informative but not in a way that is useful to the model, for example, when a linear model is used and the relation between the feature and the output is non-linear. Since in such a case the information was not used by the model, it does not show in ROAR, and we can assume ROAR behaves as intended.
    (b) There might be redundancy in the inputs. The same information could be represented in another feature. This behavior can be detected with ROAR, as we will show in our toy data experiment.

The authors also compare ROAR against the non-retraining approach on a synthetic dataset:

  • Generate a synthetic dataset: $x = \frac{\vec{a}z}{10} + \vec{d}\eta + \frac{\epsilon}{10}$, $y = (z > 0)$, where $\vec{a}$ and $\vec{d}$ are both 16-dimensional vectors and only the first four entries of $\vec{a}$ are non-zero, guaranteeing that there are exactly four informative features (a generation sketch follows this list).
  • Compare three rankings: the ground truth importance ranking, a random importance ranking, and the inverted ground truth importance ranking.
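A sketch of how such data could be generated with numpy. The specific distributions of $z$, $\eta$, and $\epsilon$ are our assumptions; the note above only fixes the functional form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 16

a = np.zeros(d)
a[:4] = rng.normal(size=4)     # only the first 4 entries are non-zero
dvec = rng.normal(size=d)      # distractor direction (the d-vector)

z = rng.normal(size=(n, 1))    # latent signal that defines the label
eta = rng.normal(size=(n, 1))  # latent distractor
eps = rng.normal(size=(n, d))  # per-feature noise

x = (a * z) / 10 + dvec * eta + eps / 10
y = (z > 0).astype(int).ravel()  # label depends only on z
```

Because the label depends only on $z$, the ground truth ranking puts the first four features on top, and ROAR should separate the three rankings cleanly.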

4 Large scale experiments

4.1 Estimators under consideration

4.1.1 Base estimators

Base estimators are estimators that compute a single estimate of importance (as opposed to ensemble methods); a sketch of the simplest one, GRAD, follows the list.

  • Gradients or Sensitivity heatmaps (GRAD)
  • Guided Backprop (GB)
  • Integrated Gradients (IG)
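A hedged PyTorch sketch of vanilla gradients (GRAD). How gradients are reduced across color channels is an assumption, since conventions differ between implementations:

```python
import torch

def grad_saliency(model, x, target):
    """Vanilla gradient (GRAD) saliency: gradient of the target-class
    logit with respect to the input, reduced over channels.

    x:      input batch, shape (N, C, H, W)
    target: class indices, shape (N,)
    returns per-pixel importance, shape (N, H, W)
    """
    x = x.detach().clone().requires_grad_(True)
    logits = model(x)
    # Summing the target logits lets one backward pass cover the batch.
    logits.gather(1, target.view(-1, 1)).sum().backward()
    return x.grad.abs().amax(dim=1)  # max |gradient| across channels
```

Guided Backprop and Integrated Gradients expose the same interface but change how the gradient signal is computed.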

4.1.2 Ensembling methods

In addition to the base approaches, we also evaluate three ensembling methods for feature importance. For all the ensemble approaches that we describe below (SG, SG-SQ, Var), we average over a set of 15 estimates, as suggested in the original SmoothGrad publication. A sketch of all three follows the list.

  • Classic SmoothGrad (SG)
  • SmoothGrad-Squared (SG-SQ)
  • VarGrad (Var)
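A sketch of the three ensembles wrapped around a gradient estimator, following the SmoothGrad recipe of adding Gaussian noise to the input. The noise scale `sigma` and the per-pixel reductions are assumptions, not the authors' exact settings:

```python
import torch

def ensemble_saliency(model, x, target, mode="sg_sq",
                      n_samples=15, sigma=0.15):
    """SmoothGrad-family ensembles over noisy copies of the input.

    mode: "sg"    -> mean of gradients         (Classic SmoothGrad)
          "sg_sq" -> mean of squared gradients (SmoothGrad-Squared)
          "var"   -> variance of gradients     (VarGrad)
    """
    grads = []
    for _ in range(n_samples):
        noisy = (x.detach() + sigma * torch.randn_like(x)).requires_grad_(True)
        logits = model(noisy)
        logits.gather(1, target.view(-1, 1)).sum().backward()
        grads.append(noisy.grad)
    g = torch.stack(grads)         # (n_samples, N, C, H, W)
    if mode == "sg":
        return g.mean(dim=0)
    if mode == "sg_sq":
        return (g ** 2).mean(dim=0)
    return g.var(dim=0)            # mode == "var"
```

Note that the three modes differ only in the final reduction, which is exactly why the paper's finding is striking: squaring or taking the variance changes the ROAR outcome dramatically, while plain averaging (SG) does not help.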

4.1.3 Control Variants

As a control, we compare each estimator to two rankings (a random assignment of importance and a Sobel edge filter) that do not depend at all on the model parameters. These controls represent a lower bound in performance that we would expect all interpretability methods to outperform. A sketch of both controls follows the list.

  • Random: A random estimator $g^R$ assigns a random binary importance $e_i \in \{0, 1\}$. This amounts to a binary vector $e \sim \mathrm{Bernoulli}(1-t)$, where $(1-t)$ is the probability that $e_i = 1$. The formulation of $g^R$ depends on neither the model parameters nor the input image (beyond the number of pixels in the image).
  • Sobel Edge Filter: convolves a hard-coded, separable, integer filter over an image to produce a mask of derivatives that emphasizes the edges in an image. A Sobel mask treated as a ranking $e$ will assign a high score to areas of the image with a high gradient (likely edges).
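Both controls are easy to write down. The sketch below uses `scipy.ndimage.sobel` for the edge filter and i.i.d. uniform scores for the random control; taking the top $t\%$ of uniform scores selects a uniformly random pixel subset, matching the Bernoulli formulation in effect:

```python
import numpy as np
from scipy import ndimage

def random_ranking(shape, seed=0):
    """Random control: scores are independent of the model and image."""
    return np.random.default_rng(seed).random(shape)

def sobel_ranking(image):
    """Sobel control: high scores along edges. `image` is (H, W, C)."""
    gray = image.mean(axis=-1)        # average over color channels
    gx = ndimage.sobel(gray, axis=0)  # derivative along rows
    gy = ndimage.sobel(gray, axis=1)  # derivative along columns
    return np.hypot(gx, gy)           # gradient magnitude
```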

4.2 Experimental setup

  • use a ResNet-50 model for both generating the feature importance estimates and subsequent training on the modified inputs.
  • evaluate ROAR on three open source image datasets: ImageNet, Birdsnap and Food 101.
  • For each dataset and estimator, we generate new train and test sets that each correspond to a different fraction of feature modification t = [0, 10, 30, 50, 70, 90].
  • We evaluate 18 estimators in total (this includes the base estimators, a set of ensemble approaches wrapped around each base and finally a set of squared estimates).
  • In total, we generate 540 large-scale modified image datasets in order to consider all experiment variants (180 new test/train for each original dataset).
  • We independently train 5 ResNet-50 models from random initialization on each of these modified datasets and report test accuracy as the average of these 5 runs.

4.3 Experimental results

4.3.1 Evaluating the random ranking

  • Goal: to answer the question: is the estimate of importance more accurate than a random guess?
  • results:
    • model performance is remarkably robust to random modification: after replacing a large portion of all inputs with a constant value, the model not only trains but still retains most of the original predictive power
    • on ImageNet, when only 10% of all features are retained, the trained model still attains 63.53% accuracy (relative to unmodified baseline of 76.68%).
      • suggests a case where many inputs are likely redundant.
      • provides additional support for the need to re-train

4.3.2 Evaluating Base Estimators

  • the left inset of Fig. 4 shows that these estimators consistently perform worse than the random assignment of feature importance across all datasets and for all thresholds t = [0.1, 0.3, 0.5, 0.7, 0.9].
  • our estimators fall further behind the accuracy of a random guess as a larger fraction t of inputs is modified. The gap is widest when t = 0.9.
  • base estimators also do not compare favorably to the performance of a Sobel edge filter (SOBEL).
  • Base estimators perform within a very narrow range
  • Comparing the performance of the base estimators under ROAR retraining vs. without retraining: the base estimators appear to be working when we do not retrain, but they are clearly not better than the random baseline when evaluated using ROAR. This provides additional support for the need to re-train.

4.3.3 Evaluating Ensemble Approaches

  • Classic SmoothGrad: despite averaging over many samples, removing the features SG ranks as important still degrades test-set accuracy less than removing a random subset, i.e., SG performs worse than a random guess. In addition, for GRAD and IG, SmoothGrad performs worse than a single estimate.
  • SmoothGrad-Squared and VarGrad:both VarGrad (VAR) and SmoothGrad-Squared (SG-SQ) far outperform the two control variants. However, the overall ranking of estimator performance differs by dataset, so the choice of the best underlying estimator may vary by task.

5 Conclusion and Future Work

  • propose ROAR to evaluate the quality of input feature importance estimators.
  • the commonly used base estimators, Gradients, Integrated Gradients, and Guided BackProp, are worse than or on par with a random assignment of importance.
  • certain ensemble approaches such as SmoothGrad are far more computationally intensive but do not improve upon a single estimate (and in some cases are worse)
  • VarGrad and SmoothGrad-Squared strongly improve the quality of these methods and far outperform a random guess
  • While we venture some initial consideration of why certain ensemble methods far outperform other estimators, the divergence in performance between the ensemble estimators is an important direction for future research.