Quantifying Uncertainty in SHAP Values: Monte Carlo Sampling and Confidence Intervals

shap project repository: https://gitcode.com/gh_mirrors/sha/shap

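Introduction

In model interpretability, SHAP (SHapley Additive exPlanations) values attribute a model's prediction to its input features, based on each feature's marginal contribution. SHAP values are themselves uncertain, however, particularly when they are computed with approximate methods. This article shows how to quantify that uncertainty with Monte Carlo sampling and how to build confidence intervals, so that model predictions can be interpreted more reliably.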

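Sources of Uncertainty in SHAP Values

The uncertainty in SHAP values comes mainly from three sources:

- Approximate computation: many SHAP explainers (such as KernelExplainer and PermutationExplainer) estimate SHAP values by sampling rather than solving for them exactly, which introduces random error.
- Background data sampling: SHAP values are computed relative to a background dataset, so the choice and subsampling of that data affect how stable the result is.
- Randomness in the model itself: models such as random forests and neural networks are trained with random seeds, so retraining can yield a different model and therefore different SHAP values; a model whose inference is itself stochastic can even return different outputs for the same input.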
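
The approximation error in the first item is easiest to appreciate against an exact reference. For a handful of features, Shapley values can be enumerated over every feature subset. The sketch below is only an illustration under stated assumptions: `toy_model`, `exact_shap`, and the synthetic data are hypothetical names introduced here, not part of the SHAP library, and a feature that counts as missing is replaced by background values and averaged over them.

```python
import itertools
import math

import numpy as np


def toy_model(X):
    # Hypothetical model: three linear terms plus one interaction
    return 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + X[:, 0] * X[:, 1]


def exact_shap(f, x, background):
    """Exact Shapley values by enumerating all feature subsets (feasible only for few features)."""
    n_features = x.shape[0]

    def value(subset):
        # Features in `subset` take the explained row's values,
        # all other features keep their background values
        Z = background.copy()
        cols = list(subset)
        if cols:
            Z[:, cols] = x[cols]
        return f(Z).mean()

    phi = np.zeros(n_features)
    for j in range(n_features):
        others = [k for k in range(n_features) if k != j]
        for size in range(n_features):
            # Shapley weight of a coalition of this size
            weight = (
                math.factorial(size)
                * math.factorial(n_features - size - 1)
                / math.factorial(n_features)
            )
            for subset in itertools.combinations(others, size):
                phi[j] += weight * (value(subset + (j,)) - value(subset))
    return phi


rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))  # synthetic background data
x = np.array([1.0, 2.0, -1.0])         # row to explain

phi = exact_shap(toy_model, x, background)
print(phi, phi.sum())
```

The values sum to the difference between the prediction for `x` and the mean prediction over the background, which is a convenient sanity check for any sampled estimate. The cost grows exponentially with the number of features, which is why the sampling-based explainers exist in the first place.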

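Monte Carlo Sampling in SHAP

Monte Carlo sampling estimates a quantity that is hard to compute exactly by averaging over random samples. For SHAP values it can be used to:

- Compute the SHAP values repeatedly and obtain their distribution
- Estimate the mean and variance of the SHAP values and build confidence intervals from them
- Assess how different parameter settings affect the stability of the SHAP values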
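
These three uses can be sketched without the SHAP library. The example below is a minimal illustration, not the library's algorithm: `toy_model` and `mc_shap_once` are hypothetical helpers, the data is synthetic, and each Monte Carlo sample pairs one random feature ordering with one random background row. Repeating the estimate many times gives its distribution, and comparing two sample sizes shows how the spread reacts to the setting.

```python
import numpy as np


def toy_model(X):
    # Hypothetical model: three linear terms plus one interaction
    return 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + X[:, 0] * X[:, 1]


def mc_shap_once(f, x, background, n_samples, rng):
    """One Monte Carlo estimate of the SHAP values of x."""
    n_features = x.shape[0]
    contrib = np.zeros((n_samples, n_features))
    for s in range(n_samples):
        order = rng.permutation(n_features)                       # random feature ordering
        z = background[rng.integers(len(background))].copy()      # random background row
        prev = f(z[None, :])[0]
        for j in order:
            z[j] = x[j]                                           # switch feature j "on"
            cur = f(z[None, :])[0]
            contrib[s, j] = cur - prev                            # marginal contribution of j
            prev = cur
    return contrib.mean(axis=0)


rng = np.random.default_rng(0)
background = rng.normal(size=(50, 3))  # synthetic background data
x = np.array([1.0, 2.0, -1.0])         # row to explain


def repeat(n_samples, n_repeats=100):
    """Repeat the estimate and summarize its distribution."""
    runs = np.array(
        [mc_shap_once(toy_model, x, background, n_samples, rng) for _ in range(n_repeats)]
    )
    return runs.mean(axis=0), runs.std(axis=0, ddof=1)


mean_small, std_small = repeat(n_samples=10)
mean_large, std_large = repeat(n_samples=160)
print("std with 10 samples: ", std_small)
print("std with 160 samples:", std_large)
```

With sixteen times as many samples, the spread of the estimate should shrink by roughly a factor of four, in line with the usual 1/sqrt(n) behaviour of Monte Carlo averages.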

The SHAP library ships several explainers that rely on Monte Carlo sampling; the two most commonly used are PermutationExplainer and SamplingExplainer.

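Monte Carlo Sampling in PermutationExplainer

PermutationExplainer estimates SHAP values by repeatedly drawing a random ordering of the features, adding the features to the model in that order, and recording the marginal contribution of each feature as it is added. The average of these contributions over all orderings is the SHAP value estimate.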

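from shap import PermutationExplainer

# Create a PermutationExplainer (model and background_data are assumed to exist)
explainer = PermutationExplainer(model.predict, background_data)

# Call the explainer directly: the legacy shap_values() method returns only a plain
# array of values, so the uncertainty estimate would not be available from it.
# The callable interface takes an evaluation budget instead of a permutation count;
# in the current implementation one permutation costs 2 * n_features + 1 evaluations.
npermutations = 100
explanation = explainer(X, max_evals=npermutations * (2 * X.shape[1] + 1), error_bounds=True)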

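In the PermutationExplainer implementation, the explain_row method draws random feature orderings (the Monte Carlo samples), computes the marginal contribution of each feature under every ordering, stores these contributions in row_values_history, and finally takes their standard deviation as the uncertainty estimate:

# Excerpts from [shap/explainers/_permutation.py](https://link.gitcode.com/i/1ebca0088b4392f05a7ea0f28b00831f)
# Allocate one history row per pass; each permutation is walked forward and backward,
# hence 2 * npermutations rows
if error_bounds:
    row_values_history = np.zeros(
        (
            2 * npermutations,
            len(fm),
        )
        + outputs.shape[1:]
    )

# Store the marginal contribution from each sample
if error_bounds:
    row_values_history[history_pos][ind] = outputs[i + 1] - outputs[i]

# Use the standard deviation across samples as the uncertainty estimate
"error_std": None if row_values_history is None else row_values_history.std(0)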

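Monte Carlo Sampling in SamplingExplainer

SamplingExplainer estimates SHAP values by drawing random rows from the background data to stand in for features that are treated as missing. The SHAP documentation describes it as a good alternative to KernelExplainer when a large background set is to be used.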

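from shap import SamplingExplainer

# Create a SamplingExplainer (model and background_data are assumed to exist)
explainer = SamplingExplainer(model.predict, background_data)

# Compute SHAP values by Monte Carlo sampling; nsamples sets the sampling budget
shap_values = explainer.shap_values(X, nsamples=1000)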

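The sampling_estimate method of SamplingExplainer implements the core Monte Carlo logic. For feature j it repeatedly draws a random feature ordering and a random background row, evaluates the model once with feature j present and once with it replaced by the background value, and uses the differences to estimate the SHAP value and its variance:

# Excerpt from [shap/explainers/_sampling.py](https://link.gitcode.com/i/4498fc4b1720f523c30fd1682744ae5c)
def sampling_estimate(self, j, f, x, X, nsamples=10):
    X_masked = self.X_masked[: nsamples * 2, :]
    inds = np.arange(X.shape[1])
    
    for i in range(nsamples):
        np.random.shuffle(inds)  # random feature ordering (Monte Carlo sample)
        pos = np.where(inds == j)[0][0]
        rind = np.random.randint(X.shape[0])  # random row from the background data
        # "on" row: features after j in the ordering come from the background row
        X_masked[i, :] = x
        X_masked[i, inds[pos + 1 :]] = X[rind, inds[pos + 1 :]]
        # "off" row: feature j is replaced by the background value as well
        X_masked[-(i + 1), :] = x
        X_masked[-(i + 1), inds[pos:]] = X[rind, inds[pos:]]
    
    evals = f(X_masked)
    evals_on = evals[:nsamples]
    evals_off = evals[nsamples:][::-1]
    d = evals_on - evals_off
    
    return np.mean(d, 0), np.var(d, 0)  # mean and variance of the differences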

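Building Confidence Intervals for SHAP Values

Once Monte Carlo samples of the SHAP values are available, confidence intervals can be built to quantify the uncertainty. Two common approaches are intervals based on the normal distribution and intervals based on the bootstrap.

Confidence Intervals Based on the Normal Distribution

If the sampling distribution of the SHAP value estimate is approximately normal, the 95% confidence interval is:

$$\text{CI} = [\hat{\phi} - 1.96 \times \text{SE}(\hat{\phi}), \hat{\phi} + 1.96 \times \text{SE}(\hat{\phi})]$$

Here $\hat{\phi}$ is the estimated mean SHAP value and $\text{SE}(\hat{\phi})$ is its standard error, that is, the standard deviation of the Monte Carlo samples divided by the square root of the number of samples.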
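
As a worked example of the formula, the helper below turns a mean, a sample standard deviation, and a sample count into an interval. It is a sketch with made-up numbers: `normal_ci` is a hypothetical helper, and the two features with means 0.8 and -0.3 are invented for illustration. The quantile comes from Python's standard library rather than being hard-coded, so other confidence levels work as well.

```python
from statistics import NormalDist

import numpy as np


def normal_ci(mean, std, n, level=0.95):
    """Normal-approximation confidence interval for a Monte Carlo mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for level=0.95
    mean = np.asarray(mean, dtype=float)
    se = np.asarray(std, dtype=float) / np.sqrt(n)  # standard error
    return mean - z * se, mean + z * se


# Invented example: two features, 100 Monte Carlo samples each
lower, upper = normal_ci(mean=[0.8, -0.3], std=[0.5, 0.2], n=100)
print(lower)  # approximately [ 0.702  -0.339]
print(upper)  # approximately [ 0.898  -0.261]
```

An interval that excludes zero, as both do here, suggests that the sign of the feature's contribution is stable under the sampling noise.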

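With PermutationExplainer, the per-feature standard deviation is available through the error_std attribute of the returned explanation. Dividing it by the square root of the number of stored contributions gives the standard error, from which the interval follows:

import numpy as np

# Request error bounds through the callable interface (explainer as created above);
# the legacy shap_values() method returns a plain array without error_std
npermutations = 100
explanation = explainer(X, max_evals=npermutations * (2 * X.shape[1] + 1), error_bounds=True)
phi = explanation.values

# error_std is taken over the 2 * npermutations stored contributions; treating
# them as independent draws is an approximation
se = explanation.error_std / np.sqrt(2 * npermutations)  # standard error

# 95% confidence interval
ci_lower = phi - 1.96 * se
ci_upper = phi + 1.96 * se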

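Confidence Intervals Based on the Bootstrap

The bootstrap resamples the original data with replacement, recomputes the SHAP values on every resample, and builds the interval from the resulting distribution. Because each resample contains different rows, the interval is built for a per-feature summary, here the mean absolute SHAP value, rather than for individual rows.

import numpy as np

def bootstrap_shap_confidence_interval(explainer, X, n_bootstrap=100, alpha=0.05, seed=None):
    """Percentile bootstrap interval for the mean |SHAP value| of each feature."""
    rng = np.random.default_rng(seed)
    importance_bootstrap = []
    
    for _ in range(n_bootstrap):
        # Sample rows with replacement (works for DataFrames and arrays)
        bootstrap_indices = rng.integers(0, len(X), size=len(X))
        X_bootstrap = X.iloc[bootstrap_indices] if hasattr(X, "iloc") else X[bootstrap_indices]
        
        # Compute SHAP values and reduce them to one statistic per feature;
        # row i is a different instance in every resample, so row-wise
        # percentiles across resamples would not be meaningful
        shap_values = np.asarray(explainer.shap_values(X_bootstrap))
        importance_bootstrap.append(np.abs(shap_values).mean(axis=0))
    
    # Shape (n_bootstrap, n_features) for a single-output model
    importance_bootstrap = np.array(importance_bootstrap)
    
    # Percentile confidence interval
    lower_percentile = alpha / 2 * 100
    upper_percentile = (1 - alpha / 2) * 100
    ci_lower = np.percentile(importance_bootstrap, lower_percentile, axis=0)
    ci_upper = np.percentile(importance_bootstrap, upper_percentile, axis=0)
    
    return ci_lower, ci_upper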
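
Recomputing SHAP values inside every bootstrap iteration can be slow. A cheaper variant, sketched here under stated assumptions, computes the SHAP matrix once and resamples its rows. `bootstrap_mean_abs_shap` is a hypothetical helper and `fake_shap` is a synthetic stand-in for a real (n_rows, n_features) matrix of SHAP values. Note that this variant only captures the variability due to which rows were explained, not the Monte Carlo noise inside each SHAP value.

```python
import numpy as np


def bootstrap_mean_abs_shap(shap_matrix, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean |SHAP| per feature, resampling rows only."""
    rng = np.random.default_rng(seed)
    abs_vals = np.abs(np.asarray(shap_matrix))
    n_rows, n_features = abs_vals.shape

    stats = np.empty((n_bootstrap, n_features))
    for b in range(n_bootstrap):
        idx = rng.integers(0, n_rows, size=n_rows)  # rows drawn with replacement
        stats[b] = abs_vals[idx].mean(axis=0)

    lower = np.percentile(stats, 100 * alpha / 2, axis=0)
    upper = np.percentile(stats, 100 * (1 - alpha / 2), axis=0)
    return abs_vals.mean(axis=0), lower, upper


# Synthetic stand-in for a precomputed SHAP matrix: 200 rows, 3 features
rng = np.random.default_rng(1)
fake_shap = rng.normal(scale=[1.0, 0.3, 0.05], size=(200, 3))

point, lower, upper = bootstrap_mean_abs_shap(fake_shap)
for j in range(3):
    print(f"feature {j}: {point[j]:.3f} [{lower[j]:.3f}, {upper[j]:.3f}]")
```

If the intervals of two features do not overlap, their ranking by global importance is unlikely to be an artifact of the particular rows that were explained.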

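Visualizing the Uncertainty of SHAP Values

SHAP's plotting functions visualize the values themselves; uncertainty estimates such as error_std can be layered on top with matplotlib. Some common options:

SHAP Bar Chart with Error Bars

import matplotlib.pyplot as plt
import numpy as np
import shap

# Compute SHAP values together with their standard deviations
explainer = shap.PermutationExplainer(model.predict, background_data)
explanation = explainer(X, max_evals=100 * (2 * X.shape[1] + 1), error_bounds=True)

# Bars: mean |SHAP| per feature; error bars: average per-row error_std.
# Drawn with matplotlib, as we know of no error-bar option in shap.summary_plot.
importance = np.abs(explanation.values).mean(axis=0)
spread = explanation.error_std.mean(axis=0)
labels = [f"Feature {i}" for i in range(len(importance))]
plt.barh(labels, importance, xerr=spread)
plt.xlabel("mean(|SHAP value|)")
plt.show()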

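SHAP Decision Plot

# Draw the decision path of each prediction, from the base value to the model output
explanation = explainer(X)
shap.decision_plot(explanation.base_values[0], explanation.values, X)

The plot traces how each feature moves a prediction from the base value to the final model output. As far as we can tell, decision_plot does not draw confidence bands itself, so the intervals computed earlier have to be reported alongside it, for example as error bars for the features of a single instance.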

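Practical Recommendations

Choose a suitable explainer:
- For small datasets, prefer PermutationExplainer with npermutations in the range 100-1000
- For large background datasets, choose SamplingExplainer with nsamples in the range 1000-10000
Balance computational cost and accuracy:
- Increasing the number of samples (npermutations or nsamples) reduces the uncertainty, with the standard error shrinking roughly as 1/sqrt(n), but increases computation time
- For initial exploration, a small number of samples is enough; for a final report, use a larger number
Bring in domain knowledge:
- Uncertainty in SHAP values can stem from several factors, including data quality and model stability
- Features with high uncertainty may call for additional data collection or feature engineering
Consult the official documentation:
- See the official SHAP documentation for more details
- Example code is available as Jupyter notebooks under the notebooks/overviews/ directory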
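
The trade-off between sample count and accuracy can be made concrete with a small planning calculation. The sketch below assumes the normal approximation holds and that a short pilot run has already produced a rough standard deviation of the per-sample contributions. `required_samples` is a hypothetical helper, and the numbers are invented for illustration.

```python
import math
from statistics import NormalDist


def required_samples(pilot_std, half_width, level=0.95):
    """Monte Carlo samples needed so that the CI half-width is at most `half_width`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Solve z * pilot_std / sqrt(n) <= half_width for n
    return math.ceil((z * pilot_std / half_width) ** 2)


# Invented example: pilot std of 0.5, target interval of +/- 0.05 at 95% confidence
n_needed = required_samples(pilot_std=0.5, half_width=0.05)
print(n_needed)  # 385
```

Halving the target half-width quadruples the required number of samples, which is why tight intervals are usually reserved for the final report rather than for exploration.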

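Conclusion

Monte Carlo sampling is an effective way to quantify the uncertainty of SHAP values, and PermutationExplainer and SamplingExplainer make it straightforward to apply. Building and visualizing confidence intervals for SHAP values gives a more complete picture of model predictions and helps avoid over-interpreting unstable values. In practice, the explainer and the number of samples should be chosen according to the data size and the available compute, balancing the accuracy of the explanation against its cost.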

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.