【隐语实践】逻辑回归LR与广义线性模型GLM开发实践

本文链接：https://blog.youkuaiyun.com/Rabbit_QL/article/details/139796710

课程链接：https://www.bilibili.com/video/BV1ky411B7bH

讲师：刘颖婷
环境：我们以v1.6.1b0版本为例，通过secretnote进行介绍。

一、背景知识-LR/GLM应用场景及原理

1. 场景

(1) 二分类问题：逻辑回归

对数据进行二元分类：例如对病人的数据进行疾病诊断，如左图所示。
预测某件事情发生的概率：例如预测一个网站的用户变成付费用户的概率，如右图所示。

(2) 广义线性模型（GLM）：保险理赔（保费预测）

风险保费预测，根据要提供的保障责任，计算预期总索赔额

直接对纯保费建模
- tweedie 分布（1,2）
也可以通过两步建模间接近似：纯保费 = 索赔次数 * 平均索赔金额
- 索赔次数：泊松分布、负二项分布
- 平均索赔金额：伽马分布、逆高斯分布

2. 广义线性模型

可以参考知乎大神作品：https://zhuanlan.zhihu.com/p/339380506

线性回归适用于响应变量服从正态分布的情况。
广义线性回归适用于响应变量服从各种指数分布族的情况，如二项分布、泊松分布等。

(1) 线性回归是最简单的广义线性模型

线性回归假设响应变量与解释变量之间存在线性关系，误差项独立且服从正态分布。
线性回归的公式

它是GLM的一个基本形式，其假设响应变量 $Y$ 的真实值由两部分组成：
$y_{actual} = \beta_0+\mathbf{x}_1\beta_1+...+\mathbf{x}_p\beta_p+\gamma$
其中， $y_{actual}$ 是响应变量， $\mathbf{x}_i$ 是解释变量， $\beta_i$ 是参数， $\gamma$ 是误差项。

系统组件(system component)：线性预测器 $\eta = \mathbf{x}^T \beta$ (数值项，可以拟合)

误差组件(error component)：白噪声，近似服从 $N (0, 1)$ (高斯随机变量)

因此，响应变量 $Y$ 的条件分布为高斯分布 $\sim N(\mathbf{x}^T\mathbf{\beta}, 1)$

(2) GLM是线性回归的扩展

广义线性回归允许响应变量服从广泛的分布，并通过连接函数将响应变量的期望值与解释变量的线性组合关联。

广义线性回归（Generalized Linear Model, GLM）是线性回归的扩展，用于处理不满足线性回归假设的数据。其模型形式是：
$g(\mu)=\beta_0+\mathbf{x}_1\beta_1+...+\mathbf{x}_p\beta_{p}$
其中， $g$ 是连接函数， $\mu$ 是响应变量的期望值。

一个广义线性模型有三个关键组件：
- 系统组件(线性预测器)：即 $\eta =\beta^T\mathbf{x} =\mathbf{x}_1\beta_1+...+\mathbf{x}_p\beta_{p}$
- 随机分布组件：响应变量 $Y$ 的概率分布 $p(Y;\theta)$ ，可以来自指数分布族，如正态分布、泊松分布、二项分布等。
  - 这里， $\theta$ 是分布的自然参数
  - $\theta$ 和 $\mu$ 存在一一映射关系，将这种关系记为 $\Psi$
- 连接函数 $g$ ：使得 $\eta = g(\mu)$ 将期望值 $\mu$ 与线性预测组件关联起来，描述系统组件 $\eta$ 和随机组件之间的关系。
  - 对于逻辑回归， $Y$ 的真实值为0或1，期望 $\mu$ 值的范围为 $[0, 1]$
  - 对于线性回归， $Y=\beta^T\mathbf{x}$ ，因为 $\mathbf{x} \in (-\infin, +\infin)$ ，因此 $Y$ 的期望 $\mu$ 的值为 $(-\infin, +\infin)$
  因此，链接函数就是用于这样的映射的。逻辑回归中用Sigmoid进行这样的映射；如果连接函数与 $\Psi$ 一致时，称为标准连接函数。常用连接函数如下：

在这里插入图片描述

	激活函数和连接函数互为一对逆运算。连接函数提供了一种将线性预测器与非线性概率分布联系起来的方式，而激活函数则是其逆操作，例如将线性预测器的结果转换为概率值。

GLM允许误差项的概率分布扩展为指数分布族: 伯努利分布（逻辑回归）、泊松分布、gamma分布、复合泊松Gamma分布、Tweedie分布等
常见的GLM模型：
- 线性回归：用于正态分布的响应变量，连接函数是恒等函数。
- 逻辑回归：用于二项分布的响应变量，连接函数是logit函数。
- 泊松回归：用于泊松分布的响应变量，连接函数是对数函数。

二、隐语模型-密态SSLR/SSGLM

1. 广义线性模型参数估计

(1) 一阶优化器：SGD参数估计方法

公式
- 计算损失函数对权重的偏导数
- 基于偏导数进行参数更新： $\beta_j = \beta_j - lr * \frac{\partial l}{\partial{\beta_j}}$
属于随机优化算法，主要用于大规模和在线学习问题，通过在每次迭代中使用一部分（或一个）数据点来更新参数。
用于最小化任意目标函数，特别是凸函数和大规模问题。SGD通过随机选择数据点来计算梯度，并逐步更新参数，以逼近全局最优解。

(2) 二阶优化器：迭代重加权最小二乘法（IRLS）

Iteratively Reweighted Least Squares, IRLS

https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares

优点：
- 初始化准确，根据 $E [y]$ 计算得到
- 二阶优化器，速度快
缺点：
- 计算量大
- 通信复杂度高
公式
属于确定性算法，主要用于解决加权最小二乘问题，通过在每次迭代中调整权重来降低异常值的影响。
主要用于加权最小二乘问题，目标是最小化加权残差平方和。IRLS通过迭代调整权重来处理异常值，使得拟合更稳健。

(3) 隐语实现：二阶优化器+一阶优化器

初始几轮，先用二阶优化器进行简单迭代，之后再用二阶优化器进行收敛。

2. 秘密分享加法

Alice拥有15，Bob拥有25，Charlie拥有10，如何计算三者的和？
在这里插入图片描述

(1) 秘密切分

将已拥有的秘密值进行分片，例如划分为 $n = 3$ 份，生成两个随机数 $r_1$ 和 $r_2$ ， $r_3 = r - r_1 - r_2$

Alice拥有15，先随机生成两个分片4和6，最后一个分片利用 $15 - 4 - 6 = 5$ 得到。
Bob拥有25，先随机生成8和9，最后一个分片利用 $25 - 8 - 9 = 8$ 得到。
Chalie拥有10，先随机生成3和5，最后一个分片利用 $10 - 3 - 5 = 2$ 得到

(2) 分片交换与求和

将分片进行分享，同事接受来自其他参与方的分片，并进行求和。

Alice拿到了Chalie的3和Bob的9，但是无法获知具体的秘密值。进行求和，得到 $3 + 9 + 6 = 18$
Bob拿到了Alice的5和Chalie的5，进行求和，得到 $5 + 5 + 8 = 18$
Chalie拿到了Alice的4和Bob的8，进行求和，得到 $4 + 8 + 2 = 14$

(3) 结果构建

收集三个本地结果，将其求和。

分片总和为 $18 + 18 + 14 = 50$ 。
验证下结果，明文总和为 $15 + 25 + 10 = 50$

3. 秘密分享乘法

这里讲解的时候似乎没有很详细

在这里插入图片描述

首先，进行秘密分片
- Alice将秘密 $a$ 分为了 $a_0$ 和 $a_1 = a - a_0$
- Bob将秘密 $b$ 分为了 $b_0$ 和 $b_1 = b - b_0$
将分片进行交换
- Alice拿到了Bob的 $b_0$
- Bob拿到了Alice的 $a_0$
初始化3个临时变量 $u$ 、 $v$ 和 $z$ ，满足 $z = uv$

Alice拿到了 $u_0$ 、 $v_0$ 和 $z_0$ ，Bob拿到了 $u_1$ 、 $v_1$ 和 $z_1$
本地计算
- Alice计算 $e_0=a_0-u_0$ 和 $f_0 = b_0-v_0$
- Bob计算 $e_1 = a_1 - u_1$ 和 $f_1 = b_1-v_1$
中间结果分享
- Alice将 $e_0$ 和 $f_0$ 分享给Bob
- Bob将 $e_1$ 和 $f_1$ 分享给Alice
本地计算
- 先在本地计算和
  - $e = e_0+e_1$
    
    因为 $e_0=a_0-u_0$ 和 $e_1 = a_1-u_1$
    
    因此相加得到 $e = a_0+a_1-u=a-u$
  - $f = f_0+f_1$
    
    因为 $f_0=b_0-v_0$ 和 $f_1 = b_1- v_1$
    
    因此相加得到 $f = b_0+b_1-v=b-v$
- Alice计算： $c_0 = fa_0+eb_0+z_0$
  
  $c_0=(b-v)a_0+(a-u)b_0+z_0$
- Bob计算： $c_1 = -ef+ fa_1+eb_1+z_1$
  
  $c_1=-(a-u)(b-v)+(b-v)a_1+(a-u)b_1+z_1$
汇集结果得到乘法值：
$c_0+c_1=a\times b$

$\begin{align} c_0+c_1 &=(b-v)a_0+(a-u)b_0+z_0 -(a-u)(b-v)+(b-v)a_1+(a-u)b_1+z_1 \\ &= (b-v)(a_0+a_1)+(a-u)(b_0+b_1) + z_0 + z_1 - (a-u)(b-v) \\ &= (b-v)a+(a-u)b+z-(a-u)(b-v) \\ &= ab-va+ab-ub+z - ab + av+ub-uv \\ &=ab+z-uv \\ &=ab \end{align}$

三、应用实现-从理论到隐语应用

这里主要使用GLM的component封装的组件进行应用

https://github.com/secretflow/secretflow/blob/v1.6.1b0/secretflow/component/ml/linear/ss_glm.py

1. 参数定义

distribution type

响应变量的分布类型

ss_glm_train_comp.str_attr(
    name="label_dist_type",
    desc="label distribution type",
    is_list=False,
    is_optional=False,
    allowed_values=["Bernoulli", "Poisson", "Gamma", "Tweedie"],
)

支持伯努利分布、泊松分布、gamma分布和Tweedie

链接函数

ss_glm_train_comp.str_attr(
    name="link_type",
    desc="link function type",
    is_list=False,
    is_optional=False,
    allowed_values=["Logit", "Log", "Reciprocal", "Identity"],
)

Log: 泊松分布、gamma分布和Tweedie

Logit：伯努利分布

tweedie的power值

一般范围选择 $[1, 2]$ ， $p$ 值根据先验知识得出。
```
ss_glm_train_comp.float_attr(
    name="tweedie_power",
    desc="Tweedie distribution power parameter",
    is_list=False,
    is_optional=True,
    default_value=1,
    lower_bound=0,
    lower_bound_inclusive=True,
    upper_bound=2,
    upper_bound_inclusive=True,
)
```

优化器选择

ss_glm_train_comp.str_attr(
    name="optimizer",
    desc="which optimizer to use: IRLS(Iteratively Reweighted Least Squares) or SGD(Stochastic Gradient Descent)",
    is_list=False,
    is_optional=False,
    allowed_values=["SGD", "IRLS"],
)

一阶优化器SGD
二阶优化器IRLS

也可以采用前几轮IRLS，之后用SGD

ss_glm_train_comp.int_attr(
    name="iter_start_irls",
    desc="""run a few rounds of IRLS training as the initialization of w,
    0 disable""",
    is_list=False,
    is_optional=True,
    default_value=0,
    lower_bound=0,
    lower_bound_inclusive=True,
)

dist_scale数据方差(先验)

ss_glm_train_comp.float_attr(
    name="dist_scale",
    desc="A guess value for distribution's scale",
    is_list=False,
    is_optional=True,
    default_value=1,
    lower_bound=1,
    lower_bound_inclusive=True,
)

这里，对于教程中提到的offset_col和weight_col部分的代码未在v1.6.1b0的component代码中找到。

2. SS-LR / SSGLM 在隐语实现具有的独特优势

可证安全：计算逻辑基于秘密分享
不依赖可信第三方（例如Fate的Arbiter）
支持多种模型（伯努利分布（逻辑回归），泊松分布，gamma分布，Tweedie分布）
计算高效：结合一阶优化器和二阶优化器的优点，整体提高SSGLM的迭代速度。

四、实践—Tweedie分布建模

需要自行配置secretnote界面哦，可以参考：https://www.bilibili.com/video/BV11Z421H79q

代码部分参考：https://github.com/secretflow/secretflow/blob/v1.6.1b0/docs/tutorial/ss_glm.ipynb

基于sklearn调用TweedieRegressior对open_mtpl2数据集进行建模
调用SF提供的SS_GLM接口，对open_mtpl2数据集进行建模
对比明/密文训练下得到的预测结果

1. 准备数据集

法国机动车第三方责任保险索赔数据集

def load_mtpl2(n_samples=None):
    """Fetch the French Motor Third-Party Liability Claims dataset.

    Parameters
    ----------
    n_samples: int, default=None
      number of samples to select (for faster run time). Full dataset has
      678013 samples.
    """
    # freMTPL2freq dataset from https://www.openml.org/d/41214
    df_freq = fetch_openml(data_id=41214, as_frame=True).data
    df_freq["IDpol"] = df_freq["IDpol"].astype(int)
    df_freq.set_index("IDpol", inplace=True)

    # freMTPL2sev dataset from https://www.openml.org/d/41215
    df_sev = fetch_openml(data_id=41215, as_frame=True).data

    # sum ClaimAmount over identical IDs
    df_sev = df_sev.groupby("IDpol").sum()

    df = df_freq.join(df_sev, how="left")
    df["ClaimAmount"].fillna(0, inplace=True)

    # unquote string fields
    for column_name in df.columns[df.dtypes.values == object]:
        df[column_name] = df[column_name].str.strip("'")
    return df.iloc[:n_samples]

涉及两个数据集：
- 从 OpenML 平台下载数据集 ID 为 41214 的 freMTPL2freq 数据集，并将其加载为 Pandas DataFrame。
  - 将 IDpol 列的数据类型转换为整数，并将其设为 DataFrame 的索引。
- 从 OpenML 平台下载数据集 ID 为 41215 的 freMTPL2sev 数据集，并将其加载为 Pandas DataFrame。
  - 对 df_sev 按 IDpol 列进行分组，并对每组的 ClaimAmount 列进行求和。这样可以确保每个 IDpol 只对应一个总的 ClaimAmount。
- 合并数据集
- 遍历 DataFrame 中所有字符串类型的列，去掉每个字符串值首尾的单引号。
- 根据参数 n_samples 返回指定数量的样本。如果未指定 n_samples，则返回整个 DataFrame。
这里假设取前50000个样本

df = load_mtpl2(n_samples=50000)

这里的df为(50000, 12)

2. 特征预处理和特征工程

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import (
    FunctionTransformer,
    KBinsDiscretizer,
    OneHotEncoder,
    StandardScaler,
)
# Note: filter out claims with zero amount, as the severity model
# requires strictly positive target values.
df.loc[(df["ClaimAmount"] == 0) & (df["ClaimNb"] >= 1), "ClaimNb"] = 0

# Correct for unreasonable observations (that might be data error)
# and a few exceptionally large claim amounts
df["ClaimNb"] = df["ClaimNb"].clip(upper=4)
df["Exposure"] = df["Exposure"].clip(upper=1)
df["ClaimAmount"] = df["ClaimAmount"].clip(upper=200000)

log_scale_transformer = make_pipeline(
    FunctionTransformer(func=np.log), StandardScaler()
)

column_trans = ColumnTransformer(
    [
        (
            "binned_numeric",
            KBinsDiscretizer(n_bins=10, subsample=int(2e5), random_state=0),
            ["VehAge", "DrivAge"],
        ),
        (
            "onehot_categorical",
            OneHotEncoder(),
            ["VehBrand", "VehPower", "VehGas", "Region", "Area"],
        ),
        ("passthrough_numeric", "passthrough", ["BonusMalus"]),
        ("log_scaled_numeric", log_scale_transformer, ["Density"]),
    ],
    remainder="drop",
)
X = column_trans.fit_transform(df)

保险公司感兴趣的是纯保费建模，即其投资组合中每个保单持有人每单位风险的预期总索赔金额，因此构造label为纯保费：df["PurePremium"] = df["ClaimAmount"] / df["Exposure"]

计算索赔率和平均索赔金额

df["Frequency"] = df["ClaimNb"] / df["Exposure"]
df["AvgClaimAmount"] = df["ClaimAmount"] / np.fmax(df["ClaimNb"], 1)

数据集划分

from sklearn.model_selection import train_test_split

df_train, df_test, X_train, X_test = train_test_split(df, X, random_state=0)

3. 利用sklearn的TweedieRegressor拟合明文数据集

超参数：
- power：决定了 Tweedie 分布的指数值。对于 power 在 (1, 2) 之间的值，Tweedie 分布适用于拥有零膨胀的正态分布数据（即数据包含许多零值和一些正值）。
  - 具体来说，power=1.9 通常用于处理保险索赔等数据，其中存在大量的零值和一些正值索赔金额。
- alpha：正则化参数，用于控制模型的复杂度。较大的 alpha 值会导致更多的正则化，从而减少过拟合，但可能也会降低模型的拟合程度。
- solver：优化算法，选择 newton-cholesky 作为求解器（这是一种用于处理较大规模数据集的优化算法，具有较高的计算效率）
input
- 训练特征数据集，包含独立变量（特征）。
- 这是训练目标变量数据集，这里我们选择PurePremium值。
- sample_weight为样本加权，更准确地反映不同样本的贡献。

训练

from sklearn.linear_model import TweedieRegressor

glm_pure_premium = TweedieRegressor(power=1.9, alpha=0.1, solver='newton-cholesky')
glm_pure_premium.fit(
    X_train, df_train["PurePremium"], sample_weight=df_train["Exposure"]
)

定义score查看拟合效果

def score_estimator(
    estimator,
    X_train,
    X_test,
    df_train,
    df_test,
    target,
    weights,
    tweedie_powers=None,
):
    """Evaluate an estimator on train and test sets with different metrics"""

    metrics = [
        ("D² explained", None),  # Use default scorer if it exists
        ("mean abs. error", mean_absolute_error),
        ("mean squared error", mean_squared_error),
    ]
    if tweedie_powers:
        metrics += [
            (
                "mean Tweedie dev p={:.4f}".format(power),
                partial(mean_tweedie_deviance, power=power),
            )
            for power in tweedie_powers
        ]

    res = []
    for subset_label, X, df in [
        ("train", X_train, df_train),
        ("test", X_test, df_test),
    ]:
        y, _weights = df[target], df[weights]
        for score_label, metric in metrics:
            if isinstance(estimator, tuple) and len(estimator) == 2:
                # Score the model consisting of the product of frequency and
                # severity models.
                est_freq, est_sev = estimator
                y_pred = est_freq.predict(X) * est_sev.predict(X)
            else:
                y_pred = estimator.predict(X)

            if metric is None:
                if not hasattr(estimator, "score"):
                    continue
                score = estimator.score(X, y, sample_weight=_weights)
            else:
                score = metric(y, y_pred, sample_weight=_weights)

            res.append({"subset": subset_label, "metric": score_label, "score": score})
    res = (
        pd.DataFrame(res)
        .set_index(["metric", "subset"])
        .score.unstack(-1)
        .round(4)
        .loc[:, ["train", "test"]]
    )
    return res

如果提供了 tweedie_powers 参数，还会计算不同 Tweedie deviance 的平均值。
partial(mean_tweedie_deviance, power=power) 创建了一个 mean_tweedie_deviance 函数的偏函数，其中 power 参数被设定为当前的 power 值。这样就可以在后续计算中直接使用这个函数，而无需每次都传递 power 参数。

查看得分

scores = pd.concat(
    [scores_glm_pure_premium],
    axis=1,
    sort=True,
    keys=("TweedieRegressor"),
)
print("Evaluation of the Product Model and the Tweedie Regressor on target PurePremium")
with pd.option_context("display.expand_frame_repr", False):
    print(scores)

查看迭代次数
```
n_iter_sklearn = glm_pure_premium.n_iter_
```
基于上述参数设定，迭代了5次

4. 利用spu进行建模

(1) 配置

ray cluster

import secretflow as sf

cluster_config = {
    "parties": {
        "alice": {
            # replace with alice's real address
            "address": "alice的任务通信端口",
            "listen_addr": "0.0.0.0:60411"

        },
        "bob": {
            "address": "bob的任务通信端口",
            "listen_addr": "0.0.0.0:42749"
        },
    },
    'self_party': 'alice'
}

sf.shutdown()
sf.init(address="ray cluster head", cluster_config=cluster_config)

import spu

cluster_conf = {
    "nodes": [
        {
            "party": "alice",
            "address": "alice的SPU端口"
        },
        {
            "party": "bob",
            "address": "bob的SPU端口"
        },
    ],
    "runtime_config": {
        "protocol": spu.spu_pb2.SEMI2K,
        "field": "FM128",
        "fxp_fraction_bits": 40,
        "sigmoid_mode": spu.spu_pb2.RuntimeConfig.SIGMOID_REAL,
    },
}
spu = sf.SPU(
    cluster_def=cluster_conf,
    link_desc={
        "connect_retry_times": 60,
        "connect_retry_interval_ms": 1000
    },
)

(2) 封装纵向数据集

alice方拥有前15个特征
bob方拥有其他特征

from secretflow.data import FedNdarray, PartitionWay

x, y = X_train, df_train["PurePremium"]
w = df_train["Exposure"]


def x_to_vdata(x):
    x = x.todense()
    v_data = FedNdarray(
        partitions={
            alice: alice(lambda: x[:, :15])(),
            bob: bob(lambda: x[:, 15:])(),
        },
        partition_way=PartitionWay.VERTICAL,
    )
    return v_data


v_data = x_to_vdata(x)

label_data = FedNdarray(
    partitions={alice: alice(lambda: y.values)()},
    partition_way=PartitionWay.VERTICAL,
)

sample_weight = FedNdarray(
    partitions={alice: alice(lambda: w.values)()},
    partition_way=PartitionWay.VERTICAL,
)

(3) 使用SSGLM

定义用于reveal结果的类：用于reveal预测结果和计算得分

from secretflow.device.driver import reveal
from secretflow.ml.linear.ss_glm.core import get_dist

dist = 'Tweedie'
ss_glm_power = 1.9


class DirectRevealModel:
    def __init__(self, model) -> None:
        self.model = model

    def predict(self, X):
        vdata = x_to_vdata(X)
        y = self.model.predict(vdata)
        return reveal(y).reshape((-1,))

    def score(self, X, y, sample_weight=None):
        y = y.values
        y_pred = self.predict(X)

        constant = np.mean(y)
        if sample_weight is not None:
            constant *= sample_weight.shape[0] / np.sum(sample_weight)

        # Missing factor of 2 in deviance cancels out.
        deviance = get_dist(dist, 1, ss_glm_power).deviance(y_pred, y, None)
        deviance_null = get_dist(dist, 1, ss_glm_power).deviance(
            np.average(y, weights=sample_weight) + np.zeros(y.shape), y, None
        )
        return 1 - (deviance + constant) / (deviance_null + constant)

训练
设置迭代epochs=10

import time
from secretflow.ml.linear.ss_glm import SSGLM

model = SSGLM(spu)

ss_glm_power = 1.9
start = time.time()
model.fit_irls(
    v_data,
    label_data,
    None,
    sample_weight,
    10,
    'Log',
    'Tweedie',
    ss_glm_power,
    l2_lambda=0.1
)

wrapped_model = DirectRevealModel(model)

我本地运行了46s，输出端会显示对应的日志：

2024-06-19 06:36:06.060 INFO model.py:936 [bob] -- [Anonymous_job] epoch 8 train times: 3.8041610717773438s
2024-06-19 06:36:06.061 INFO model.py:586 [bob] -- [Anonymous_job] irls calculating partials...
2024-06-19 06:36:09.817 INFO model.py:617 [bob] -- [Anonymous_job] irls updating weights...
2024-06-19 06:36:09.893 INFO model.py:936 [bob] -- [Anonymous_job] epoch 9 train times: 3.832016706466675s
2024-06-19 06:36:09.895 INFO model.py:586 [bob] -- [Anonymous_job] irls calculating partials...
2024-06-19 06:36:13.513 INFO model.py:617 [bob] -- [Anonymous_job] irls updating weights...
2024-06-19 06:36:13.583 INFO model.py:936 [bob] -- [Anonymous_job] epoch 10 train times: 3.688307523727417s

查看特征权重
```
reveal(model.spu_w)
```
特征比较多，这里就不展示对应的输出了。

(4) 模型评估

和明文结果进行对比

tweedie_powers = [1.5, 1.7, 1.8, 1.9, 1.99, 1.999, 1.9999]

scores_ss_glm_pure_premium = score_estimator(
    wrapped_model,
    X_train,
    X_test,
    df_train,
    df_test,
    target="PurePremium",
    weights="Exposure",
    tweedie_powers=tweedie_powers,
)

scores = pd.concat(
    [scores_glm_pure_premium, scores_ss_glm_pure_premium],
    axis=1,
    sort=True,
    keys=("TweedieRegressor", "SSGLMRegressor"),
)
print("Evaluation of the Tweedie Regressor and SS GLM on target PurePremium")
with pd.option_context("display.expand_frame_repr", False):
    print(scores)

在这里插入图片描述

(5) 可视化

绘制洛伦兹曲线，并计算不同模型的Gini指数以评估模型的性能。洛伦兹曲线用于展示累积的索赔金额分布相对于累积的被保险人数量，通过Gini指数可以衡量模型的区分能力。

from sklearn.metrics import auc


def lorenz_curve(y_true, y_pred, exposure):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    exposure = np.asarray(exposure)

    # order samples by increasing predicted risk:
    ranking = np.argsort(y_pred)
    ranked_exposure = exposure[ranking]
    ranked_pure_premium = y_true[ranking]
    cumulated_claim_amount = np.cumsum(ranked_pure_premium * ranked_exposure)
    cumulated_claim_amount /= cumulated_claim_amount[-1]
    cumulated_samples = np.linspace(0, 1, len(cumulated_claim_amount))
    return cumulated_samples, cumulated_claim_amount


fig, ax = plt.subplots(figsize=(8, 8))

y_pred_total_ss_glm = wrapped_model.predict(X_test).reshape((-1,))
y_pred_total = glm_pure_premium.predict(X_test)

for label, y_pred in [
    ("Compound Poisson Gamma", y_pred_total),
    ("Compound Poisson Gamma SS GLM", y_pred_total_ss_glm),
]:
    ordered_samples, cum_claims = lorenz_curve(
        df_test["PurePremium"], y_pred, df_test["Exposure"]
    )
    gini = 1 - 2 * auc(ordered_samples, cum_claims)
    label += " (Gini index: {:.3f})".format(gini)
    ax.plot(ordered_samples, cum_claims, linestyle="-", label=label)

# Oracle model: y_pred == y_test
ordered_samples, cum_claims = lorenz_curve(
    df_test["PurePremium"], df_test["PurePremium"], df_test["Exposure"]
)
gini = 1 - 2 * auc(ordered_samples, cum_claims)
label = "Oracle (Gini index: {:.3f})".format(gini)
ax.plot(ordered_samples, cum_claims, linestyle="-.", color="gray", label=label)

# Random baseline
ax.plot([0, 1], [0, 1], linestyle="--", color="black", label="Random baseline")
ax.set(
    title="Lorenz Curves",
    xlabel="Fraction of policyholders\n(ordered by model from safest to riskiest)",
    ylabel="Fraction of total claim amount",
)
ax.legend(loc="upper left")
plt.plot()