Experiment Tracking and Hyperparameter Tuning: Organize Your Trials with DVC

Original: towardsdatascience.com/experiment-tracking-hyperparameter-tuning-organize-your-trials-with-dvc-d17f47f38754

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/0897f3259b27271ff836e2d988a80096.png

Image generated with Midjourney

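In the previous parts of this series, I explained the benefits of tracking machine learning experiments and showed how easily it can be done with DVC. One aspect we have not yet explored in depth is hyperparameter tuning (HPT).

Some of our experiments involve changing the dataset or the codebase, adding or removing features, or fixing an odd bug. The number of such experiments usually stays manageable, because each one requires us to write code or carry out some analysis by hand.

Hyperparameter tuning, however, can easily get out of hand. In the previous parts, I showed that with the suggested setup we can control the model's hyperparameters through the params.yaml file, and that DVC lets us track experiments by versioning that file. Still, that approach requires us to change the hyperparameters manually, based on our expertise or intuition. If we adopt a procedure such as grid search, we might fit and evaluate the model thousands of times, each time with a different set of hyperparameters, all with just a few lines of code.

That is why I want to show you how to apply experiment-tracking best practices to the experiments that are part of an HPT routine.

If you would like a refresher on DVC, I strongly recommend reading the previous parts of the series, as we will not cover all the setup details here.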

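Setup

In this example, we will work on a sample classification problem using the *Default of Credit Card Clients* dataset. It contains information on default payments of credit card clients in Taiwan, together with features covering demographic factors, credit data, payment history, and bill statements.

As this is a very popular dataset, we will skip the exploratory analysis and focus on building and tuning a machine learning model.

As in the previous parts, we will use DVC and its VS Code extension. To keep things simple, we will follow the steps from the minimalistic approach to experiment tracking with DVC. We will therefore focus on setting up the experiments and tracking them. That said, I strongly recommend replicating all the steps taken in the earlier parts of the series, including setting up data versioning.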

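Training script

Our starting point is a training script. In it, we do the following:

We load the data.
We load the parameters from the params.yaml file.
We split the dataset into training and test sets.
We train a random forest classifier with the selected parameters (for simplicity, we only consider three: class_weight, max_depth, and n_estimators).
We store the trained model in the models directory.
We track metrics such as accuracy, precision, and recall. We also store the confusion matrix and two plots: the ROC curve and the precision-recall curve.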
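To make the tracked metrics concrete, here is a minimal sketch that computes accuracy, precision, and recall by hand from the four cells of a confusion matrix. The labels are made-up toy values, not data from the credit card dataset, and the real script relies on scikit-learn's scoring functions rather than on this arithmetic.

```python
# Toy labels (hypothetical): 1 = default, 0 = no default
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))

# The four cells of the confusion matrix
tp = sum(t == 1 and p == 1 for t, p in pairs)  # defaults we caught
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-defaults we left alone
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false alarms
fn = sum(t == 1 and p == 0 for t, p in pairs)  # defaults we missed

metrics = {
    "accuracy": round((tp + tn) / len(pairs), 4),
    "precision": round(tp / (tp + fp), 4),
    "recall": round(tp / (tp + fn), 4),
}
print(metrics)
```

In this toy case accuracy is 0.7 while recall is only 0.5, which illustrates why looking at accuracy alone can hide how many defaults a model misses.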

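You can find the full training script below. As you can see, it is quite straightforward and follows essentially the same approach as in the previous parts of the series.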

import json
from pathlib import Path

import pandas as pd
from dvc.api import params_show
from joblib import dump
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

from dvclive import Live
from src.constants import DATA_RAW_DIR, MODELS_DIR, TARGET

# set the params
train_params = params_show()["train"]["params"]

# load data
X = pd.read_csv(f"{DATA_RAW_DIR}/UCI_Credit_Card.csv", index_col="ID")
y = X.pop(TARGET)

# train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# fit-predict
model = RandomForestClassifier(random_state=42, **train_params)
model.fit(X_train, y_train)

# store the trained model
model_dir = Path(MODELS_DIR)
model_dir.mkdir(exist_ok=True)

dump(model, f"{MODELS_DIR}/model.joblib")

# get predictions
y_pred = model.predict(X_test)
y_pred_prob = model.predict_proba(X_test)[:, 1]

# tracking the metrics
with Live(save_dvc_exp=True) as live:

    live.log_sklearn_plot("confusion_matrix", y_test, y_pred)
    live.log_sklearn_plot("roc", y_test, y_pred_prob)
    live.log_sklearn_plot("precision_recall", y_test, y_pred_prob)

    metrics = {
        "accuracy": round(accuracy_score(y_test, y_pred), 4),
        "recall": round(recall_score(y_test, y_pred), 4),
        "precision": round(precision_score(y_test, y_pred), 4),
    }

    # write the metrics to a JSON file; the context manager closes the file handle
    with open("metrics.json", "w") as f:
        json.dump(obj=metrics, fp=f, indent=4, sort_keys=True)

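HPT basics with DVC

As I mentioned in the previous parts, running experiments manually with DVC is very simple. Our params.yaml file contains the configuration used by our RF model.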

train:
  params:
    class_weight: balanced
    max_depth: 5
    n_estimators: 10
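The nested keys of this file are what the training script reads with params_show()["train"]["params"], and each value can also be addressed by a dotted path such as train.params.n_estimators. The sketch below mirrors the file as a plain Python dict and shows what overriding one value by its dotted path amounts to. The helper with_override is hypothetical and only illustrates the idea; it is not how DVC implements parameter overrides.

```python
import copy

# The contents of params.yaml, mirrored as a nested dict
params = {
    "train": {
        "params": {"class_weight": "balanced", "max_depth": 5, "n_estimators": 10}
    }
}


def with_override(params, dotted_key, value):
    """Return a copy of params with the value at dotted_key replaced (hypothetical helper)."""
    updated = copy.deepcopy(params)
    *parents, leaf = dotted_key.split(".")
    node = updated
    for key in parents:
        node = node[key]  # walk down one nesting level per path segment
    node[leaf] = value
    return updated


new_params = with_override(params, "train.params.n_estimators", 100)

# This inner dict is what the script unpacks into RandomForestClassifier(**train_params)
train_params = new_params["train"]["params"]
print(train_params)
```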

To run an experiment that changes one of these values, we can run the following command in the terminal:

dvc exp run --set-param train.params.n_estimators=100
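When several values change at once, the command takes one --set-param flag per value. As a small sketch, the hypothetical helper build_exp_run_command below assembles such a command from a dict of overrides. It only builds the argument list and does not call DVC.

```python
def build_exp_run_command(overrides):
    """Build the `dvc exp run` argument list for a dict of overrides (hypothetical helper)."""
    command = ["dvc", "exp", "run"]
    for name, value in overrides.items():
        # one --set-param flag per overridden parameter
        command += ["--set-param", f"{name}={value}"]
    return command


command = build_exp_run_command(
    {"train.params.n_estimators": 100, "train.params.max_depth": 10}
)
print(" ".join(command))
```

The resulting list could be passed to subprocess.run(command, check=True) from the root of the DVC repository to launch the experiment.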

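Alternatively, we can use the VS Code extension and do the same by following the prompts in the GUI.

Unfortunately, that approach does not scale well. We will now explore alternatives that let us cover a wide range of hyperparameter values.

Simple grid search

As our first approach, we will run a simple (exhaustive) grid search. That is, we define a grid of possible hyperparameter values and run the training script with every possible combination. To keep things simple, we test 2 values each of n_estimators and max_depth.

In the script below, we execute the command mentioned above in a more Pythonic way. First, we instantiate a Repo object. Then, we use it to run the experiments. As you can see, we specify the hyperparameters in the same way as in the CLI.

To sum up, we call the run method multiple times, each time with a different set of hyperparameters.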

import itertools
from dvc.repo import Repo

repo = Repo(".")

# hp grid
n_estimators_grid = [10, 20]
max_depth_grid = [5, 10]

for n_est, max_depth in itertools.product(n_estimators_grid, max_depth_grid):
    repo.experiments.run(
        queue=True,
        params=[
            f"train.params.n_estimators={n_est}",
            f"train.params.max_depth={max_depth}",
        ],
    )
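To see exactly which combinations the loop above schedules, the following sketch enumerates the same grid without touching DVC. Storing the grid in a dict is my own variation on the script, which makes it easy to add a third hyperparameter later.

```python
import itertools

# Same values as in the script above, keyed by hyperparameter name
grid = {"n_estimators": [10, 20], "max_depth": [5, 10]}

names = list(grid)
combinations = [
    dict(zip(names, values)) for values in itertools.product(*grid.values())
]

# One list of "key=value" overrides per experiment
overrides = [
    [f"train.params.{name}={value}" for name, value in combo.items()]
    for combo in combinations
]

for params in overrides:
    print(params)
```

Each printed list has the same shape as the params argument passed to repo.experiments.run in the script above: 2 x 2 values give 4 experiments.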

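As you may have noticed, the script passes the queue flag. With it, we do not run the experiments right away; we schedule them in a queue. This is handy when the experiments may take a long time to run and we want to double-check their setup before launching them. When we navigate to the Experiments tab of the VS Code extension, we can see the scheduled experiments (marked with a clock icon next to their names), together with the selected hyperparameters.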

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/ce15bfba591ddd8b0d2110ba2968b481.png

Once we are happy with the setup, we can start the queue with the following command:

dvc queue start

After the experiments have run, the performance metrics are populated in the table:

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/601ab1954f0d5ee018b03e20c0f3eab1.png

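Random grid search

Instead of exhaustively searching through all possible hyperparameter combinations, we sample a few random combinations from the parameter space. This approach is particularly useful for large hyperparameter spaces, as it can significantly reduce the computational cost while still finding well-performing configurations.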
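A quick back-of-the-envelope sketch shows the size of the saving. It takes the search space used in this section (n_estimators between 10 and 100, and four candidate values of max_depth) and compares the number of fits an exhaustive grid would need with a budget of 5 random trials.

```python
import math

search_space = {
    "n_estimators": range(10, 101),  # integers 10..100, both ends included
    "max_depth": [5, 10, 15, 20],
}

# An exhaustive grid fits the model once per combination
grid_size = math.prod(len(values) for values in search_space.values())

n_random_trials = 5
fraction = n_random_trials / grid_size

print(f"exhaustive grid: {grid_size} fits")
print(f"random search:   {n_random_trials} fits ({fraction:.1%} of the grid)")
```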

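We run the random search in the following script. As you can see, for n_estimators we sample random integers between 10 and 100. For max_depth, we pick one of 4 predefined values. We do this purely for illustration, to show that hyperparameters can be chosen in more than one way.

Again, we schedule the experiments in the queue and then start them manually.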

import random
from dvc.repo import Repo

repo = Repo(".")

random.seed(0)
N_EXPERIMENTS = 5

for _ in range(N_EXPERIMENTS):

    n_est = random.randint(10, 100)
    max_depth = random.choice([5, 10, 15, 20])

    repo.experiments.run(
        queue=True,
        params=[
            f"train.params.n_estimators={n_est}",
            f"train.params.max_depth={max_depth}",
        ],
    )
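Because the script seeds the random number generator, rerunning it schedules the same configurations. The sketch below isolates the sampling step to show this. Using a local random.Random instance instead of the global random.seed call is my own variation, and sample_configs is a hypothetical helper name.

```python
import random


def sample_configs(n_experiments, seed):
    """Sample hyperparameter configurations reproducibly (hypothetical helper)."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    configs = []
    for _ in range(n_experiments):
        configs.append(
            {
                "n_estimators": rng.randint(10, 100),  # both endpoints are inclusive
                "max_depth": rng.choice([5, 10, 15, 20]),
            }
        )
    return configs


first = sample_configs(5, seed=0)
second = sample_configs(5, seed=0)

print(first == second)  # the same seed yields the same configurations
```

Note that random sampling can draw the same combination more than once, so with larger budgets it may be worth dropping duplicates before queuing the experiments.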

Below you can see the results of the random search.

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/c89127b1bda014710af27623657af7f1.png

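Comparing more than just scores

So far, we have used two different approaches to hyperparameter tuning. Using the Experiments tab, we can compare the performance metrics of all the explored combinations. However, because our training script also tracks a few plots, we can easily dig deeper to get a more complete picture.

To do so, let's pick the best-performing model (in terms of recall) from each of the two search approaches. We then mark them using the little plot icon to the left of their names. By doing so, we indicate which experiments we want to investigate.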

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/f20a7983611bea2586144aab807277d3.png

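Having done that, we navigate to the Plots tab. There, we can inspect the interactive plots we decided to track. This way we can, for example, get a deeper understanding of precision and recall by analyzing the curves rather than just looking at two sets of numbers.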

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/5f9206b1f4260a708bd427121390553c.png

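Naturally, we can compare several experiments at once!

Before moving on to the next part, it is also worth mentioning that the VS Code extension makes it easy to create custom plots from the metrics stored while running the experiments. To do so, we scroll to the bottom of the Plots tab, where we find the following panel: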

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/ca6364d85b90e91b01620f6d1dfbbad3.png

After clicking the "Add Plot" button, a pop-up window guides us through creating a custom plot.

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/b8c71c448e7e6386373ad7bae150ca57.png

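Let's select the "Custom" option and, for the plot's contents, choose the max_depth hyperparameter and the tracked recall score.

Having made these selections, we see the following plot, which shows all the experiments we have run recently.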

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/6da3379f50aae548150d017425facf0d.png

Custom plots are another interactive way to explore the contents of the table in the Experiments tab.

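Advanced HPT with Optuna

You might have been thinking: "Nice, but there are more advanced HPT methods, such as Bayesian optimization. Can we do that?" The answer is: yes, we can! DVC integrates with Optuna, one of the most popular Python libraries for HPT.

To use Optuna for HPT, we have to modify our training script. The first difference is that this time we use three sets: training, validation, and test. The training and validation sets are used for HPT. After finding the best combination of hyperparameters, we train the model once more on the combined training + validation set and make predictions on the test set. Strictly speaking this is not required, but it is one way to check whether we overfitted during the tuning phase.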
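The script obtains the three sets with two consecutive 20% splits, so the validation set is 20% of the remaining 80%, that is, 16% of the data. The sketch below works through the arithmetic, assuming the dataset has 30,000 rows (the size of the UCI version). The helper split_sizes is hypothetical, and scikit-learn's own rounding may differ by a row in other cases.

```python
def split_sizes(n_rows, test_pct=20, valid_pct=20):
    """Approximate sizes after two consecutive hold-out splits (hypothetical helper)."""
    n_test = n_rows * test_pct // 100    # first split: hold out the test set
    n_temp = n_rows - n_test             # rows left for training + validation
    n_valid = n_temp * valid_pct // 100  # second split: hold out the validation set
    n_train = n_temp - n_valid
    return n_train, n_valid, n_test


n_rows = 30_000  # assumption: row count of the UCI credit card dataset
n_train, n_valid, n_test = split_sizes(n_rows)

for name, size in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {size} rows ({size / n_rows:.0%})")
```

The final model is then refit on the training and validation rows together, which is 80% of the data.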

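To adapt the code to Optuna, we first create an objective function that trains the RF model with the suggested hyperparameters and returns the recall score on the validation set. Then, we create an Optuna study and indicate that we want to maximize the objective (in this case, recall).

As I have already mentioned, DVC integrates with Optuna. To track all the trials of the HPT routine, we only have to pass the DVCLive callback (DVCLiveCallback) to Optuna's optimize method. It is that simple!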

import json
from pathlib import Path

import optuna
import pandas as pd
import yaml
from dvclive.optuna import DVCLiveCallback
from joblib import dump
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

from dvclive import Live
from src.constants import DATA_RAW_DIR, MODELS_DIR, TARGET

# define the objective function for Optuna
def objective(trial):
    # search space
    n_estimators = trial.suggest_int("n_estimators", 10, 100)
    max_depth = trial.suggest_int("max_depth", 2, 32)
    class_weight = trial.suggest_categorical(
        "class_weight", [None, "balanced", "balanced_subsample"]
    )

    # define and train the RF model with the suggested parameters
    clf = RandomForestClassifier(
        n_estimators=n_estimators, max_depth=max_depth, class_weight=class_weight
    )
    clf.fit(X_train, y_train)

    # Calculate recall on the validation set
    y_pred = clf.predict(X_valid)
    recall = recall_score(y_valid, y_pred)
    return recall

# load data
X = pd.read_csv(f"{DATA_RAW_DIR}/UCI_Credit_Card.csv", index_col="ID")
y = X.pop(TARGET)

# train-valid-test split
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.2, random_state=42, stratify=y_temp
)

# Create the Optuna study and optimize the objective function
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10, callbacks=[DVCLiveCallback()])

# Get the best parameters
best_params = study.best_params
print("Best Parameters:", best_params)

best_params_dict = {"train": {"params": best_params}}

# save the best parameters
with open("params.yaml", "w") as file:
    yaml.dump(best_params_dict, file, default_flow_style=False)

# Train the RandomForestClassifier with the best parameters
best_clf = RandomForestClassifier(**best_params)
best_clf.fit(X_temp, y_temp)

# store the trained model
model_dir = Path(MODELS_DIR)
model_dir.mkdir(exist_ok=True)

dump(best_clf, f"{MODELS_DIR}/model.joblib")

y_pred = best_clf.predict(X_test)
y_pred_prob = best_clf.predict_proba(X_test)[:, 1]

with Live(save_dvc_exp=True) as live:

    live.log_sklearn_plot("confusion_matrix", y_test, y_pred)
    live.log_sklearn_plot("roc", y_test, y_pred_prob)
    live.log_sklearn_plot("precision_recall", y_test, y_pred_prob)

    metrics = {
        "accuracy": round(accuracy_score(y_test, y_pred), 4),
        "recall": round(recall_score(y_test, y_pred), 4),
        "precision": round(precision_score(y_test, y_pred), 4),
    }

    # write the metrics to a JSON file; the context manager closes the file handle
    with open("metrics.json", "w") as f:
        json.dump(obj=metrics, fp=f, indent=4, sort_keys=True)

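After running the modified script, we see output in the terminal similar to the following (because the search is random, your values may differ):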

[I 2024-03-02 23:05:14,954] A new study created in memory with name: no-name-9d71dba3-4f31-48b4-ac7f-ecd9e556138a
[I 2024-03-02 23:05:16,458] Trial 0 finished with value: 0.4726930320150659 and parameters: {'n_estimators': 56, 'max_depth': 15, 'class_weight': 'balanced'}. Best is trial 0 with value: 0.4726930320150659.
[I 2024-03-02 23:05:18,885] Trial 1 finished with value: 0.3455743879472693 and parameters: {'n_estimators': 42, 'max_depth': 32, 'class_weight': 'balanced'}. Best is trial 0 with value: 0.4726930320150659.
[I 2024-03-02 23:05:22,870] Trial 2 finished with value: 0.3747645951035782 and parameters: {'n_estimators': 93, 'max_depth': 29, 'class_weight': None}. Best is trial 0 with value: 0.4726930320150659.
[I 2024-03-02 23:05:26,376] Trial 3 finished with value: 0.3559322033898305 and parameters: {'n_estimators': 81, 'max_depth': 32, 'class_weight': 'balanced_subsample'}. Best is trial 0 with value: 0.4726930320150659.
[I 2024-03-02 23:05:27,379] Trial 4 finished with value: 0.3267419962335217 and parameters: {'n_estimators': 10, 'max_depth': 28, 'class_weight': 'balanced_subsample'}. Best is trial 0 with value: 0.4726930320150659.
[I 2024-03-02 23:05:30,249] Trial 5 finished with value: 0.3662900188323917 and parameters: {'n_estimators': 64, 'max_depth': 29, 'class_weight': None}. Best is trial 0 with value: 0.4726930320150659.
[I 2024-03-02 23:05:33,264] Trial 6 finished with value: 0.4048964218455744 and parameters: {'n_estimators': 70, 'max_depth': 22, 'class_weight': 'balanced_subsample'}. Best is trial 0 with value: 0.4726930320150659.
[I 2024-03-02 23:05:35,321] Trial 7 finished with value: 0.3615819209039548 and parameters: {'n_estimators': 45, 'max_depth': 30, 'class_weight': 'balanced'}. Best is trial 0 with value: 0.4726930320150659.
[I 2024-03-02 23:05:38,073] Trial 8 finished with value: 0.5028248587570622 and parameters: {'n_estimators': 89, 'max_depth': 13, 'class_weight': 'balanced'}. Best is trial 8 with value: 0.5028248587570622.
[I 2024-03-02 23:05:38,920] Trial 9 finished with value: 0.4745762711864407 and parameters: {'n_estimators': 11, 'max_depth': 16, 'class_weight': 'balanced'}. Best is trial 8 with value: 0.5028248587570622.
Best Parameters: {'n_estimators': 89, 'max_depth': 13, 'class_weight': 'balanced'}

Once we navigate to the Experiments tab, the table looks a bit more complicated than before.

https://github.com/OpenDocCN/towardsdatascience-blog-zh-2024/raw/master/docs/img/5d00d15a01fc9d95ed269c6f724e34b6.png

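That is because we are now tracking two sets of parameters and metrics:

The contents of the params.yaml file and the metrics produced by running the script.
The hyperparameters from the Optuna trials and the resulting values of the objective function.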

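Let's look at the table and analyze what we see. There are 11 experiments in total. The first 10 (counting from the bottom) are the 10 trials of Optuna's HPT routine. As you can see, the first set of hyperparameters and metric values does not change across them; only the Optuna-related values vary.

From the terminal output above, we know that the best trial was the 9th one, which Optuna labels Trial 8 (remember that numbering starts at 0!). Hence, the best trial is pawky-dabs.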
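The zero-based numbering is easy to double-check. The short sketch below copies the ten validation recall values from the terminal output above and finds the best one. It does not use Optuna, which exposes the same information through the study object.

```python
# Validation recall of Trials 0..9, copied from the terminal output above
trial_values = [
    0.4726930320150659,
    0.3455743879472693,
    0.3747645951035782,
    0.3559322033898305,
    0.3267419962335217,
    0.3662900188323917,
    0.4048964218455744,
    0.3615819209039548,
    0.5028248587570622,
    0.4745762711864407,
]

# Index of the largest value: this is the zero-based trial number
best_index = max(range(len(trial_values)), key=trial_values.__getitem__)
best_position = best_index + 1  # one-based position in the run order

print(f"Trial {best_index} is the best, i.e. trial number {best_position} in run order")
```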

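Then, the pagan-loup experiment is the result of taking the best set of hyperparameters from Optuna and storing it in params.yaml. We then train the model again on the training and validation sets and score it on the test set, which was not seen during HPT. That is also why the recall of the best trial (pawky-dabs) does not match the recall of the pagan-loup experiment.

As before, we can carry out a deeper analysis by exploring the plots for this particular set of trials.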

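Wrapping up

In this article, we explored three approaches to automating hyperparameter tuning. We started with exhaustive and random grid searches, carried out by iteratively executing the dvc exp run command. We then used the Optuna library to run a Bayesian search. Thanks to DVCLiveCallback, we could track all the trials of the HPT routine with very little effort. Combined with the knowledge from the previous parts of the series, we can now make sure that all of our experiments, including the search for the best set of hyperparameters, are fully reproducible.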

On a side note, I recently gave a talk at PyData Global about experiment tracking with DVC. If you are interested, you can watch it here:

https://www.youtube.com/watch?v=zz0Y2XKzWQ8

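You can find the code used in this article in this repository. As always, any constructive feedback is more than welcome. You can reach out to me on LinkedIn, Twitter, or in the comments.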
