Tracking Remote Training Experiments with Azure Machine Learning and MLflow

Original article, published 2025-06-10 09:00:57

Licensed under CC 4.0 BY-SA

MachineLearningNotebooks: Python notebooks with ML and deep learning examples using the Azure Machine Learning Python SDK | Microsoft. Project URL: https://gitcode.com/gh_mirrors/ma/MachineLearningNotebooks

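Overview

Experiment tracking is an essential part of any machine learning project. This article shows how to use the Azure Machine Learning service together with the MLflow tracking API to record the metrics and artifacts of remote training runs. With this combination, data scientists can manage and compare the results of different experiments while using the compute resources of the Azure cloud.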

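Prerequisites

Before starting, make sure the following are in place:

An Azure Machine Learning workspace has been configured
A recent version of the Azure ML Python SDK is installed (the code below uses the azureml-core package)
You have sufficient permissions to create compute resources

Environment Setup

First, we verify and set up the base environment: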

import azureml.core
from azureml.core import Workspace, Experiment

# Check the SDK version
print("SDK version:", azureml.core.VERSION)

# Connect to the workspace
ws = Workspace.from_config()
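Workspace.from_config() reads the workspace coordinates (subscription ID, resource group, workspace name) from a config.json file, which is typically kept in a .azureml folder. As a minimal sketch of what that file contains, the snippet below writes one into a temporary directory. The helper name write_workspace_config and the placeholder values are assumptions made for illustration and are not part of the original notebook.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write <folder>/.azureml/config.json (hypothetical helper, for illustration)."""
    config_dir = Path(folder) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config_path


if __name__ == "__main__":
    # Placeholder values: replace them with your own workspace details.
    with tempfile.TemporaryDirectory() as tmp:
        path = write_workspace_config(
            tmp, "<subscription-id>", "<resource-group>", "<workspace-name>"
        )
        print(path.read_text(encoding="utf-8"))
```

In practice the same file can usually be downloaded from the workspace's overview page instead of being written by hand.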

Creating a Compute Cluster

Remote training requires a compute cluster. Azure Machine Learning offers flexible options for configuring compute resources:

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

cluster_name = "cpu-cluster"

try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute cluster")
except ComputeTargetException:
    print("Creating new compute cluster")

    # Configure the cluster parameters
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_D2_V2",
        min_nodes=0,
        max_nodes=2
    )

    # Create the cluster
    cpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)
    cpu_cluster.wait_for_completion(show_output=True)
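The get-or-create pattern above can be wrapped in a reusable function. The sketch below is one possible arrangement, not the notebook's own code: cluster_settings and get_or_create_cluster are hypothetical helper names, and the idle timeout of 1800 seconds is an assumed value. The Azure ML imports are placed inside the function so that the settings helper can be used without the SDK installed.

```python
def cluster_settings(vm_size="STANDARD_D2_V2", max_nodes=2, idle_seconds=1800):
    """Return keyword arguments for AmlCompute.provisioning_configuration.

    min_nodes is fixed at 0 so the cluster can scale down to zero when idle.
    """
    if max_nodes < 1:
        raise ValueError("max_nodes must be at least 1")
    return {
        "vm_size": vm_size,
        "min_nodes": 0,
        "max_nodes": max_nodes,
        "idle_seconds_before_scaledown": idle_seconds,
    }


def get_or_create_cluster(ws, name, settings):
    """Reuse the named cluster if it exists, otherwise provision it."""
    # Imported lazily: these require the azureml-core package.
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        return ComputeTarget(workspace=ws, name=name)
    except ComputeTargetException:
        config = AmlCompute.provisioning_configuration(**settings)
        cluster = ComputeTarget.create(ws, name, config)
        cluster.wait_for_completion(show_output=True)
        return cluster
```

With a workspace object ws available, a call might look like get_or_create_cluster(ws, "cpu-cluster", cluster_settings()).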

Creating an Experiment

In Azure ML, an experiment is the basic unit for organizing related runs:

experiment_name = "RemoteTrain-with-mlflow-sample"
exp = Experiment(workspace=ws, name=experiment_name)

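Training Script Walkthrough

The example script trains a regression model on a diabetes dataset. The key point is that the script does its logging through the MLflow API: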

# Example contents of train_diabetes.py
"""
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remote compute nodes have no display
import matplotlib.pyplot as plt
import mlflow
import mlflow.sklearn  # makes mlflow.sklearn.log_model available
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Start an MLflow run
with mlflow.start_run():
    # Load the data (assumes diabetes.csv is a headerless, all-numeric CSV
    # in the source directory, with the target in the last column)
    diabetes = np.loadtxt('diabetes.csv', delimiter=',')
    X = diabetes[:, 0:-1]
    y = diabetes[:, -1]

    # Split the dataset
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # Train the model
    alpha = 0.5
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)

    # Evaluate the model
    preds = model.predict(X_test)
    mse = mean_squared_error(y_test, preds)
    mae = mean_absolute_error(y_test, preds)
    r2 = r2_score(y_test, preds)

    # Log the parameter and the metrics
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("mse", mse)
    mlflow.log_metric("mae", mae)
    mlflow.log_metric("r2", r2)

    # Save the model
    mlflow.sklearn.log_model(model, "model")

    # Create and save the plot
    plt.scatter(y_test, preds)
    plt.xlabel("Actual")
    plt.ylabel("Predicted")
    plt.savefig("scatter.png")
    mlflow.log_artifact("scatter.png")
"""

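To make the three logged metrics concrete, here is a small dependency-free sketch of how MSE, MAE, and R² are computed from true and predicted values. The function name regression_metrics is a hypothetical stand-in for the scikit-learn functions the script imports, written out only to show the arithmetic.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE and R^2 for two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n      # mean squared error
    mae = sum(abs(e) for e in errors) / n     # mean absolute error

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    # R^2 is left undefined (NaN) in this sketch when all targets are identical.
    r2 = 1 - ss_res / ss_tot if ss_tot else float("nan")

    return {"mse": mse, "mae": mae, "r2": r2}


if __name__ == "__main__":
    print(regression_metrics([3, -0.5, 2, 7], [2.5, 0.0, 2, 8]))
```

Lower MSE and MAE mean smaller errors, while an R² closer to 1 means the model explains more of the variance in the target.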
Configuring the Run Environment

To make sure the remote compute nodes have every required dependency, we create an environment:

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment(name="mlflow-env")

# Specify the conda and pip dependencies
cd = CondaDependencies.create(
    conda_packages=["scikit-learn", "matplotlib"],
    pip_packages=["azureml-mlflow", "pandas", "numpy"]
)

env.python.conda_dependencies = cd
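The environment above installs whatever package versions are current when the image is built. For more reproducible runs, one option is to pin each package to the version installed locally. The following is a sketch under that assumption: pin_to_local and build_environment are hypothetical helpers, and a locally installed pip version is not guaranteed to be available from conda, so the conda pins in particular may need adjusting.

```python
from importlib import metadata


def pin_to_local(packages, separator="=="):
    """Append the locally installed version to each package name.

    Packages that are not installed locally are returned unpinned.
    Use separator="=" for conda specs and "==" for pip specs.
    """
    pinned = []
    for name in packages:
        try:
            pinned.append(f"{name}{separator}{metadata.version(name)}")
        except metadata.PackageNotFoundError:
            pinned.append(name)
    return pinned


def build_environment(name="mlflow-env"):
    """Build the same environment as above, with locally pinned versions."""
    # Imported lazily: these require the azureml-core package.
    from azureml.core import Environment
    from azureml.core.conda_dependencies import CondaDependencies

    env = Environment(name=name)
    env.python.conda_dependencies = CondaDependencies.create(
        conda_packages=pin_to_local(["scikit-learn", "matplotlib"], separator="="),
        pip_packages=pin_to_local(["azureml-mlflow", "pandas", "numpy"]),
    )
    return env
```

Pinning trades convenience for repeatability: a rebuilt image resolves to the same versions, at the cost of having to bump the pins deliberately.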

Configuring and Submitting the Run

Now we can configure the script run and submit it to the compute cluster:

from azureml.core import ScriptRunConfig

src = ScriptRunConfig(
    source_directory=".",
    script="train_diabetes.py",
    compute_target=cpu_cluster,
    environment=env
)

# Submit the run and wait for it to finish
run = exp.submit(src)
run.wait_for_completion(show_output=True)
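The script above hard-codes alpha. One way to try several values without editing the script is to pass them through the arguments parameter of ScriptRunConfig and read them with argparse inside the training script. This is a sketch of that idea rather than the original notebook's approach; parse_args, build_run_config, and the --alpha flag are names chosen here for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the training script's command-line arguments."""
    parser = argparse.ArgumentParser(description="Train a Ridge model")
    parser.add_argument("--alpha", type=float, default=0.5,
                        help="regularization strength passed to Ridge")
    return parser.parse_args(argv)


def build_run_config(compute_target, environment, alpha):
    """Create a ScriptRunConfig that forwards alpha to the training script."""
    # Imported lazily: requires the azureml-core package.
    from azureml.core import ScriptRunConfig

    return ScriptRunConfig(
        source_directory=".",
        script="train_diabetes.py",
        arguments=["--alpha", str(alpha)],
        compute_target=compute_target,
        environment=environment,
    )
```

Inside train_diabetes.py, the hard-coded value would then be replaced by alpha = parse_args().alpha, and each configuration could be submitted with exp.submit(build_run_config(cpu_cluster, env, alpha)).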

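Viewing the Run Results

Once the run has completed, the results can be viewed in several ways:

In the Azure portal: navigate to the Azure ML workspace to see the run's full metrics and artifacts
In a notebook, by fetching the metrics: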

# Fetch the metrics
metrics = run.get_metrics()
print(metrics)

# Fetch the run details
details = run.get_details()
print(details)
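Once an experiment holds several runs, the same get_metrics() call can be used to compare them. The sketch below picks the run with the best value of a chosen metric. best_run and collect_metrics are hypothetical helpers, and the code assumes each metric was logged once per run so that its value is a single number.

```python
def best_run(metrics_by_run, metric="mse", lower_is_better=True):
    """Return the ID of the run with the best value for the given metric.

    metrics_by_run maps a run ID to that run's metrics dictionary.
    Runs that did not log the metric are ignored.
    """
    candidates = {
        run_id: metrics[metric]
        for run_id, metrics in metrics_by_run.items()
        if metric in metrics
    }
    if not candidates:
        raise ValueError(f"no run logged the metric {metric!r}")
    choose = min if lower_is_better else max
    return choose(candidates, key=candidates.get)


def collect_metrics(experiment):
    """Gather the metrics of every run in an azureml.core.Experiment."""
    return {run.id: run.get_metrics() for run in experiment.get_runs()}
```

With the experiment object from earlier, best_run(collect_metrics(exp)) would return the ID of the run with the lowest MSE, and best_run(collect_metrics(exp), "r2", lower_is_better=False) the one with the highest R².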

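Key Technical Points

MLflow and Azure ML integration: when the script runs on a remote cluster, Azure ML automatically points the MLflow tracking URI at the workspace, so no extra configuration is needed in the script
Automatic log synchronization: everything recorded through the MLflow API is automatically synced to the Azure ML workspace
Environment isolation: the run executes in a Docker-based environment, which keeps dependencies consistent and reproducible
Elastic compute: the cluster scales automatically, which improves resource utilization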
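On a remote cluster the tracking URI is configured automatically, as noted above. When running the same script locally, it has to be set by hand if the results should land in the workspace. The sketch below shows one way to do this; resolve_tracking_uri and use_workspace_tracking are hypothetical helper names. It relies on MLflow reading the MLFLOW_TRACKING_URI environment variable when no URI is set in code; whether a given remote run exposes the workspace URI through that variable is an assumption worth checking in your own environment.

```python
import os


def resolve_tracking_uri(environ=None):
    """Return the tracking URI from the environment, or None if it is unset."""
    environ = os.environ if environ is None else environ
    return environ.get("MLFLOW_TRACKING_URI")


def use_workspace_tracking(ws, experiment_name):
    """Point a local MLflow session at an Azure ML workspace."""
    # Imported lazily: requires the mlflow and azureml-mlflow packages.
    import mlflow

    mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
    mlflow.set_experiment(experiment_name)


if __name__ == "__main__":
    uri = resolve_tracking_uri()
    print("Tracking URI from environment:", uri if uri else "(not set)")
```

Calling use_workspace_tracking(ws, "RemoteTrain-with-mlflow-sample") before mlflow.start_run() in a local session would send the logged metrics and artifacts to the same experiment as the remote runs.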

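Best Practices

Parameterize experiments: log every key parameter with MLflow so that results from different configurations are easy to compare
Manage artifacts: organize saved models and plots sensibly to keep experiments tidy
Monitor resources: keep an eye on compute cluster usage and adjust its size promptly to control cost
Use version control: keep training scripts and configuration under version control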

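Summary

By combining Azure Machine Learning with MLflow, data scientists can:

Manage remote training jobs with little effort
Track experiment metrics and artifacts automatically
Take advantage of elastic cloud compute
Keep experiments reproducible and traceable

This integration provides solid experiment management and tracking for machine learning projects and is an important foundation for building production-grade ML workflows.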

Authorship statement: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.