Training and Deploying a PyTorch Image Classification Model with Azure Machine Learning and MLflow

Article link: https://blog.youkuaiyun.com/gitblog_00903/article/details/148548746


MachineLearningNotebooks: Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft. Project repository: https://gitcode.com/gh_mirrors/ma/MachineLearningNotebooks

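Introduction

In modern machine learning workflows, experiment tracking and model deployment are two critical stages. This article walks through using the Azure Machine Learning service and the MLflow framework to train a PyTorch image classification model and then deploy it as a web service. By the end, you will have covered the full path from local development to cloud deployment.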

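Prerequisites

Before you start, make sure the following are in place:

An Azure Machine Learning workspace has been created
The required Python packages are installed:
- PyTorch 1.4 or later
- The azureml-mlflow package (install with pip install azureml-mlflow)
- Other dependencies: image processing libraries such as Pillow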

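Core Concepts

MLflow and Azure Machine Learning Integration

MLflow is an open-source platform for managing the machine learning lifecycle, while Azure Machine Learning provides an enterprise-grade machine learning service. Integrating the two offers the following benefits:

A unified experiment tracking interface
Automated model registration and management
A simplified deployment workflow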

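PyTorch Model Training Essentials

This tutorial uses the classic MNIST handwritten digit dataset to build a convolutional neural network (CNN) classifier. The key training parameters are:

Learning rate: 0.01
Batch size: 64
Epochs: 5
Optimizer: Adam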
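
To get a feel for what these settings imply, the short sketch below works out how many optimizer steps a run performs. It assumes the standard MNIST training split of 60,000 images and that the final, smaller batch of each epoch is kept. The `TrainConfig` class and its field names are illustrative and are not taken from the tutorial's train.py.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the hyperparameters listed above."""
    learning_rate: float = 0.01
    batch_size: int = 64
    epochs: int = 5
    optimizer: str = "Adam"


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    # The last batch may be smaller than batch_size, so round up.
    return math.ceil(num_samples / batch_size)


MNIST_TRAIN_SIZE = 60_000  # size of the standard MNIST training split

cfg = TrainConfig()
per_epoch = steps_per_epoch(MNIST_TRAIN_SIZE, cfg.batch_size)
total_steps = per_epoch * cfg.epochs
print(f"{per_epoch} steps per epoch, {total_steps} steps in total")
```

With a batch size of 64 this comes to 938 steps per epoch and 4,690 steps over 5 epochs, which is a useful upper bound when deciding how often to log metrics.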

Step-by-Step Implementation

1. Initialize the workspace connection

First, connect to the Azure Machine Learning workspace:

import mlflow
import azureml.core
from azureml.core import Workspace

# Check versions
print("SDK version:", azureml.core.VERSION)
print("MLflow version:", mlflow.version.VERSION)

# Connect to the workspace
ws = Workspace.from_config()
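
`Workspace.from_config()` reads the workspace details from a config.json file, which normally holds the keys `subscription_id`, `resource_group`, and `workspace_name`. If the call fails, a quick check of that file can help. The following is a minimal sketch using only the standard library; the helper name `missing_workspace_keys` is hypothetical, and the sample values are placeholders.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_workspace_keys(config_path):
    """Return the required keys that are absent or empty in a workspace config file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstration on a throwaway file with one key deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "config.json"
    sample.write_text(
        json.dumps({"subscription_id": "<subscription-id>",
                    "resource_group": "<resource-group>"}),
        encoding="utf-8",
    )
    missing = missing_workspace_keys(sample)

print("Missing keys:", missing)
```

In practice you would point the helper at your own config.json rather than a temporary file.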

2. Configure MLflow tracking

Point MLflow's tracking URI at the Azure ML workspace:

mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

3. Create an experiment

In Azure ML, an experiment is the basic unit for organizing training runs:

experiment_name = "pytorch-with-mlflow"
mlflow.set_experiment(experiment_name)

4. Local training and metric tracking

The training script train.py contains the full training logic. The key MLflow integration points are:

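# Inside the training script (excerpt: model, loss, and accuracy come from the training loop)
import mlflow
import mlflow.pytorch

with mlflow.start_run():
    # Training loop...
    mlflow.log_metric("train_loss", loss.item())
    mlflow.log_metric("train_acc", accuracy)
    # Log the model as a run artifact so it can be referenced later as runs:/<run_id>/model
    mlflow.pytorch.log_model(model, "model")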

Run the training locally:

import sys, os
sys.path.append(os.path.abspath("scripts"))
import train

run = train.driver()
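
When the tracking server is remote, logging a metric on every batch can generate a lot of requests. `mlflow.log_metric` accepts an optional `step` argument, so one option is to log only every N-th batch and tag each value with its batch index. The sketch below illustrates the idea with a stand-in logger so that it runs without a tracking server. The helper `log_at_interval` and the interval of 100 are assumptions, not part of the tutorial's train.py.

```python
def log_at_interval(losses, log_metric, interval=100, key="train_loss"):
    """Log every `interval`-th loss value, tagged with its batch index as the step."""
    logged_steps = []
    for step, loss in enumerate(losses):
        if step % interval == 0:
            log_metric(key, float(loss), step=step)
            logged_steps.append(step)
    return logged_steps


# Stand-in logger so the sketch runs offline;
# inside a real run you would pass mlflow.log_metric instead.
recorded = []


def fake_log_metric(key, value, step=None):
    recorded.append((key, value, step))


fake_losses = [1.0 / (i + 1) for i in range(250)]  # made-up, decreasing losses
steps = log_at_interval(fake_losses, fake_log_metric, interval=100)
print(steps)
```

Because the logger is passed in as a parameter, the same function can be exercised in a unit test and then used unchanged inside `mlflow.start_run()`.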

5. GPU training in the cloud

To speed up training, we can submit the job to an Azure GPU compute cluster:

from azureml.core import Environment, Experiment, ScriptRunConfig

# Prepare the environment
env = Environment.get(workspace=ws, name="azureml-acpt-pytorch-1.11-cuda11.3").clone("mlflow-env")
env.python.conda_dependencies.add_pip_package("azureml-mlflow")
env.python.conda_dependencies.add_pip_package("Pillow==6.0.0")

# Configure the training job
src = ScriptRunConfig(source_directory="./scripts", script="train.py")
src.run_config.environment = env
# The compute target named "gpu-cluster" must already exist in the workspace
src.run_config.target = "gpu-cluster"

# Submit the job
exp = Experiment(ws, experiment_name)
run = exp.submit(src)
run.wait_for_completion(show_output=True)
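
Once the remote run has finished, it can be handy to pull its status and final metric values into one dictionary. The sketch below relies on the `get_status()` and `get_metrics()` methods of `azureml.core.Run`. It assumes that a metric logged more than once is returned as a list, and it uses a stub object in place of a real run so that it executes offline. The helper `summarize_run` and the stub's values are illustrative only.

```python
def summarize_run(run):
    """Collect the status and the final value of each metric from a run."""
    latest = {}
    for name, value in run.get_metrics().items():
        # A metric logged several times comes back as a list; keep the last entry.
        latest[name] = value[-1] if isinstance(value, list) else value
    return {"status": run.get_status(), "metrics": latest}


class _StubRun:
    """Stand-in for azureml.core.Run with made-up values."""

    def get_status(self):
        return "Completed"

    def get_metrics(self):
        return {"train_loss": [0.9, 0.4, 0.1], "train_acc": 0.97}


summary = summarize_run(_StubRun())
print(summary)
```

With a real workspace you would call `summarize_run(run)` on the object returned by `exp.submit(src)`.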

6. Model deployment

Once training is complete, we can deploy the model as a web service:

import json
from mlflow.deployments import get_deploy_client

# Prepare the deployment configuration
deploy_config = {"computeType": "aci"}  # use Azure Container Instances
with open("deployment_config.json", "w") as f:
    json.dump(deploy_config, f)

# Create the deployment
client = get_deploy_client(mlflow.get_tracking_uri())
client.create_deployment(
    model_uri=f"runs:/{run.id}/model",
    config={"deploy-config-file": "deployment_config.json"},
    name="pytorch-aci-deployment"
)
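
Re-running the deployment step with the same name can fail if a service of that name already exists. A possible way to handle this is to remove the old deployment first. The sketch below uses the generic MLflow deployment client methods `list_deployments`, `delete_deployment`, and `create_deployment`, and a fake client so that it runs without Azure. The `containerResourceRequirements` block is an assumption about the ACI config schema, and both helper names are hypothetical.

```python
import json
import os
import tempfile


def write_aci_config(path, cpu_cores=1, memory_gb=1):
    """Write an ACI deployment config with explicit container resources."""
    config = {
        "computeType": "aci",
        # Assumed schema for sizing the container; verify against the Azure ML docs.
        "containerResourceRequirements": {"cpu": cpu_cores, "memoryInGB": memory_gb},
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f)
    return config


def replace_deployment(client, name, model_uri, config_path):
    """Delete an existing deployment with the same name, then create a fresh one."""
    existing = {d["name"] for d in client.list_deployments()}
    if name in existing:
        client.delete_deployment(name)
    return client.create_deployment(
        name=name, model_uri=model_uri, config={"deploy-config-file": config_path}
    )


class _FakeClient:
    """Stand-in for the client returned by get_deploy_client; records calls."""

    def __init__(self):
        self.calls = []
        self.names = ["pytorch-aci-deployment"]

    def list_deployments(self):
        return [{"name": n} for n in self.names]

    def delete_deployment(self, name):
        self.calls.append(("delete", name))
        self.names.remove(name)

    def create_deployment(self, name, model_uri, config=None):
        self.calls.append(("create", name))
        self.names.append(name)
        return {"name": name}


fake = _FakeClient()
with tempfile.TemporaryDirectory() as tmp:
    cfg_path = os.path.join(tmp, "deployment_config.json")
    cfg = write_aci_config(cfg_path)
    # "<run_id>" is a placeholder for the ID of a finished training run.
    result = replace_deployment(fake, "pytorch-aci-deployment", "runs:/<run_id>/model", cfg_path)

print(fake.calls)
```

Deleting the deployment when you are done with it also avoids paying for an idle container instance.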