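terraform-provider-azurerm Machine Learning Workspace: End-to-End ML Lifecycle
Are you looking for an efficient way to manage Azure Machine Learning workflows? Would you like to deploy and maintain ML resources as code? This article walks through using terraform-provider-azurerm to build out a machine learning lifecycle, from environment setup to model deployment, with the infrastructure managed through automation so that your ML projects become more efficient and more reliable.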
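What is terraform-provider-azurerm
terraform-provider-azurerm is the Terraform provider for Azure Resource Manager (ARM). It lets you define and manage Azure cloud resources in declarative code. With Terraform's Infrastructure as Code (IaC) approach you can version, automate, and reuse your cloud configuration, which improves efficiency and reduces human error.
Core advantages
- Declarative configuration: describe the target state and Terraform works out and performs the operations needed to reach it
- State management: a state file records the resources Terraform manages so that later runs can detect changes
- Modular design: configuration can be organized into modules for reuse and sharing
- Parallel deployment: resource dependencies are resolved automatically and independent resources are deployed in parallel
- Multi-cloud support: the provider works alongside Terraform providers for other clouds, which eases multi-cloud management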
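Environment preparation
Before using terraform-provider-azurerm to manage a Machine Learning workspace, prepare the following environment:
Required tools
- Terraform CLI (version 0.13 or later)
- Azure CLI
- Git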
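The minimum CLI version listed above can also be enforced from the configuration itself. This is a minimal sketch that assumes the same 0.13 floor; the constraint string is the only thing added here.

```hcl
terraform {
  # Refuse to run with a Terraform CLI older than the minimum listed above
  required_version = ">= 0.13"
}
```

Terraform rejects a plan or apply when the running CLI does not satisfy the constraint, so a mismatch surfaces before any resource is touched.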
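Installation steps
- Install the Terraform CLI: download and install the build for your operating system from the official Terraform website.
- Install the Azure CLI: follow the official Azure CLI documentation.
- Configure Azure authentication:
  az login
- Clone the project repository (optional; terraform init downloads the provider itself, so the clone is only needed to browse the provider source and its examples):
  git clone https://gitcode.com/GitHub_Trending/te/terraform-provider-azurerm
  cd terraform-provider-azurerm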
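Machine Learning workspace architecture
The Azure Machine Learning workspace is the core resource of Azure Machine Learning. It ties together the ML resources and tools and provides end-to-end management of the machine learning lifecycle.
Core components
A complete Azure Machine Learning workspace environment usually contains the following components:
- Machine Learning workspace: the core resource and the organizing hub for everything else
- Compute resources: compute instances and clusters for model training and inference
- Storage account: stores datasets, models, and other files
- Key Vault: securely stores access keys, passwords, and other sensitive information
- Application Insights: monitors and diagnoses ML workflows
- Container Registry: stores the Docker images needed for model deployment
Architecture diagram
[Figure: architecture diagram of the workspace and its dependent resources; image not available]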
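Creating a Machine Learning workspace
Creating an Azure Machine Learning workspace with terraform-provider-azurerm takes only a handful of resource definitions. The example below shows how to define and deploy an ML workspace environment.
Basic configuration
First, create a new Terraform configuration file named main.tf and add the following:
# Configure the Azure provider
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

# Create the resource group
resource "azurerm_resource_group" "ml_rg" {
  name     = "ml-resource-group"
  location = "eastus"
}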
Full workspace definition
Next, add the definitions for the Machine Learning workspace and the resources it depends on:
# Look up the current client configuration (tenant and object IDs)
data "azurerm_client_config" "current" {}

# Create the storage account (the name must be globally unique)
resource "azurerm_storage_account" "ml_storage" {
  name                     = "mlstorageaccount"
  resource_group_name      = azurerm_resource_group.ml_rg.name
  location                 = azurerm_resource_group.ml_rg.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
  account_kind             = "StorageV2"
}

# Create the Key Vault (the name must be globally unique)
resource "azurerm_key_vault" "ml_keyvault" {
  name                       = "ml-keyvault"
  location                   = azurerm_resource_group.ml_rg.location
  resource_group_name        = azurerm_resource_group.ml_rg.name
  tenant_id                  = data.azurerm_client_config.current.tenant_id
  soft_delete_retention_days = 7
  purge_protection_enabled   = false
  sku_name                   = "standard"

  # Grant the identity running Terraform access to keys and secrets
  access_policy {
    tenant_id = data.azurerm_client_config.current.tenant_id
    object_id = data.azurerm_client_config.current.object_id

    key_permissions = [
      "Get", "List", "Create", "Delete", "Recover", "Backup", "Restore"
    ]

    secret_permissions = [
      "Get", "List", "Set", "Delete", "Recover", "Backup", "Restore"
    ]
  }
}

# Create Application Insights
resource "azurerm_application_insights" "ml_insights" {
  name                = "ml-application-insights"
  location            = azurerm_resource_group.ml_rg.location
  resource_group_name = azurerm_resource_group.ml_rg.name
  application_type    = "web"
}

# Create the Container Registry (the name must be globally unique)
resource "azurerm_container_registry" "ml_acr" {
  name                = "mlcontainerregistry"
  location            = azurerm_resource_group.ml_rg.location
  resource_group_name = azurerm_resource_group.ml_rg.name
  sku                 = "Basic"
  # The provider documentation asks for the admin user to be enabled
  # when a registry is attached to an ML workspace
  admin_enabled       = true
}

# Create the Machine Learning workspace
resource "azurerm_machine_learning_workspace" "ml_workspace" {
  name                    = "ml-workspace"
  location                = azurerm_resource_group.ml_rg.location
  resource_group_name     = azurerm_resource_group.ml_rg.name
  application_insights_id = azurerm_application_insights.ml_insights.id
  container_registry_id   = azurerm_container_registry.ml_acr.id
  key_vault_id            = azurerm_key_vault.ml_keyvault.id
  storage_account_id      = azurerm_storage_account.ml_storage.id
  sku_name                = "Basic"

  identity {
    type = "SystemAssigned"
  }
}
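Storage account, Key Vault, and Container Registry names share a global namespace, so the fixed names used above may already be taken. One possible way around this is a random suffix; the sketch below assumes the hashicorp/random provider (which terraform init resolves implicitly), and the resource name `suffix` and the local names are hypothetical.

```hcl
# Hypothetical helper: a short lowercase suffix to keep names globally unique
resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

locals {
  # Lowercase letters and digits only, within the storage account length limit
  storage_account_name    = "mlstorage${random_string.suffix.result}"
  key_vault_name          = "ml-kv-${random_string.suffix.result}"
  container_registry_name = "mlacr${random_string.suffix.result}"
}
```

The resources above would then use `local.storage_account_name` and the other locals in place of the hard-coded `name` values.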
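Deploying the workspace
Once the configuration is complete, run the following commands to deploy the ML workspace:
# Initialize Terraform (downloads the azurerm provider)
terraform init
# Preview the deployment plan
terraform plan
# Apply the deployment
terraform apply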
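By default these commands keep the state file on the local disk. For team use, the state can instead live in an Azure Storage container through the `azurerm` backend. This is a sketch only: the resource group, storage account, and container named here are hypothetical and must already exist before `terraform init` runs.

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"           # hypothetical, pre-existing
    storage_account_name = "tfstateexample"       # hypothetical, pre-existing
    container_name       = "tfstate"              # hypothetical, pre-existing
    key                  = "ml-workspace.tfstate" # blob name for this project's state
  }
}
```

After adding or changing a backend block, `terraform init` has to be run again so that Terraform can configure it.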
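Configuring compute resources
ML workflows need compute for model training and inference. The sections below show how to configure the common compute types.
Compute instance
A compute instance is a fully managed cloud workstation, suited to interactive development and testing:
resource "azurerm_machine_learning_compute_instance" "ml_compute_instance" {
  name                          = "ml-compute-instance"
  # location is expected by the 3.x provider series pinned earlier
  location                      = azurerm_resource_group.ml_rg.location
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  # The documented argument is virtual_machine_size, not vm_size
  virtual_machine_size          = "STANDARD_DS3_V2"

  identity {
    type = "SystemAssigned"
  }

  # The original example also declared the block below. storage_mount is not
  # part of this resource's documented schema, so it is kept only for reference.
  # storage_mount {
  #   source      = azurerm_storage_account.ml_storage.id
  #   mount_point = "/mnt/azureml"
  #   share_name  = "code"
  # }
}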
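Compute cluster
A compute cluster is scalable compute, suited to large training jobs:
resource "azurerm_machine_learning_compute_cluster" "ml_compute_cluster" {
  name                          = "ml-compute-cluster"
  location                      = azurerm_resource_group.ml_rg.location
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  vm_size                       = "STANDARD_DS3_V2"
  vm_priority                   = "Dedicated"

  # Autoscaling is configured through a scale_settings block
  scale_settings {
    min_node_count                       = 0
    max_node_count                       = 4
    scale_down_nodes_after_idle_duration = "PT2M" # ISO 8601 duration: 2 minutes
  }

  identity {
    type = "SystemAssigned"
  }
}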
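The cluster above gets a system-assigned identity, but an identity on its own grants no access to data. A minimal sketch of one way to let training jobs read blobs from the workspace storage account follows; the choice of the built-in "Storage Blob Data Reader" role and the resource name `cluster_blob_reader` are assumptions made for illustration.

```hcl
# Allow the cluster's managed identity to read blobs in the workspace storage account
resource "azurerm_role_assignment" "cluster_blob_reader" {
  scope                = azurerm_storage_account.ml_storage.id
  role_definition_name = "Storage Blob Data Reader"
  principal_id         = azurerm_machine_learning_compute_cluster.ml_compute_cluster.identity[0].principal_id
}
```

Scoping the assignment to a single container rather than the whole account would be tighter if only one container holds training data.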
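Data management
Efficient data management is key to a successful ML project. This section shows how ML data resources could be described in Terraform. Note that the two resource types below, `azurerm_machine_learning_dataset` and `azurerm_machine_learning_datastore`, are reproduced from the original example and could not be confirmed against the azurerm provider documentation; treat the blocks as a description of intent and check the registry for the datastore resources available in your provider version.
Dataset
A dataset is the core data abstraction in Azure ML, used to manage training and evaluation data:
# Resource type as given in the original example (unconfirmed)
resource "azurerm_machine_learning_dataset" "ml_dataset" {
  name                          = "ml-dataset"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  dataset_type                  = "Tabular"

  tabular_dataset {
    path {
      datastore_id = azurerm_machine_learning_datastore.ml_datastore.id
      path         = "data/training_data.csv"
    }
  }
}

# Resource type as given in the original example (unconfirmed)
resource "azurerm_machine_learning_datastore" "ml_datastore" {
  name                          = "ml-datastore"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  type                          = "AzureBlob"

  azure_blob_datastore {
    storage_account_name = azurerm_storage_account.ml_storage.name
    container_name       = "data"

    credentials {
      account_key = azurerm_storage_account.ml_storage.primary_access_key
    }
  }
}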
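For comparison, the azurerm provider documents a blob-storage datastore resource named `azurerm_machine_learning_datastore_blobstorage`. The sketch below registers a `data` container as a datastore under that resource; the argument names follow my reading of the provider documentation and should be checked against the version you pin, and the resource names `ml_data` and `ml_blob_datastore` are hypothetical.

```hcl
# The blob container that holds the training data
resource "azurerm_storage_container" "ml_data" {
  name                  = "data"
  storage_account_name  = azurerm_storage_account.ml_storage.name
  container_access_type = "private"
}

# Register the container as a datastore in the workspace
resource "azurerm_machine_learning_datastore_blobstorage" "ml_blob_datastore" {
  name                 = "ml_datastore"
  workspace_id         = azurerm_machine_learning_workspace.ml_workspace.id
  storage_container_id = azurerm_storage_container.ml_data.resource_manager_id
  account_key          = azurerm_storage_account.ml_storage.primary_access_key
}
```

Passing the account key places it in Terraform state, so the state file should be treated as sensitive.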
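Model training and deployment
With the workspace and compute in place, the next step is to describe the training and deployment flow. The `azurerm_machine_learning_environment` and `azurerm_machine_learning_job` resource types below are reproduced from the original example and could not be confirmed against the azurerm provider documentation, so read them as a description of intent rather than configuration that will apply as-is.
Training configuration
Define the environment and parameters of the training job:
# Resource type as given in the original example (unconfirmed)
resource "azurerm_machine_learning_environment" "ml_environment" {
  name                          = "ml-environment"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id

  docker {
    image = "mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:20210615.v1"
  }

  # Conda specification; <<- strips the common leading indentation
  conda_file = <<-EOF
    name: ml-environment
    channels:
      - defaults
      - anaconda
      - conda-forge
    dependencies:
      - python=3.8
      - scikit-learn=0.24.2
      - pandas=1.3.0
      - numpy=1.21.0
      - pip=21.0.1
      - pip:
          - azureml-sdk==1.34.0
  EOF
}

# Resource type as given in the original example (unconfirmed)
resource "azurerm_machine_learning_job" "ml_training_job" {
  name                          = "ml-training-job"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  experiment_name               = "ml-experiment"
  compute_id                    = azurerm_machine_learning_compute_cluster.ml_compute_cluster.id
  environment_id                = azurerm_machine_learning_environment.ml_environment.id

  # $${ escapes Terraform interpolation so the literal ${{...}} placeholders survive
  command = "python train.py --data-path $${{inputs.data}} --model-path $${{outputs.model}}"

  input {
    name = "data"
    path = azurerm_machine_learning_dataset.ml_dataset.id
  }

  output {
    name = "model"
    path = "${azurerm_storage_account.ml_storage.primary_blob_endpoint}models/"
  }

  distribution {
    type = "Mpi"

    mpi {
      process_count_per_instance = 1
    }
  }

  resource_configuration {
    instance_count = 2
    instance_type  = "STANDARD_DS3_V2"
    max_epochs     = 10
  }
}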
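Where a job is not available as a Terraform resource, one workaround is to have Terraform call the Azure CLI. The sketch below assumes the `ml` extension of the Azure CLI is installed, that a job specification file named `job.yml` (hypothetical) sits next to the configuration, and that the hashicorp/null provider is acceptable; it is a swapped-in technique, not the approach of the example above.

```hcl
# Submit the training job through the Azure CLI whenever job.yml changes
resource "null_resource" "submit_training_job" {
  triggers = {
    # Re-run the provisioner when the (hypothetical) job spec changes
    job_spec_hash = filesha256("${path.module}/job.yml")
  }

  provisioner "local-exec" {
    command = "az ml job create --file ${path.module}/job.yml --resource-group ${azurerm_resource_group.ml_rg.name} --workspace-name ${azurerm_machine_learning_workspace.ml_workspace.name}"
  }
}
```

Terraform only records that the command ran; the job's progress and result stay in Azure ML, so this suits triggering work rather than tracking it.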
Model deployment
After training completes, deploy the model as a web service. As with the training resources, the `azurerm_machine_learning_model`, `azurerm_machine_learning_inference_endpoint`, and `azurerm_machine_learning_deployment` resource types are reproduced from the original example and could not be confirmed against the azurerm provider documentation:
# Resource type as given in the original example (unconfirmed)
resource "azurerm_machine_learning_model" "ml_model" {
  name                          = "ml-model"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  description                   = "Sample ML model"
  model_uri                     = "${azurerm_storage_account.ml_storage.primary_blob_endpoint}models/model.pkl"
  model_format                  = "PICKLE"
  model_framework               = "SCIKITLEARN"
  model_framework_version       = "0.24.2"
}

# Resource type as given in the original example (unconfirmed)
resource "azurerm_machine_learning_inference_endpoint" "ml_endpoint" {
  name                          = "ml-inference-endpoint"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  description                   = "ML inference endpoint"
}

# Resource type as given in the original example (unconfirmed)
resource "azurerm_machine_learning_deployment" "ml_deployment" {
  name                          = "ml-deployment"
  inference_endpoint_id         = azurerm_machine_learning_inference_endpoint.ml_endpoint.id
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  compute_type                  = "Aci"
  model_id                      = azurerm_machine_learning_model.ml_model.id
  environment_id                = azurerm_machine_learning_environment.ml_environment.id

  aci_config {
    cpu_cores      = 1
    memory_gb      = 2
    instance_count = 1
  }

  code_configuration {
    code_uri       = "${azurerm_storage_account.ml_storage.primary_blob_endpoint}code/"
    scoring_script = "score.py"
  }
}
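Complete example and best practices
Project structure
A recommended layout for an ML workspace Terraform project:
ml-workspace/
├── main.tf              # Main configuration
├── variables.tf         # Variable definitions
├── outputs.tf           # Output definitions
├── terraform.tfvars     # Variable values
├── modules/             # Reusable modules
│   ├── workspace/       # ML workspace module
│   ├── compute/         # Compute resources module
│   └── deployment/      # Deployment module
└── examples/            # Example configurations
    └── basic/           # Basic example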
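With that layout, the root main.tf mostly wires modules together. The following sketch shows what such calls might look like; the input names and the `workspace_id` output are hypothetical and would have to be declared inside the respective modules.

```hcl
module "workspace" {
  source = "./modules/workspace"

  # Hypothetical inputs, declared in modules/workspace/variables.tf
  resource_group_name = "ml-resource-group"
  location            = "eastus"
  workspace_name      = "ml-workspace"
}

module "compute" {
  source = "./modules/compute"

  # Assumes modules/workspace/outputs.tf declares an output named workspace_id
  workspace_id = module.workspace.workspace_id
}
```

Passing the workspace ID through a module output also gives Terraform the dependency it needs to create the workspace before the compute.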
Variables and outputs
Use variables and outputs to make the configuration more flexible and easier to maintain:
variables.tf
variable "resource_group_name" {
  description = "Name of the resource group"
  type        = string
  default     = "ml-resource-group"
}

variable "location" {
  description = "Azure region"
  type        = string
  default     = "eastus"
}

variable "workspace_name" {
  description = "Name of the ML workspace"
  type        = string
  default     = "ml-workspace"
}

variable "sku_name" {
  description = "SKU for the ML workspace"
  type        = string
  default     = "Basic"
}
outputs.tf
output "workspace_id" {
  description = "The ID of the ML workspace"
  value       = azurerm_machine_learning_workspace.ml_workspace.id
}

output "workspace_url" {
  description = "The discovery URL of the ML workspace"
  # The original read a workspace_url attribute; the attribute documented
  # for this resource is discovery_url
  value       = azurerm_machine_learning_workspace.ml_workspace.discovery_url
}

output "endpoint_url" {
  description = "The URL of the inference endpoint"
  # Depends on the unconfirmed inference endpoint resource type shown earlier
  value       = azurerm_machine_learning_inference_endpoint.ml_endpoint.scoring_uri
}
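Declaring variables has no effect until resources read them. As a minimal sketch, the resource group from the basic configuration could take its name and region from the variables above instead of hard-coded strings:

```hcl
# Replaces the hard-coded resource group definition in main.tf
resource "azurerm_resource_group" "ml_rg" {
  name     = var.resource_group_name
  location = var.location
}

# Example override in terraform.tfvars (the value is illustrative):
# location = "westeurope"
```

Values in terraform.tfvars take precedence over the `default` in each variable block, so the defaults act as a fallback.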
Management and monitoring
Workspace monitoring
Configure diagnostic settings on the workspace to track resource usage and performance:
resource "azurerm_monitor_diagnostic_setting" "ml_workspace_monitoring" {
  name                       = "ml-workspace-monitoring"
  target_resource_id         = azurerm_machine_learning_workspace.ml_workspace.id
  storage_account_id         = azurerm_storage_account.ml_storage.id
  log_analytics_workspace_id = azurerm_log_analytics_workspace.ml_log_analytics.id

  # Category name as given in the original example; confirm which log
  # categories the workspace actually exposes before applying
  log {
    category = "WorkspaceLogs"
    enabled  = true

    retention_policy {
      enabled = true
      days    = 30
    }
  }

  metric {
    category = "AllMetrics"
    enabled  = true

    retention_policy {
      enabled = true
      days    = 30
    }
  }
}

resource "azurerm_log_analytics_workspace" "ml_log_analytics" {
  name                = "ml-log-analytics"
  location            = azurerm_resource_group.ml_rg.location
  resource_group_name = azurerm_resource_group.ml_rg.name
  sku                 = "PerGB2018" # the argument is sku, not sku_name
  retention_in_days   = 30
}
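Diagnostic log category names differ by resource type, so it can help to ask Azure which ones the workspace exposes before hard-coding any. The sketch below uses the `azurerm_monitor_diagnostic_categories` data source; the `log_category_types` attribute is assumed to be available in the provider version you pin, and the output name is hypothetical.

```hcl
# Discover the diagnostic categories the workspace supports
data "azurerm_monitor_diagnostic_categories" "ml_workspace" {
  resource_id = azurerm_machine_learning_workspace.ml_workspace.id
}

# Print the available log category names after apply
output "ml_workspace_log_categories" {
  value = data.azurerm_monitor_diagnostic_categories.ml_workspace.log_category_types
}
```

The listed names can then be used as `category` values in the diagnostic setting.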
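Cost management
Set a budget and cost alerts to keep ML workspace spending under control:
resource "azurerm_consumption_budget_resource_group" "ml_budget" {
  name              = "ml-budget"
  resource_group_id = azurerm_resource_group.ml_rg.id
  amount            = 1000 # in the billing currency of the subscription
  time_grain        = "Monthly"

  time_period {
    # First day of the current month
    start_date = formatdate("YYYY-MM-01'T'00:00:00Z", timestamp())
    # Roughly 12 months later; timeadd accepts only hour, minute, and second units
    end_date   = formatdate("YYYY-MM-01'T'00:00:00Z", timeadd(timestamp(), "8760h"))
  }

  notification {
    enabled        = true
    operator       = "GreaterThanOrEqualTo"
    threshold      = 80 # percent of the budget amount
    threshold_type = "Actual"
    contact_emails = ["admin@example.com"]
  }

  lifecycle {
    # timestamp() changes on every run; do not treat that as drift
    ignore_changes = [time_period]
  }
}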
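Advanced techniques
Workspace upgrade
As a project grows, you may need to harden the workspace. The original example frames this as moving from the Basic SKU to an Enterprise SKU; check which `sku_name` values your provider version accepts before applying it:
resource "azurerm_machine_learning_workspace" "ml_workspace" {
  # ... other settings unchanged

  # Value as given in the original example; confirm it is accepted
  sku_name = "Enterprise"

  # Enable advanced features
  # The documented argument is high_business_impact (the original used
  # hbi_workspace); changing it forces the workspace to be re-created
  high_business_impact          = true
  public_network_access_enabled = false

  # The original example also declared the block below. It is not part of this
  # resource's documented schema; private endpoints are defined as separate
  # azurerm_private_endpoint resources that point at the workspace.
  # private_endpoint_connection {
  #   name                = "ml-workspace-private-endpoint"
  #   private_endpoint_id = azurerm_private_endpoint.ml_workspace_pe.id
  #   private_link_service_connection_state {
  #     status      = "Approved"
  #     description = "Private endpoint approved"
  #   }
  # }
}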
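The `azurerm_private_endpoint.ml_workspace_pe` resource referenced above is never defined in this article. A minimal sketch of what it might look like follows, assuming an existing subnet whose ID is supplied through a hypothetical variable and assuming `amlworkspace` as the sub-resource name for Machine Learning workspaces.

```hcl
variable "private_endpoint_subnet_id" {
  description = "ID of an existing subnet for the private endpoint (hypothetical input)"
  type        = string
}

# Private endpoint that connects the subnet to the ML workspace
resource "azurerm_private_endpoint" "ml_workspace_pe" {
  name                = "ml-workspace-private-endpoint"
  location            = azurerm_resource_group.ml_rg.location
  resource_group_name = azurerm_resource_group.ml_rg.name
  subnet_id           = var.private_endpoint_subnet_id

  private_service_connection {
    name                           = "ml-workspace-connection"
    private_connection_resource_id = azurerm_machine_learning_workspace.ml_workspace.id
    subresource_names              = ["amlworkspace"]
    is_manual_connection           = false
  }
}
```

Name resolution for the private endpoint (private DNS zones) is not covered by this sketch and would need to be set up separately.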
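Workspace recovery and backup
Configure workspace backup to guard against data loss. The `azurerm_machine_learning_workspace_backup` resource type below is reproduced from the original example and could not be confirmed against the azurerm provider documentation; the backup storage account that follows it is a standard resource:
# Resource type as given in the original example (unconfirmed)
resource "azurerm_machine_learning_workspace_backup" "ml_workspace_backup" {
  name                          = "ml-workspace-backup"
  machine_learning_workspace_id = azurerm_machine_learning_workspace.ml_workspace.id
  frequency                     = "Daily"
  retention_period              = 30
  storage_account_id            = azurerm_storage_account.ml_backup_storage.id
  backup_window                 = "02:00-04:00"
}

# Geo-redundant storage account for backups (the name must be globally unique)
resource "azurerm_storage_account" "ml_backup_storage" {
  name                     = "mlbackupstorage"
  resource_group_name      = azurerm_resource_group.ml_rg.name
  location                 = azurerm_resource_group.ml_rg.location
  account_tier             = "Standard"
  account_replication_type = "GRS"
  account_kind             = "StorageV2"
}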
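Independently of any backup mechanism, accidental deletion can be guarded against with a management lock. This is a complementary sketch rather than a backup; the lock name and note text are hypothetical.

```hcl
# Block deletion of the workspace until the lock is removed
resource "azurerm_management_lock" "ml_workspace_lock" {
  name       = "ml-workspace-no-delete"
  scope      = azurerm_machine_learning_workspace.ml_workspace.id
  lock_level = "CanNotDelete"
  notes      = "Guards the ML workspace against accidental deletion"
}
```

While the lock exists, `terraform destroy` on the workspace fails until the lock itself is destroyed first, which is the intended safeguard.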
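Summary
This article covered how to use terraform-provider-azurerm to create, configure, and manage an Azure Machine Learning workspace across the ML lifecycle. From basic environment setup to more advanced features, Terraform offers a consistent, reliable, and repeatable way to manage your ML infrastructure.
The main advantages of managing an ML workspace with Terraform are:
- Infrastructure as code: the whole ML environment is defined in code, which supports version control and collaboration
- Automated deployment: a complete ML environment is deployed with a single command, reducing manual steps and errors
- Reusability: modular configuration can be reused across projects
- Consistency: development, test, and production environments stay aligned
- Scalability: the ML environment can be extended and adjusted as requirements change
Next steps
- Study Terraform modules in more depth and build a reusable ML workspace module
- Explore Azure Machine Learning pipelines to automate end-to-end ML workflows
- Learn how to use Azure Policy to manage and govern ML resources
- Try Terraform Cloud or Azure DevOps to build a CI/CD pipeline that automates ML model deployment
We hope this article helps you make better use of terraform-provider-azurerm in your Azure Machine Learning projects. If you have questions or suggestions, feel free to leave a comment.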
Authoring statement: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.



