In this article, I would like to show how to build Continuous Integration and Continuous Delivery pipelines for a Machine Learning project with Azure DevOps.
First of all, let's define CI/CD. As Wikipedia puts it: "CI/CD bridges the gaps between development and operation activities and teams by enforcing automation in building, testing, and deployment of applications. Modern-day DevOps practices involve continuous development, continuous testing, continuous integration, continuous deployment, and continuous monitoring of software applications throughout its development life cycle. The CI/CD practice or CI/CD pipeline forms the backbone of modern-day DevOps operations."
OK, now let's look at CI and CD separately.
Continuous integration is a coding philosophy and set of practices that drive development teams to implement small changes and check code in to version control repositories frequently. Because most modern applications require developing code across different platforms and tools, the team needs a mechanism to integrate and validate its changes.
Continuous delivery picks up where continuous integration ends. CD automates the delivery of applications to selected infrastructure environments. Most teams work with multiple environments other than production, such as development and testing environments, and CD ensures there is an automated way to push code changes to them.
So, why is this important? Machine Learning applications are becoming popular in our industry; however, the process for developing, deploying, and continuously improving them is more complex than for more traditional software, such as a web service or a mobile application.
OK, let's try to build a simple pipeline for my project. The project is about predictive analytics and uses the following Azure services: Azure SQL, Azure Data Factory, Azure Storage Account V2, and Azure Databricks.
This solution consists of three steps. The first step runs an ADF Data Flow that gets data from the SQL database, converts the format of several columns, selects the columns we need from the table, and saves the results to the Stage folder in Azure Data Lake. The second step runs an Azure Databricks notebook from ADF, with the storage account passed as a parameter, to prepare the history dataset and train the model. At this step, we can use MLflow to track experiments and to record and compare parameters and results. The last step runs another Azure Databricks notebook from ADF, again with the storage account passed as a parameter, to score new data with the model and save the results to Azure Data Lake.
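To give an idea of what the experiment tracking in the second step can look like, here is a minimal sketch of MLflow usage inside a training notebook. The dataset, the model, and the parameter values are placeholders, not the exact code of my notebook:

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-in for the history dataset prepared in the notebook
X, y = make_regression(n_samples=1_000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="history-train"):
    params = {"n_estimators": 100, "max_depth": 8}
    mlflow.log_params(params)

    model = RandomForestRegressor(**params).fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))

    mlflow.log_metric("mae", mae)             # metric used to compare runs
    mlflow.sklearn.log_model(model, "model")  # keep the trained model as an artifact
```

Each run recorded this way shows up in the Databricks experiment UI, which makes it easy to compare parameters and metrics between training runs.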
To start this project, I created three resource groups in the Azure portal, one for each environment: Dev, Staging, and Production. In each environment I created the following services: Azure SQL, Azure Data Factory, Azure Storage Account V2, Azure Databricks, and Azure Key Vault.
Into each Azure SQL database I uploaded two tables: history data and score data. You can find a description of this data at this link. In the Azure Storage account I created a container with three folders: RawData, PreprocData, and Results. The next step is the creation of secrets in Azure Key Vault: one secret with the Access key of the Azure Storage account and one with the Connection string of my Azure SQL DB. One important thing: to give Azure Data Factory access to your secrets, you need to configure the Access policies of the Azure Key Vault:
As you can see, I added access policies for the Azure Data Factory and Azure Databricks services that I created earlier, permitting them to get secrets.
The next step is to configure Azure Databricks. To reference secrets stored in Azure Key Vault, I created a secret scope backed by Azure Key Vault. To create it, go to https://<databricks-instance>#secrets/createScope (this URL is case sensitive; scope in createScope must be uppercase) and fill in this form:
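Once the scope exists, the notebooks can read the Key Vault secrets through dbutils and use them, for example, to access the storage account. A minimal sketch, assuming the scope is named keyvault-scope and the secret holding the storage access key is named storage-access-key (both names, the container, and the path are placeholders):

```python
# Runs inside a Databricks notebook, where `dbutils` and `spark` are predefined.
storage_account = "mlstorageaccount"  # placeholder storage account name

# Read the access key from the Key Vault-backed secret scope
access_key = dbutils.secrets.get(scope="keyvault-scope", key="storage-access-key")

# Configure Spark to access the ADLS Gen2 account with that key
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    access_key,
)

# Now the staged data can be read directly from the data lake
df = spark.read.parquet(
    f"abfss://data@{storage_account}.dfs.core.windows.net/RawData/history"
)
```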
The next step in the Azure Databricks configuration is to connect my notebooks to the Azure DevOps repository. To do this, just open your notebook, click Sync, and fill in the form with information about your Azure DevOps repository. This way, you can commit updates made in your notebook to the repository.
The base configuration of the Azure services is done, so let's start creating a pipeline in Azure Data Factory. The first step is connecting ADF to the Azure DevOps repository. This can be done in two ways: during service creation or later in the ADF configuration. I need to configure this connection only for the Dev environment.
OK, now we can choose the branch in which we would like to develop our pipeline. The next step is to create linked services for all the Azure services.
To create the Azure Databricks linked service, I filled in the following form:
To create the Azure SQL DB linked service, I filled in the following form:
To create the Azure Storage Account linked service, I filled in the following form:
To create the Azure Key Vault linked service, I filled in the following form:
To control the linked service names across the different environments, I use Global parameters in ADF; three parameters were created: dev, stg, and prd.
Let’s create a pipeline:
The first step of my pipeline is a Data Flow:
In this Data Flow, I get data from the Azure SQL database, convert the data format, select some columns from the table, and upload the results to the appropriate folder in Azure Data Lake. This process was created for both the history and score (new data) tables.
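The Data Flow itself is built visually in ADF, but a rough PySpark equivalent of what it does might look like the sketch below. The JDBC URL, credentials, table name, column names, and output path are all hypothetical; in the real setup the connection string comes from Key Vault:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Hypothetical connection details; the real connection string lives in Key Vault
jdbc_url = "jdbc:sqlserver://my-sql-server.database.windows.net:1433;database=my-db"
connection_props = {
    "user": "sql-user",
    "password": "sql-password",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# 1. Get data from the Azure SQL table
raw = spark.read.jdbc(url=jdbc_url, table="dbo.history", properties=connection_props)

# 2. Convert column formats and keep only the columns we need
staged = (raw
          .withColumn("event_date", col("event_date").cast("date"))
          .withColumn("value", col("value").cast("double"))
          .select("id", "event_date", "value"))

# 3. Save the result to the Stage folder in Azure Data Lake
staged.write.mode("overwrite").parquet(
    "abfss://data@mlstorageaccount.dfs.core.windows.net/Stage/history"
)
```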
The next steps of the pipeline run the data_preparation_model_train notebook and then the score_new_data notebook. To set this up, we need to:
- Select the appropriate linked service
- Select the notebook path
- Add a parameter with the storage account name
The same is done for the score_new_data notebook.
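On the notebook side, the storage account parameter passed from ADF can be picked up with a widget. A minimal sketch of how score_new_data might read that parameter, score new data, and write the output; the widget name, container, paths, and MLflow run id are assumptions:

```python
# Runs inside a Databricks notebook, where `dbutils` and `spark` are predefined.
import mlflow.pyfunc
from pyspark.sql.functions import struct

# Parameter passed from the ADF Databricks activity ("Base parameters" section)
storage_account = dbutils.widgets.get("storage_name")
base_path = f"abfss://data@{storage_account}.dfs.core.windows.net"

# New data prepared by the previous steps
new_data = spark.read.parquet(f"{base_path}/PreprocData/score")

# Wrap the model trained earlier (hypothetical MLflow run id) as a Spark UDF
predict_udf = mlflow.pyfunc.spark_udf(spark, "runs:/<run-id>/model")

# Score and save the results to the Results folder in the data lake
scored = new_data.withColumn("prediction", predict_udf(struct(*new_data.columns)))
scored.write.mode("overwrite").parquet(f"{base_path}/Results/score")
```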
These are all the stages of my ADF pipeline. The next steps are to validate the pipeline, create a pull request to merge my branch into master, and then publish all updates to the master branch from ADF. The results:
Let's start configuring CI/CD in Azure DevOps.
The first step is to create a new release pipeline (in the Azure DevOps portal, navigate to Pipelines -> Releases and click New pipeline).
The "Select a template" window shows various pre-configured templates. In the case of Data Factory, the choice is to create an Empty job and name it, for example, ADF-DevOps2020-CICD:
Create a new stage “Staging” (test env):
Now we have a blank release pipeline. The next step is to create variable groups and variables that will be used by the tasks. For this, navigate to Pipelines -> Library and create a new variable group. In the new form, fill in the following information and then clone the group to create the same one for the Production stage.
As a result, I have two variable groups. Later they will be mapped to the corresponding stages.
The next step is to create release pipeline variables. These variables contain environment-independent values because they inherit the values of the variable groups; the variable's value is therefore adjusted to the stage where it is used. To create them, return to Pipelines -> Releases and open the Variables tab. I created the following variables:
It is time to create and configure the development stage.
These actions create the Development stage. Every time we update our master branch, the continuous delivery pipeline is triggered to deliver the updated ARM template to the other environments, such as Staging and Production. We can also automate this process by creating a trigger:
Now it is time to configure the Staging (test) stage and create several tasks for it. For my first pipeline, I created two tasks: deploying the pipeline to Azure Data Factory and copying the Python notebooks to Azure Databricks.
Let's go into more detail on the ARM template deployment task. The main things I would like to describe:
- In the Resource group field, fill in the variable $(resourceGroup)
- Choose a template location: "Linked artifact"
- Choose the ARM template and template parameter files
- Override the default template parameters with the variables created previously
As a result, we get the following form:
My next task is Databricks Deploy Notebooks. Let's go into more detail on it. We need to fill in:
- The folder in Azure Databricks where I would like to copy my notebooks
- The app id I created earlier (how to create one: link)
- The subscription id
- The tenant id
As a result, I have two tasks at this stage:
I also need a Production stage. Click Clone on the Staging stage to create it:
The Production stage was created just by cloning Staging, without extra adjustments or additions. This is because the underlying job is configured through variables that hold pointers to all external references, such as resource groups, key vaults, and storage accounts. For this stage I also recommend creating a "Pre-deployment" approval. Then, when the Staging deployment is over, the execution flow will wait for action from an assigned approver. I can react directly from a notification email or through the DevOps portal UI.
The last configuration step is mapping the variable groups to stages. This is necessary so that stage-specific variable values are used in the corresponding stages. For example, the variable $(Environment) has the value "prd" if the job runs in the Production stage and "stg" if the job is triggered in Staging. To set this up, open the Variables -> Variable groups tab and click "Link variable group". Map the variable group "Production" to the stage "Production", select the scope "stage Production" instead of "Release", and then repeat the same action for "Staging".
Time to run and check this simple pipeline.
As a result, I found the ADF pipeline and Azure Databricks notebooks deployed to the appropriate services in the Staging and Production resource groups. I can also analyze the details of every task at each stage, for example:
To make this pipeline more complete, I can also add some test tasks to the Staging stage. These tests, for example, can run the ADF pipeline and analyze its results: the table copied to Azure Data Lake, the MLflow logs, and the scored results table. After all these tests pass, we can approve the Production stage.
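As an illustration, such a test could be a small notebook or script run from the release pipeline that checks whether the scored output actually landed in the Results folder. A minimal sketch, assuming the same placeholder container, paths, and parameter name as above:

```python
# Runs inside a Databricks notebook, where `dbutils` and `spark` are predefined.
storage_account = dbutils.widgets.get("storage_name")
results_path = f"abfss://data@{storage_account}.dfs.core.windows.net/Results/score"

scored = spark.read.parquet(results_path)

# Basic sanity checks before approving the Production stage
assert scored.count() > 0, "The scoring step produced no rows"
assert "prediction" in scored.columns, "The prediction column is missing"

print(f"Scoring output OK: {scored.count()} rows")
```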
In this story, I wanted to illustrate, in a detailed step-by-step way, how release pipelines can be created and configured to enable CI/CD practices in Azure Data Factory environments. It shows the limitations and also the possibilities that Azure DevOps, Azure Data Factory, Azure SQL DB, Azure Data Lake, and Azure Databricks bring when used together.
Thanks for reading.