A Guide to Creating a Data Science Pipeline in a Kedro Project

Article link: https://blog.youkuaiyun.com/gitblog_00210/article/details/148415846

Kedro project repository: https://gitcode.com/gh_mirrors/ked/kedro

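Overview

This article explains how to add a data science pipeline to a Kedro project, a key step in building a machine learning project. Starting from the basic concepts, it shows step by step how to extend the default project pipeline with a data science workflow.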

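Core Components of a Data Science Pipeline

A data science pipeline typically consists of the following parts:

Data processing nodes: handle preprocessing such as splitting the data and feature engineering
Model training nodes: build and train the machine learning model
Model evaluation nodes: assess model performance and report metrics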

Implementing the Node Functions

In nodes.py, we define three core functions:

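import logging
from typing import Any

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

logger = logging.getLogger(__name__)


def split_data(data: pd.DataFrame, parameters: dict[str, Any]) -> tuple:
    """Split the data into training and test sets of features and target."""
    X = data[parameters["features"]]
    y = data["price"]  # target column
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return X_train, X_test, y_train, y_test


def train_model(X_train: pd.DataFrame, y_train: pd.Series) -> LinearRegression:
    """Train a linear regression model."""
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor


def evaluate_model(
    regressor: LinearRegression, X_test: pd.DataFrame, y_test: pd.Series
) -> None:
    """Evaluate the model and log its R² score."""
    y_pred = regressor.predict(X_test)
    score = r2_score(y_test, y_pred)
    logger.info("Model has a coefficient R² of %.3f on test data.", score)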
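
To make the logged metric concrete, here is a minimal pure-Python sketch of the coefficient of determination, R² = 1 - SS_res / SS_tot. It is an illustration only: the evaluation node relies on scikit-learn's r2_score, and the helper name r2 and the sample values below are made up for this example.

```python
def r2(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variance of the target around its mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values, not taken from the tutorial dataset
score = r2([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(f"R² = {score:.3f}")
```

A score of 1.0 means the predictions match the targets exactly, while a score near 0 means the model does no better than always predicting the mean.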

Managing Parameters

Kedro uses YAML files to manage configuration parameters. The parameters for the data science pipeline are stored in parameters_data_science.yml:

model_options:
  test_size: 0.2
  random_state: 3
  features:
    - engines
    - passenger_capacity
    - crew
    - d_check_complete
    - moon_clearance_complete
    - iata_approved
    - company_rating
    - review_scores_rating

These parameters control the train/test split ratio, the random seed, and the feature columns used for modelling.
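
The effect of test_size and random_state can be sketched with the standard library alone. The helper split_rows below is a hypothetical, simplified stand-in for scikit-learn's train_test_split, not its actual implementation; it only shows that the ratio fixes the size of the held-out set and that a fixed seed makes the split reproducible.

```python
import math
import random


def split_rows(rows: list, test_size: float, random_state: int) -> tuple[list, list]:
    """Shuffle rows with a fixed seed, then hold out a fraction for testing."""
    shuffled = list(rows)
    # A dedicated Random instance keeps the shuffle independent of global state
    random.Random(random_state).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Same keys as model_options in the YAML above (features omitted)
model_options = {"test_size": 0.2, "random_state": 3}
train, test = split_rows(list(range(10)), **model_options)
print(len(train), len(test))
```

Running the split twice with the same random_state yields the same partition, which is why the seed is kept in configuration rather than hard-coded.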

Configuring Model Persistence

Configure how the model is saved in catalog.yml:

regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  versioned: true

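Setting versioned: true enables versioning for the model: each pipeline run saves a new version, which makes it easier to trace how the model evolves over time.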
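
As a rough illustration of what versioning means on disk, the sketch below builds a save path under the assumption that each version is stored at `<filepath>/<version>/<filename>` and that the version is a UTC timestamp with millisecond precision. Both helper functions are hypothetical and written for this example; check the Kedro data catalog documentation for the exact layout in your version.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def version_string(moment: datetime) -> str:
    """Format a timestamp as YYYY-MM-DDThh.mm.ss.sssZ (assumed format)."""
    # %f gives microseconds; dropping the last three digits leaves milliseconds
    return moment.strftime("%Y-%m-%dT%H.%M.%S.%f")[:-3] + "Z"


def versioned_path(filepath: str, version: str) -> PurePosixPath:
    """Assumed layout: <filepath>/<version>/<filename>."""
    base = PurePosixPath(filepath)
    return base / version / base.name


moment = datetime(2024, 1, 2, 3, 4, 5, 678000, tzinfo=timezone.utc)
path = versioned_path("data/06_models/regressor.pickle", version_string(moment))
print(path)
```

Because every run writes to a new timestamped directory, earlier models are never overwritten and can be compared or reloaded later.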

Building the Pipeline

Define the data science pipeline in pipeline.py:

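from kedro.pipeline import Pipeline, node, pipeline

from .nodes import evaluate_model, split_data, train_model


def create_pipeline(**kwargs) -> Pipeline: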
    return pipeline([
        node(
            func=split_data,
            inputs=["model_input_table", "params:model_options"],
            outputs=["X_train", "X_test", "y_train", "y_test"],
            name="split_data_node",
        ),
        node(
            func=train_model,
            inputs=["X_train", "y_train"],
            outputs="regressor",
            name="train_model_node",
        ),
        node(
            func=evaluate_model,
            inputs=["regressor", "X_test", "y_test"],
            outputs=None,
            name="evaluate_model_node",
        ),
    ])
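
The nodes are connected only through dataset names: a node can run once every dataset it consumes has been produced. The following sketch, which uses the standard library's graphlib rather than Kedro itself, shows how an execution order can be derived from those names. It is a simplified illustration under that assumption, and Kedro's own resolution logic is more involved.

```python
from graphlib import TopologicalSorter

# node name -> (input dataset names, output dataset names), as declared above;
# deliberately listed out of order to show that declaration order does not matter
NODES = {
    "evaluate_model_node": (["regressor", "X_test", "y_test"], []),
    "train_model_node": (["X_train", "y_train"], ["regressor"]),
    "split_data_node": (
        ["model_input_table", "params:model_options"],
        ["X_train", "X_test", "y_train", "y_test"],
    ),
}


def execution_order(nodes: dict) -> list[str]:
    """Order nodes so each one runs after the nodes that produce its inputs."""
    producer = {out: name for name, (_, outs) in nodes.items() for out in outs}
    # Inputs without a producer (e.g. model_input_table) come from the catalog
    graph = {
        name: {producer[i] for i in inputs if i in producer}
        for name, (inputs, _) in nodes.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(NODES)
print(order)
```

This is why the order of the node(...) calls inside pipeline([...]) is a matter of readability rather than correctness.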

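Running and Slicing the Pipeline

Running the Full Pipeline

Run the following command to execute the entire project pipeline (both the data processing and data science parts):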

kedro run

Running a Pipeline Slice

To run only the data science pipeline, use the --pipeline option:

kedro run --pipeline=data_science

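Slicing is especially useful when you want to:

Tune model hyperparameters without changing the data processing logic
Quickly validate a change to the model
Debug a specific part of the pipeline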
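
The --pipeline option selects a pipeline by name from the project's pipeline registry, where the `__default__` entry is what a plain kedro run executes. The sketch below imitates that lookup with plain lists standing in for Pipeline objects. The data processing node names are hypothetical placeholders, and nodes_to_run is an illustrative helper, not part of Kedro.

```python
def register_pipelines() -> dict[str, list[str]]:
    """Map pipeline names to their nodes; plain lists stand in for Pipeline objects."""
    # Hypothetical node names for the data processing part
    data_processing = ["preprocess_node", "create_model_input_table_node"]
    data_science = ["split_data_node", "train_model_node", "evaluate_model_node"]
    return {
        "__default__": data_processing + data_science,
        "data_processing": data_processing,
        "data_science": data_science,
    }


def nodes_to_run(pipeline_name: str = "__default__") -> list[str]:
    """Return the nodes a run would execute for the given pipeline name."""
    return register_pipelines()[pipeline_name]


print(nodes_to_run())                # full project pipeline
print(nodes_to_run("data_science"))  # only the modelling nodes
```

Note that running only data_science assumes its input, model_input_table, has already been saved by an earlier run of the data processing part.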

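Advanced Topic: Modular Pipelines

As a project grows in complexity, it is advisable to modularize its pipelines. Modular pipelines offer the following advantages:

Logical isolation: modules with different responsibilities are independent of one another
Reusability: the same template can be instantiated multiple times
Maintainability: each module can be tested and updated on its own

Modular Implementation Example

Update the catalog: give each instance its own namespaced entry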

active_modelling_pipeline.regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor_active.pickle
  versioned: true

candidate_modelling_pipeline.regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor_candidate.pickle
  versioned: true

Separate the parameters: give each instance its own parameter set

active_modelling_pipeline:
    model_options:
      test_size: 0.2
      random_state: 3
      features: [engines, passenger_capacity, crew, d_check_complete, ...]

candidate_modelling_pipeline:
    model_options:
      test_size: 0.2
      random_state: 8
      features: [engines, passenger_capacity, crew, review_scores_rating]
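
Nesting the options under a namespace key means each pipeline instance looks up its own model_options. The sketch below shows one way such a dotted parameter name could be resolved against the nested structure. The function resolve_param is a hypothetical helper for illustration, and the feature lists are left out of the dictionary for brevity.

```python
from typing import Any

# Mirrors the structure of the YAML above (feature lists omitted)
PARAMETERS = {
    "active_modelling_pipeline": {
        "model_options": {"test_size": 0.2, "random_state": 3},
    },
    "candidate_modelling_pipeline": {
        "model_options": {"test_size": 0.2, "random_state": 8},
    },
}


def resolve_param(parameters: dict[str, Any], dataset_name: str) -> Any:
    """Resolve a name such as 'params:a.b' by walking the nested dictionaries."""
    key = dataset_name.removeprefix("params:")
    value: Any = parameters
    for part in key.split("."):
        value = value[part]
    return value


active = resolve_param(PARAMETERS, "params:active_modelling_pipeline.model_options")
candidate = resolve_param(PARAMETERS, "params:candidate_modelling_pipeline.model_options")
print(active["random_state"], candidate["random_state"])
```

The two instances therefore share the same node code while training with different seeds and feature sets.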

Instantiate the pipeline template: create two independent pipeline instances

from kedro.pipeline import Pipeline, pipeline


def create_pipeline(**kwargs) -> Pipeline:
    pipeline_template = pipeline([...])  # the original node definitions
    
    active_pipeline = pipeline(
        pipe=pipeline_template,
        inputs="model_input_table",
        namespace="active_modelling_pipeline",
    )
    
    candidate_pipeline = pipeline(
        pipe=pipeline_template,
        inputs="model_input_table",
        namespace="candidate_modelling_pipeline",
    )

    return active_pipeline + candidate_pipeline
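
Applying a namespace prefixes the instance's dataset and parameter names, which is why the catalog and parameter files above use keys such as active_modelling_pipeline.regressor. Datasets listed under inputs, here model_input_table, keep their original names so both instances read the same table. The sketch below models that renaming rule with a hypothetical helper, apply_namespace. It is a simplified illustration of the behaviour, not Kedro's implementation.

```python
def apply_namespace(name: str, namespace: str, shared: set[str]) -> str:
    """Prefix a dataset or parameter name with the namespace unless it is shared."""
    if name in shared:
        return name  # declared via inputs=..., so left untouched
    if name.startswith("params:"):
        # The prefix goes after 'params:' so the lookup stays a parameter lookup
        return f"params:{namespace}.{name[len('params:'):]}"
    return f"{namespace}.{name}"


shared = {"model_input_table"}
names = ["model_input_table", "params:model_options", "X_train", "regressor"]
for ns in ("active_modelling_pipeline", "candidate_modelling_pipeline"):
    print([apply_namespace(n, ns, shared) for n in names])
```

Since every intermediate dataset is prefixed, the two instances can be added together without their X_train or regressor outputs colliding.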

This pattern is particularly well suited to the following scenarios: