Prefect in Practice: Building Resilient Data Pipelines and Serverless Deployment

Article link: https://blog.youkuaiyun.com/gitblog_01031/article/details/148374531

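prefect (PrefectHQ/prefect): a distributed task scheduling and management platform, suited to automated task execution and CI/CD. It supports multiple task executors and real-time monitoring of task status and logs. Project address: https://gitcode.com/gh_mirrors/pr/prefect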

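Introduction

In modern data engineering, reliable and resilient data pipelines are essential. Using Prefect, this article shows how to add failure handling and data quality checks to an MLB (Major League Baseball) data pipeline, and then how to deploy that pipeline to serverless infrastructure.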

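Environment Setup

Before starting, make sure the following are available:

A Prefect Cloud account
A MotherDuck account (with a valid token)
An AWS account (with S3 access configured)
A Python development environment (3.9+ recommended; the prefect.transactions module used below ships with Prefect 3, which requires Python 3.9 or later)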

# Clone the sample repository
git clone https://example.com/dev-day-zoom-out.git
cd dev-day-zoom-out/track_1_build_workflows/session_2_resilent_workflows/

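Failure Handling

Prefect offers several failure-handling strategies for dealing gracefully with the exceptions that occur while a pipeline runs.

Basic Retries

The simplest strategy is a fixed number of retries. With retries=10 the task runs at most 11 times: the initial attempt plus ten retries.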

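import random
import time

from prefect import task


@task(retries=10)  # re-run the task up to 10 more times if it raises
def get_recent_games(team_name, start_date, end_date):
    # Simulate a 70% API failure rate
    if random.random() < 0.7:
        time.sleep(2)
        raise Exception("Simulated API call failure")
    # Actual API call logic...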
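
To see why ten retries are plenty for the simulated failure rate, a quick back-of-the-envelope calculation helps. The sketch below is plain Python (no Prefect needed) and assumes that each attempt fails independently with probability 0.7, as in the simulation above; the function name success_probability is made up for illustration.

```python
def success_probability(failure_rate: float, retries: int) -> float:
    """Chance that at least one attempt succeeds.

    A task with `retries` retries runs at most `retries + 1` times,
    and it fails overall only if every one of those attempts fails.
    """
    attempts = retries + 1
    return 1.0 - failure_rate ** attempts


if __name__ == "__main__":
    for retries in (0, 3, 10):
        p = success_probability(0.7, retries)
        print(f"retries={retries:2d} -> success probability {p:.3f}")
```

With retries=10 the task gets 11 attempts, so under this assumption it succeeds about 98% of the time (1 - 0.7^11). The cost is latency: every failed attempt in the simulation sleeps for 2 seconds before raising.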

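Delayed Retries

For external services that may be temporarily overloaded, add a delay between retries. When retry_delay_seconds is a list, it supplies one delay per retry, in order: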

@task(retries=10, retry_delay_seconds=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
def get_recent_games(team_name, start_date, end_date):
    # Business logic...
    ...

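Exponential Backoff

Exponential backoff is a classic way to mitigate the "thundering herd" effect in distributed systems: each retry waits longer than the previous one, which gives an overloaded service time to recover. Prefect's retry_jitter_factor task option can additionally randomize the delays so that many clients do not retry in lockstep.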

from prefect.tasks import exponential_backoff

@task(retries=4, retry_delay_seconds=exponential_backoff(backoff_factor=2))
def get_recent_games(team_name, start_date, end_date):
    # Business logic...
    ...
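
To make the difference between the two delay strategies concrete, the sketch below computes the wait schedule each one produces. It is plain Python, not Prefect's code: backoff_delays is a hypothetical helper that assumes the delay before retry r (counting from 0) is backoff_factor * 2**r, which is our understanding of how Prefect's exponential_backoff helper behaves. Check the Prefect documentation for the exact rule in your version.

```python
from typing import List


def backoff_delays(retries: int, backoff_factor: float) -> List[float]:
    """Delay in seconds before each retry, doubling every time (assumed rule)."""
    return [backoff_factor * 2 ** r for r in range(retries)]


# The explicit list from the delayed-retry example: 1, 2, ..., 10 seconds.
linear = list(range(1, 11))

# The exponential schedule for retries=4, backoff_factor=2.
exponential = backoff_delays(retries=4, backoff_factor=2)

if __name__ == "__main__":
    print("linear:", linear, "total wait:", sum(linear), "s")
    print("exponential:", exponential, "total wait:", sum(exponential), "s")
```

Under that assumption, four exponential retries wait 2, 4, 8 and 16 seconds (30 seconds in total), while the ten-step linear list waits 55 seconds in total but never more than 10 seconds at a time.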

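Custom Retry Handler

For more complex scenarios, custom retry logic can be supplied through retry_condition_fn, a function that decides after each failure whether the task should be retried: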

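def retry_handler(task, task_run, state) -> bool:
    """Custom retry condition: return True to retry, False to give up."""
    try:
        state.result()  # re-raises the exception that failed the task
    except Exception as e:
        return isinstance(e, TimeoutError)  # retry only on timeouts
    return False  # no exception was raised, so there is nothing to retry

@task(retries=10, retry_condition_fn=retry_handler)
def get_recent_games(team_name, start_date, end_date):
    # Business logic...
    ...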
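
What a retry condition changes is easiest to see in a stripped-down retry loop. The following is a plain-Python sketch, not Prefect's engine: run_with_retries and only_timeouts are hypothetical names, and the predicate receives the exception directly rather than a Prefect state object.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def only_timeouts(exc: Exception) -> bool:
    """Retry condition: retry only when the failure was a timeout."""
    return isinstance(exc, TimeoutError)


def run_with_retries(
    fn: Callable[[], T],
    retries: int,
    should_retry: Callable[[Exception], bool],
) -> T:
    """Call fn, retrying up to `retries` times while should_retry allows it."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception as exc:
            out_of_retries = attempt == retries
            if out_of_retries or not should_retry(exc):
                raise  # give up and propagate the original error
    raise AssertionError("unreachable when retries >= 0")
```

A function that raises TimeoutError twice and then succeeds is called three times in total, whereas one that raises ValueError is called once and the error propagates immediately, even though retries remain.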

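Data Quality Checks

Prefect's transactional interface gives a data pipeline all-or-nothing behavior: when a step inside a transaction fails, the on_rollback hooks of the steps that already completed run and undo their side effects.

Implementing the Data Quality Check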

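import json
import os

from prefect import task
from prefect.transactions import transaction

# save_raw_data_to_file is a task defined elsewhere in the pipeline
@save_raw_data_to_file.on_rollback
def del_file(txn):
    """Delete the file when the data quality check fails."""
    os.unlink(txn.get("filepath"))

@task
def quality_test(file_path):
    """Check that the data is complete."""
    with open(file_path, "r") as f:
        data = json.load(f)
    if len(data) < 5:  # require at least 5 games
        raise ValueError(f"Not enough data! The file contains only {len(data)} games")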

Using a Transaction in the Flow

@flow
def mlb_flow_rollback(team_name, start_date, end_date):
    # Fetch game data...

    with transaction() as txn:
        txn.set("filepath", raw_file_path)  # read back by del_file on rollback
        save_raw_data_to_file(game_data, raw_file_path)
        quality_test(raw_file_path)  # a failure here triggers the rollback
        upload_raw_data_to_s3(raw_file_path)

    # Further processing...
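
The save, check, roll back sequence can be tried without Prefect or S3. The sketch below is plain Python that mimics the behavior by hand; save_with_quality_gate and MIN_GAMES are hypothetical names, and the try/except stands in for what the transaction and the on_rollback hook do in the flow above.

```python
import json
import os
import tempfile
from typing import List

MIN_GAMES = 5  # same threshold as quality_test above


def save_with_quality_gate(games: List[dict], file_path: str) -> str:
    """Write games to file_path, then delete the file if the check fails."""
    with open(file_path, "w") as f:
        json.dump(games, f)
    try:
        with open(file_path, "r") as f:
            data = json.load(f)
        if len(data) < MIN_GAMES:
            raise ValueError(f"Not enough data: only {len(data)} games")
    except Exception:
        os.unlink(file_path)  # the "rollback": undo the write
        raise
    return file_path


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "raw_games.json")
    try:
        save_with_quality_gate([{"id": 1}, {"id": 2}], path)
    except ValueError as err:
        print(err, "| file kept?", os.path.exists(path))
```

The point of the pattern is that a file which fails the check never survives on disk, so a later step (such as the S3 upload) cannot pick up incomplete data by accident.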

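Serverless Deployment

Deploying the pipeline to a Prefect Managed Execution work pool lets Prefect Cloud provide the compute for each run, so there is no worker or server of your own to operate.

Creating the Work Pool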

prefect work-pool create managed-pool --type prefect:managed

Creating the Deployment

from prefect import flow
from pathlib import Path

def read_requirements(file_path="requirements.txt"):
    """Read the dependency file, skipping blank lines and comments."""
    lines = (line.strip() for line in Path(file_path).read_text().splitlines())
    return [line for line in lines if line and not line.startswith("#")]

if __name__ == "__main__":
    flow.from_source(
        source="https://example.com/dev-day-zoom-out.git",
        entrypoint="path/to/mlb_flow.py:mlb_flow",
    ).deploy(
        name="mlb-managed-flow",
        work_pool_name="managed-pool",
        parameters={"team_name": "phillies", "start_date": "06/01/2024", "end_date": "06/30/2024"},
        job_variables={"pip_packages": read_requirements()},  # installed before each run
    )
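
To check what ends up in pip_packages, the parsing step can be exercised on its own. The sketch below restates the helper in plain Python under a hypothetical name, parse_requirements, working on text instead of a file; the sample package lines are invented for illustration.

```python
from typing import List


def parse_requirements(text: str) -> List[str]:
    """Return requirement lines, skipping blank lines and comment lines."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


SAMPLE = """\
# pipeline dependencies (invented sample)
prefect>=3.0

duckdb
  # an indented comment is skipped too
boto3
"""

if __name__ == "__main__":
    print(parse_requirements(SAMPLE))
```

Stripping each line before testing for "#" matters: a comment that is indented would otherwise slip through and be handed to the work pool as if it were a package name.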

Adding a Schedule

flow.from_source(...).deploy(
    ...,
    cron="0 0 * * *",  # run every day at midnight
)
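
A cron expression has five space-separated fields. As a reading aid, the sketch below labels them; describe_cron is a hypothetical helper in plain Python that only splits and names the fields, and it does not validate or evaluate the schedule.

```python
from typing import Dict

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> Dict[str, str]:
    """Map each of the five cron fields to its name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


if __name__ == "__main__":
    print(describe_cron("0 0 * * *"))
```

For "0 0 * * *" the minute and hour are both 0 and the remaining fields are wildcards, which is why the schedule fires once a day at 00:00.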