Incremental Loading of Chess.com Data and Filesystem Storage Management with dlt

Article link: https://blog.youkuaiyun.com/gitblog_00414/article/details/148574951


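Overview

This article shows how to use the dlt library to build a data pipeline that fetches chess game data from the Chess.com REST API and stores it on the filesystem. The focus is on the incremental loading strategy and on managing historical data files so that the stored data stays consistent and complete.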

Core Components

1. Data source configuration

The chess_com_source function is the core of the pipeline. It configures how the pipeline talks to the Chess.com API:

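from typing import Dict, Iterator, List

import dlt
from dlt.sources import DltResource
from dlt.sources.rest_api import RESTAPIConfig, rest_api_resources

@dlt.source
def chess_com_source(username: str, months: List[Dict[str, str]]) -> Iterator[DltResource]:
    for month in months:
        # Each entry looks like {"year": "2023", "month": "01"}
        year = month["year"]
        month_str = month["month"]
        config: RESTAPIConfig = {
            "client": {"base_url": "https://api.chess.com/pub/"},
            "resources": [{
                # One resource (and therefore one table) per month
                "name": f"chess_com_games_{year}_{month_str}",
                "write_disposition": "append",
                "endpoint": {
                    "path": f"player/{username}/games/{year}/{month_str}",
                    # Months with no data answer with 404; skip them instead of failing
                    "response_actions": [{"status_code": 404, "action": "ignore"}],
                },
                "primary_key": ["url"],
            }]
        }
        yield from rest_api_resources(config)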

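Key configuration options:

write_disposition: "append": new data is appended to what is already stored
primary_key: ["url"]: declares the game URL as the primary key. Note that dlt deduplicates on the primary key only with the merge write disposition; in append mode the key is a schema hint, so reloading a month adds new files and removing the older ones is left to the cleanup step described later
404 handling: months for which no data exists are ignored instead of failing the run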
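To make the interplay between append mode and the url key concrete, here is a minimal plain-Python sketch. It is not dlt's implementation: append_rows and dedupe_by_key are hypothetical helpers and the game URL is a made-up example. It only illustrates that appending the same month twice duplicates rows unless something deduplicates on the key or removes the earlier files.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def append_rows(table: List[Row], new_rows: List[Row]) -> List[Row]:
    """Append semantics: keep everything already stored and add the new rows."""
    return table + new_rows


def dedupe_by_key(rows: List[Row], key: str) -> List[Row]:
    """Keep one row per key value; the most recently appended row wins."""
    latest: Dict[Any, Row] = {}
    for row in rows:
        latest[row[key]] = row
    return list(latest.values())


# Hypothetical record; real Chess.com game objects carry many more fields
games = [{"url": "https://www.chess.com/game/live/1", "result": "win"}]

table = append_rows([], games)
table = append_rows(table, games)  # the same month is loaded a second time
unique = dedupe_by_key(table, "url")

print(len(table), len(unique))
```

In the pipeline described here, the duplicate files are removed by the cleanup function rather than by a key-based merge.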

2. Month range generator

The generate_months function yields every month in the given range, with both endpoints included:

from typing import Dict, Iterator

import pendulum as p

def generate_months(start_year: int, start_month: int, end_year: int, end_month: int) -> Iterator[Dict[str, str]]:
    start_date = p.datetime(start_year, start_month, 1)
    end_date = p.datetime(end_year, end_month, 1)
    current_date = start_date
    while current_date <= end_date:
        yield {"year": str(current_date.year), "month": f"{current_date.month:02d}"}
        # Step to the first day of the next month, rolling over after December
        if current_date.month == 12:
            current_date = current_date.replace(year=current_date.year + 1, month=1)
        else:
            current_date = current_date.replace(month=current_date.month + 1)

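The function uses the pendulum library for date handling and rolls over from December to January so that ranges spanning several years are generated correctly.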
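The same month sequence can also be produced without a date library by counting months on a single integer axis. The sketch below is an alternative formulation, not the article's function; generate_months_stdlib is a hypothetical name. It shows the year rollover explicitly:

```python
from typing import Dict, Iterator


def generate_months_stdlib(
    start_year: int, start_month: int, end_year: int, end_month: int
) -> Iterator[Dict[str, str]]:
    """Yield {"year", "month"} dicts for every month in the inclusive range."""
    # Map each month to one integer: index = year * 12 + (month - 1)
    first = start_year * 12 + (start_month - 1)
    last = end_year * 12 + (end_month - 1)
    for index in range(first, last + 1):
        year, month_zero_based = divmod(index, 12)
        yield {"year": str(year), "month": f"{month_zero_based + 1:02d}"}


months = list(generate_months_stdlib(2022, 11, 2023, 2))
print(months)
# [{'year': '2022', 'month': '11'}, {'year': '2022', 'month': '12'},
#  {'year': '2023', 'month': '01'}, {'year': '2023', 'month': '02'}]
```

The zero-padded month string matches the format used in the Chess.com endpoint path.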

3. Cleaning up historical data

The delete_old_backfills function removes files left over from earlier loads of a table:

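import os
import re

import dlt
from dlt.common.pipeline import LoadInfo
from dlt.destinations.impl.filesystem.filesystem import FilesystemClient

def delete_old_backfills(load_info: LoadInfo, p: dlt.Pipeline, table_name: str) -> None:
    # Here `p` is the pipeline; inside this function it shadows any module alias named `p`
    load_id = load_info.loads_ids[0]
    # Escape the id: load ids typically contain a ".", which is a regex wildcard
    pattern = re.compile(re.escape(load_id))
    # _get_destination_clients is a private dlt API and may change between versions
    fs_client: FilesystemClient = p._get_destination_clients()[0]
    table_dir = os.path.join(fs_client.dataset_path, table_name)

    if fs_client.fs_client.exists(table_dir):
        for root, _dirs, files in fs_client.fs_client.walk(table_dir):
            for file in files:
                file_path = os.path.join(root, file)
                # Anything not written by the current load is an older copy
                if not pattern.search(file_path):
                    fs_client.fs_client.rm(file_path)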

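Characteristics of the cleanup strategy:

Files to keep are identified by the ID of the current load
A regular expression built from that load ID is matched against each file path
The table directory is walked recursively, so files in nested folders are cleaned up as well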
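The cleanup steps above can be tried out on a local temporary directory without dlt. This is a minimal sketch under stated assumptions: delete_files_not_from_load is a hypothetical stand-in for delete_old_backfills, the load ids are invented, and the file names assume that each file name starts with the load id.

```python
import os
import re
import tempfile
from typing import List


def delete_files_not_from_load(table_dir: str, load_id: str) -> List[str]:
    """Delete every file under table_dir whose path lacks load_id; return the deleted paths."""
    pattern = re.compile(re.escape(load_id))  # treat the "." in the id literally
    deleted: List[str] = []
    for root, _dirs, files in os.walk(table_dir):
        for name in files:
            path = os.path.join(root, name)
            if not pattern.search(path):
                os.remove(path)
                deleted.append(path)
    return deleted


with tempfile.TemporaryDirectory() as storage:
    table_dir = os.path.join(storage, "chess_games", "chess_com_games_2023_01")
    os.makedirs(table_dir)
    # Two invented loads of the same table: an older one and the current one
    for file_name in ("1700000000.111.abc.jsonl", "1700000999.222.def.jsonl"):
        with open(os.path.join(table_dir, file_name), "w") as handle:
            handle.write("{}\n")

    deleted = delete_files_not_from_load(table_dir, "1700000999.222")
    remaining = sorted(os.listdir(table_dir))

print(remaining)  # ['1700000999.222.def.jsonl']
```

Only the file written by the current load survives, which is the behaviour the pipeline relies on to avoid keeping several copies of the same month.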

Practical Usage

1. Initializing the pipeline

dest_ = dlt.destinations.filesystem("_storage")
pipeline = dlt.pipeline(
    pipeline_name="chess_com_data", 
    destination=dest_, 
    dataset_name="chess_games"
)

Here the filesystem is configured as the destination, so the data is saved under the local _storage directory.

2. Running the load

months = list(generate_months(2023, 1, 2023, 12))
source = chess_com_source("MagnusCarlsen", months)
info = pipeline.run(source)

This example fetches the games of the account MagnusCarlsen for all of 2023.
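The run shown here loads the data but does not call the cleanup function from section 3. Because every month becomes its own table, the cleanup would have to be invoked once per table name. The sketch below shows one way to wire that up; table_names_for and cleanup_tables are hypothetical helpers, and the cleanup itself is passed in as a callable so the example runs without dlt.

```python
from typing import Callable, Dict, List


def table_names_for(months: List[Dict[str, str]]) -> List[str]:
    """Mirror the resource naming pattern used in chess_com_source."""
    return [f"chess_com_games_{m['year']}_{m['month']}" for m in months]


def cleanup_tables(
    months: List[Dict[str, str]], cleanup: Callable[[str], None]
) -> List[str]:
    """Call cleanup(table_name) for each monthly table and return the names visited."""
    names = table_names_for(months)
    for name in names:
        cleanup(name)
    return names


months = [{"year": "2023", "month": "01"}, {"year": "2023", "month": "02"}]
visited: List[str] = []
cleanup_tables(months, visited.append)  # visited.append stands in for the real cleanup
print(visited)
# ['chess_com_games_2023_01', 'chess_com_games_2023_02']
```

In the actual pipeline, the callable would presumably be something like lambda name: delete_old_backfills(info, pipeline, name), called after pipeline.run has returned.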

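Technical Highlights

Incremental loading: write_disposition: "append" adds each run's data to the existing dataset, with the game URL declared as the primary key
Data consistency: old data files are removed with delete_old_backfills after a load, so reloaded months do not leave duplicate data behind
Error handling: 404 responses from the API are ignored, which keeps the pipeline running when a month has no data
Flexible time ranges: data can be fetched for any span of months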

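Best Practices

Data partitioning: store data partitioned by year and month to simplify management and querying
Error monitoring: add logging and alerting to monitor failed API calls
Performance: for large data volumes, consider fetching different months in parallel
Storage: archive old data periodically to free up storage space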
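The monitoring and performance suggestions can be combined in a small sketch that fetches months on a thread pool and logs failures. It is an illustration only: fetch_months_in_parallel and fake_fetch are hypothetical names, and the fetch function is simulated so the example runs without network access.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple

logger = logging.getLogger("chess_backfill")

Month = Dict[str, str]


def fetch_months_in_parallel(
    months: List[Month],
    fetch: Callable[[Month], List[dict]],
    max_workers: int = 4,
) -> Tuple[Dict[str, List[dict]], List[str]]:
    """Run fetch(month) on a thread pool; return (results by "YYYY-MM", failed labels)."""
    results: Dict[str, List[dict]] = {}
    failed: List[str] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, m): f"{m['year']}-{m['month']}" for m in months}
        for future in as_completed(futures):
            label = futures[future]
            try:
                results[label] = future.result()
            except Exception:
                # Log with traceback so a failing month is visible to monitoring
                logger.exception("Fetching %s failed", label)
                failed.append(label)
    return results, failed


def fake_fetch(month: Month) -> List[dict]:
    """Simulated API call: February fails, other months return one game."""
    if month["month"] == "02":
        raise RuntimeError("simulated API error")
    return [{"url": f"game-{month['year']}-{month['month']}"}]


months = [{"year": "2023", "month": f"{m:02d}"} for m in (1, 2, 3)]
results, failed = fetch_months_in_parallel(months, fake_fetch)
print(sorted(results), failed)
# ['2023-01', '2023-03'] ['2023-02']
```

A public API may throttle concurrent requests, so max_workers should be kept small and failed months retried rather than dropped.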

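Summary

The dlt-based approach shown in this article offers an efficient and reliable way to fetch and store Chess.com data. Append-based incremental loading combined with cleanup of files from earlier loads keeps the data complete while limiting storage use. The same pattern extends readily to other, similar data sources.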

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.