AI新能源发电功率预测(第三届世界科学智能大赛)

最新推荐文章于 2025-07-17 09:44:39 发布

SinkAboutIt

最新推荐文章于 2025-07-17 09:44:39 发布

阅读量1.2k

点赞数 15

CC 4.0 BY-SA版权

文章标签：人工智能 python 大数据

本文链接：https://blog.youkuaiyun.com/SinkAboutIt/article/details/147350787

本次比赛主要是要进行风电和光伏电场的功率预测，基于气象数据预测10个新能源场站的功率进行推理。

一、比赛数据说明：

1.气象数据

来自三个不同的气象预报源（NWP_1, NWP_2, NWP_3），每个气象源提供8个变量，包含风速、气温、降水量、云量、辐照度等数据。
每个气象数据文件以.nc格式保存，包含未来24小时的1小时间隔气象预报，文件中有5个维度：
- time：世界时刻（UTC时间）。
- channel：表示气象变量，共8个（例如：风速、气温等）。
- hour：从预报起始时间到每个预报时刻的时间间隔，范围是0到23小时。
- lat和lon：分别是纬度和经度，标识气象数据的空间位置。

本次要求你使用这些气象数据来预测新能源场站的功率，可以选择使用一个或多个气象源的数据，但不需要使用所有的变量。

2.场站实发功率

提供了10个新能源场站（5个风电，5个光伏）的实际功率数据，已经进行了归一化处理。数据的时间间隔为15分钟（北京时间），因此每一天有96个数据点（24小时×4个时间段）。数据中可能会存在缺失值或异常值，需要注意这些问题，可能会影响最终的精度评估。

3.气象变量说明

给出了8个气象变量的具体含义和单位，如风速（u100、v100），气温（t2m），降水量（tp），辐照度（poai和ghi）等。不同气象源的数据结构稍有不同，尤其是NWP_2多了一个msl（海平面气压）。

要求格式：

Input/
├── 1/
│   ├── NWP_1/
│   │   ├── 20240101.nc
│   │   ├── 20240102.nc
│   │   └── ...
│   ├── NWP_2/
│   ├── NWP_3/
├── 2/
├── ...
Output/
├── output1.csv
├── output2.csv
├── ...

评测指标：
统计精度时，每个场站每天单独计算一个精度值，并对其在所有评测天数内的精度值求平均，得到该场站的整体精度。最终精度为十个场站精度的平均值。

每日预测准确率（CR）

每个场站在一天内的预测功率与实际功率之间的差异，以此来计算准确率。公式中涉及到计算每个时段的误差，然后将所有时段的误差进行平均。

场站预测准确率 （Cf）

每个场站的预测准确率是对该场站所有预测日（通常为1天）的准确率的平均。

二、分析本次赛制baseline代码

这里不会从头进行赘述，只简介其中思想。

加载数据

from netCDF4 import Dataset
import numpy as np
import pandas as pd

nc_path = "data/初赛训练集/nwp_data_train/1/NWP_1/20240101.nc"
dataset = Dataset(nc_path, mode='r')     #使用只读格式打开文件
dataset.variables.keys() #访问dataset对象的variables属性，并返回字典的keys，即所有变量名称

加载了一个.nc格式的气象数据文件，查看其中的变量。

channel = dataset.variables["channel"][:]
data = dataset.variables["data"][:]
mean_values = np.array([np.mean(data[:, :, i, :, :][0], axis=(1, 2)) for i in range(8)]).T

提取气象数据并计算每个通道的平均值。

训练数据构建

def get_data(path_template, date):
# 加载指定日期的.nc文件
    dataset = Dataset(path_template.format(date), mode='r')
# 获取气象变量名称列表
    channel = dataset.variables["channel"][:]
# 获取气象数据数组
    data = dataset.variables["data"][:]
# 计算各通道的平均值
    mean_values = np.array([np.mean(data[:, :, i, :, :][0], axis=(1, 2)) for i in range(8)]).T
    return pd.DataFrame(mean_values, columns=channel) # 返回DataFrame

train_path_template = "data/初赛训练集/nwp_data_train/1/NWP_1/{}.nc"
data = [get_data(train_path_template, i) for i in date]
train = pd.concat(data, axis=0).reset_index(drop=True)

目标数据加载

target = pd.read_csv("data/初赛训练集/fact_data/1_normalization_train.csv")
target = target[96:]
target = target[target['时间'].str.endswith('00:00')]

去除前96行数据，筛选整点数据

假设原始数据是这样的：

时间	功率(MW)
2024-01-01 00:00	10.5
2024-01-01 00:15	10.8
2024-01-01 00:30	11.2
2024-01-01 00:45	11.0
...	...

经过处理后：

时间	功率(MW)
2024-01-01 00:00	10.5
2024-01-01 01:00	12.1
2024-01-01 02:00	13.5
...	...

测试数据构建

test_path_template = "data/初赛测试集/nwp_data_test/1/NWP_1/{}.nc"
date_range = pd.date_range(start='2024-12-31', end='2025-02-27')
date = [date.strftime('%Y%m%d') for date in date_range]
data = [get_data(test_path_template, i) for i in date]
test = pd.concat(data, axis=0).reset_index(drop=True)

模型训练

由于本次数据量中等，可以使用LightGBM（基于梯度提升决策树，GBDT），与5 折交叉验证（KFold）。

def cv_model(clf, train_x, train_y, test_x):
    # 5折交叉验证设置
    folds = 5
    kf = KFold(n_splits=folds, shuffle=True)  # 数据随机打乱
    oof = np.zeros(train_x.shape[0])          # 存储验证集预测结果
    test_predict = np.zeros(test_x.shape[0])  # 存储测试集预测结果
    cv_scores = []                            # 存储每折的评估分数
    
    # 开始交叉验证
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        # 划分训练集和验证集
        trn_x, trn_y = train_x.iloc[train_index], train_y[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y[valid_index]
        
        # 转换为LightGBM数据集格式
        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        
        # LightGBM参数配置
        params = {
            'boosting_type': 'gbdt',       # 使用GBDT算法
            'objective': 'regression',     # 回归任务
            'metric': 'rmse',             # 评估指标为RMSE
            'num_leaves': 256,            # 叶子节点数
            'learning_rate': 0.1,         # 学习率
            'feature_fraction': 0.8,      # 特征采样比例
            'lambda_l2': 10,              # L2正则化
            'verbose': -1                 # 不显示训练日志
        }
        
        # 训练模型（3000轮，带早停）
        model = clf.train(params, train_matrix, 3000, valid_sets=[valid_matrix])
        
        # 预测验证集和测试集
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        
        # 保存结果
        oof[valid_index] = val_pred                     # 验证集预测
        test_predict += test_pred / kf.n_splits         # 测试集预测平均
        
        # 计算评估分数（1/(1+RMSE)）
        score = 1/(1+np.sqrt(mean_squared_error(val_pred, val_y)))
        cv_scores.append(score)
            
    return oof, test_predict  # 返回验证集和测试集预测结果

# 调用函数训练模型
lgb_oof, lgb_test = cv_model(lgb, train, target['功率(MW)'], test)

结果输出

output = pd.read_csv("data/output/output1.csv").reset_index(drop=True)
output["power"] = lgb_test
# 将数据重复4次
lgb_test = [item for item in lgb_test for _ in range(4)]

三、改进代码

加载数据的方式与baseline加载方式类似，就不赘述。这里只更改模型，将使用transformer同时加入物理约束进行优化模型

首先通过使用 PositionalEncoding 对输入数据进行位置编码，解决序列数据中位置信息丢失的问题。

class PositionalEncoding(nn.Module):
    """位置编码"""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """添加位置编码"""
        x = x + self.pe[:x.size(1)]
        return x

使用了Transformer架构的编码器部分来处理气象序列数据。这个编码器通过自注意力机制（Self-Attention）学习气象数据之间的时序关系。然后从编码器输出的特征中预测未来96个时间点的功率。

同时根据是否为光伏场站（is_solar）应用不同的约束，例如：

光伏场站只能在白天的有效发电时间内有功率输出（6:00到18:00）。
风电场站的功率输出不能超过最大风电功率。

加入物理约束，具体代码如下：

class Transformer(nn.Module):
    def __init__(self, input_dim: int = 8, num_heads: int = 4, forecast_steps: int = 96):
        super().__init__()
        self.input_dim = input_dim
        self.forecast_steps = forecast_steps

        # 输入嵌入层
        self.embedding = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.GELU(),
            nn.LayerNorm(64)
        )

        # Transformer编码器
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=64, nhead=num_heads, dim_feedforward=256,
            dropout=0.1, activation='gelu', batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)

        # 解码器
        self.decoder = nn.Sequential(
            nn.Linear(64, 128),
            nn.GELU(),
            nn.Dropout(0.2),
            nn.Linear(128, forecast_steps)
        )

        # 位置编码
        self.pos_encoder = PositionalEncoding(64)

    def forward(self, x: torch.Tensor, is_solar: bool = False) -> torch.Tensor:
        """输入: (batch, lookback, input_dim)"""
        x = self.embedding(x)  # (batch, lookback, 64)
        # 位置编码
        x = self.pos_encoder(x)
        x = self.transformer(x)
        x = x.mean(dim=1)  # (batch, 64)
        # 预测
        output = self.decoder(x)
        # 物理约束
        output = self.apply_constraints(output, is_solar)
        return output
    def apply_constraints(self, outputs: torch.Tensor, is_solar: bool) -> torch.Tensor:
        """应用物理约束"""
        outputs = torch.clamp(outputs, min=0)  # 所有功率非负

        if is_solar:  # 光伏场站
            hour_idx = torch.arange(24).repeat_interleave(4)[:96].to(device)
            mask = (hour_idx < SOLAR_HOURS[0]) | (hour_idx >= SOLAR_HOURS[1])
            outputs[..., mask] = 0
        else:  # 风电场站
            outputs = torch.clamp(outputs, max=MAX_WIND_POWER)

        return outputs

训练模型过程使用自适应学习率，优化器使用AdamW

训练过程中使用 Huber损失 来度量模型的误差

def train_model(dataset: SimpleDataset, model: nn.Module, epochs: int = 20):
    """训练模型"""
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=0)

    optimizer = optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, epochs)
    criterion = nn.HuberLoss()

    model.train()
    for epoch in range(epochs):
        total_loss = 0
        pbar = tqdm(loader, desc=f"Epoch {epoch + 1}/{epochs}")

        for nwp_seq, power in pbar:
            nwp_seq, power = nwp_seq.to(device), power.to(device)

            optimizer.zero_grad()
            pred = model(nwp_seq, dataset.is_solar)
            loss = criterion(pred, power)
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

            total_loss += loss.item()
            pbar.set_postfix(loss=loss.item())

        scheduler.step()
        print(f"Epoch {epoch + 1} Avg Loss: {total_loss / len(loader):.4f}")

预测过程中，对于异常值进行补零操作

def predict(model: nn.Module, dataset: SimpleDataset, expected_length: int = 5568) -> np.ndarray:
    """生成预测结果（修复维度不匹配问题）"""
    loader = DataLoader(dataset, batch_size=32, shuffle=False, num_workers=0)

    model.eval()
    all_preds = []
    with torch.no_grad():
        for nwp_seq, _ in tqdm(loader, desc="Predicting"):
            # 确保输入数据形状正确
            if nwp_seq.dim() == 2:
                nwp_seq = nwp_seq.unsqueeze(0)  # 添加batch维度

            pred = model(nwp_seq.to(device), dataset.is_solar)

            # 确保预测输出形状正确
            if pred.dim() == 1:
                pred = pred.unsqueeze(0)  # 添加batch维度

            all_preds.append(pred.cpu().numpy())

    # 合并所有预测结果
    if len(all_preds) > 0:
        preds = np.concatenate(all_preds, axis=0)  # 沿batch维度连接
        preds = preds.reshape(-1)  # 展平为一维数组

        # 调整到预期长度
        if len(preds) < expected_length:
            # 不足部分用最后一天的数据循环填充
            last_day = preds[-96:]
            repeat_times = (expected_length - len(preds)) // 96 + 1
            padding = np.tile(last_day, repeat_times)[:expected_length - len(preds)]
            preds = np.concatenate([preds, padding])
        elif len(preds) > expected_length:
            # 超过部分截断
            preds = preds[:expected_length]
    else:
        # 如果没有预测结果，生成全零数组
        preds = np.zeros(expected_length)

    return preds

最终将outputs放入zip文件打包

def save_predictions(preds: np.ndarray, station_id: int):
    """保存预测结果（确保格式正确）"""
    os.makedirs("output", exist_ok=True)
    assert len(preds) == 5568, f"预测结果长度应为5568，实际为{len(preds)}"
    # 生成时间戳
    timestamps = pd.date_range(
        start="2025-01-01",
        periods=5568,
        freq="15T"
    ).strftime("%Y/%m/%d %H:%M")
    # 创建DataFrame
    df = pd.DataFrame({
        "": timestamps,
        "power": preds[:5568]  # 确保只取5568个点
    })
    # 保存文件
    output_path = f"output/output{station_id}.csv"
    df.to_csv(output_path, index=False)
    print(f"场站 {station_id} 预测结果已保存，共 {len(preds)} 个数据点")
def main():
    try: