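Dataset
The training set gives five days of COVID-19 tested-positive data (plus related features) for a number of US states. The test set gives the same data without the fifth day's tested_positive value, and the goal is to predict that value. Download: ML2022Spring-hw1 | Kaggle
The features are:
● States (37, one-hot encoded)
● COVID-like illness (4)
○ cli, ili …
● Behavior Indicators (8)
○ wearing_mask, travel_outside_state …
● Mental Health Indicators (3)
○ anxious, depressed …
● Tested Positive Cases (1)
○ tested_positive (this is what we want to predict)
The training set has 2699 rows and 118 columns (id + 37 states + 16 features x 5 days).
The test set has 1078 rows and 117 columns (without the last day's positive rate).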
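The column arithmetic above can be checked with a short sketch. It assumes the columns are ordered as id, then the 37 state columns, then five consecutive 16-feature blocks (one per day); that ordering and the helper name `column_layout` are illustrative assumptions, not something the dataset description guarantees.

```python
N_STATES = 37          # one-hot state columns
N_FEATS_PER_DAY = 16   # 4 illness + 8 behavior + 3 mental health + 1 tested_positive
N_DAYS = 5


def column_layout(include_id=True):
    """Return a dict mapping block names to column index ranges (assumed order)."""
    offset = 1 if include_id else 0
    layout = {"states": range(offset, offset + N_STATES)}
    start = offset + N_STATES
    for day in range(N_DAYS):
        begin = start + day * N_FEATS_PER_DAY
        layout[f"day{day + 1}"] = range(begin, begin + N_FEATS_PER_DAY)
    return layout


layout = column_layout()
n_train_cols = layout["day5"].stop   # id + states + 5 days of features
n_test_cols = n_train_cols - 1       # test set lacks the last tested_positive column
print(n_train_cols, n_test_cols)
```

Under these assumptions the counts come out to 118 and 117, matching the shapes stated above.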
Imports
# Numerical Operations
import math
import numpy as np
from sklearn.model_selection import train_test_split
# Reading/Writing Data
import pandas as pd
import os
import csv
# For Progress Bar
from tqdm import tqdm
from d2l import torch as d2l
# Pytorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, random_split, TensorDataset
# For plotting learning curve
from torch.utils.tensorboard import SummaryWriter
Helper functions
Setting the seed
As I understand it, both model initialization and the train/validation split depend on the random seed, so it is fixed here.
def same_seed(seed):
    '''Fixes random number generator seeds for reproducibility.'''
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
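To see what fixing the seed buys, here is a minimal sketch that initializes the same layer under different seeds and compares the weights. `set_seed` and `init_weights` are hypothetical helpers written for this illustration (a CPU-only stand-in for same_seed), and the layer size is arbitrary.

```python
import numpy as np
import torch
import torch.nn as nn


def set_seed(seed):
    # Hypothetical compact stand-in for same_seed (CPU generators only).
    np.random.seed(seed)
    torch.manual_seed(seed)


def init_weights(seed):
    # Seed first, then build a layer: its random init now depends only on the seed.
    set_seed(seed)
    layer = nn.Linear(4, 2)
    return layer.weight.detach().clone()


w_a = init_weights(0)
w_b = init_weights(0)
w_c = init_weights(1)

same = torch.equal(w_a, w_b)           # same seed, same initial weights
different = not torch.equal(w_a, w_c)  # different seed, different weights
print(same, different)
```

The same reasoning applies to the train/validation split, which is why the seed is passed into the split functions as well.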
Splitting the dataset
One option is the random_split function:
def train_valid_split(data_set, valid_ratio, seed):
    '''Split provided training data into training set and validation set'''
    valid_set_size = int(valid_ratio * len(data_set))
    train_set_size = len(data_set) - valid_set_size
    train_set, valid_set = random_split(data_set, [train_set_size, valid_set_size],
                                        generator=torch.Generator().manual_seed(seed))
    return np.array(train_set), np.array(valid_set)
Another option is the train_test_split function:
def train_valid_split(data_set, valid_ratio, seed):
    '''Split provided training data into training set and validation set'''
    train_set, valid_set = train_test_split(data_set, test_size=valid_ratio, random_state=seed)
    return np.array(train_set), np.array(valid_set)
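The two variants do not necessarily produce identical set sizes. The sketch below only reproduces the size arithmetic: the first function mirrors the `int(...)` truncation in the random_split version, and the second assumes that train_test_split rounds a fractional test_size up (my understanding of scikit-learn's behavior). Both function names and the ratio of 0.2 are hypothetical, chosen for illustration.

```python
import math


def random_split_sizes(n_samples, valid_ratio):
    # Mirrors the random_split variant: int() truncates the validation size.
    valid = int(valid_ratio * n_samples)
    return n_samples - valid, valid


def sklearn_style_sizes(n_samples, valid_ratio):
    # Assumption: train_test_split rounds a fractional test_size up (ceil).
    valid = math.ceil(valid_ratio * n_samples)
    return n_samples - valid, valid


n = 2699  # number of training rows stated earlier
sizes_random_split = random_split_sizes(n, 0.2)
sizes_sklearn = sklearn_style_sizes(n, 0.2)
print(sizes_random_split, sizes_sklearn)
```

With 2699 rows and a ratio of 0.2, the two approaches differ by one sample, and they also shuffle differently, so validation scores from the two variants are not directly comparable.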
Model
The most important parts of this assignment are the model, the feature selection, and the hyperparameter settings. The original code is shown here.
class My_Model(nn.Module):
    def __init__(self, input_dim):
        super(My_Model, self).__init__()
        # TODO: modify model's structure, be aware of dimensions.
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1)
        )

    def forward(self, x):
        x = self.layers(x)
        x = x.squeeze(1)  # (B, 1) -> (B)
        return x
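A quick shape check of this architecture can be sketched as follows. The layers are rebuilt inline so the snippet runs on its own, and `input_dim = 117` and the batch size of 4 are hypothetical values picked for illustration; the real input dimension depends on which features are selected.

```python
import torch
import torch.nn as nn

input_dim = 117  # hypothetical; depends on the selected features

# Same layer stack as My_Model, rebuilt inline for a standalone check.
layers = nn.Sequential(
    nn.Linear(input_dim, 16),
    nn.ReLU(),
    nn.Linear(16, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)

x = torch.randn(4, input_dim)   # a dummy batch of 4 samples
raw = layers(x)                 # shape (4, 1)
pred = raw.squeeze(1)           # shape (4,), matching a 1-D target vector

# Weights plus biases of the three linear layers.
n_params = sum(p.numel() for p in layers.parameters())
expected_params = (input_dim * 16 + 16) + (16 * 8 + 8) + (8 * 1 + 1)
print(tuple(raw.shape), tuple(pred.shape), n_params)
```

The squeeze matters because the targets are 1-D: comparing a (B, 1) prediction with a (B) target in a loss such as nn.MSELoss would broadcast the two against each other instead of pairing them element by element.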
Feature selection
I modified the original code here. Because TensorDataset is used and its arguments must be tensors, train_data, valid_data, and test_data have to be converted to tensors first.
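As a minimal sketch of that conversion, the snippet below turns a NumPy array into a float tensor, separates features from the target, and wraps them in a TensorDataset. The random array, its shape, and the convention that the last column is the target are assumptions made for illustration only.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the training data: 10 rows, last column is the target.
rng = np.random.default_rng(0)
train_np = rng.random((10, 5))

# Convert to a float32 tensor before building the dataset.
train_t = torch.FloatTensor(train_np)
x_train, y_train = train_t[:, :-1], train_t[:, -1]

dataset = TensorDataset(x_train, y_train)
loader = DataLoader(dataset, batch_size=4, shuffle=False)

xb, yb = next(iter(loader))  # first mini-batch of features and targets
print(tuple(xb.shape), tuple(yb.shape), xb.dtype)
```

Each item of the dataset is a (features, target) pair, so the training loop can unpack batches directly.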
def select_feat(train_data, valid_data, test_data, select_all=True):
    '''Selects useful features to perform regression'''
    train_data = torch.FloatTensor(train_data)
    valid_data = torch.FloatTensor(valid_data)
test_d