Datawhale and Tianchi jointly organized the beginner competition "Zero-Based Introduction to Data Mining: Heartbeat Signal Classification and Prediction". Competition link:
https://tianchi.aliyun.com/competition/entrance/531883/introduction
1. Reading the Data
First, download the data from the competition website. Assuming it is placed under the data directory, load it:
import pandas as pd
import numpy as np
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/testA.csv')
train.head()
train.shape
test.shape
First, define a function that reduces a DataFrame's memory footprint:
def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    for col in df.columns:
        col_type = df[col].dtype
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')
    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
The memory savings come from downcasting dtypes: a column stored as int64 that fits into int8 without losing precision shrinks to 1/8 of its original footprint. Note that the float16 branch is lossy (float16 carries only about 3 decimal digits of precision), so use it with care if exact values matter.
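As a quick illustration of the effect (a minimal sketch with made-up values; the exact byte counts vary slightly because memory_usage() also counts the index):

s = pd.Series(np.random.randint(0, 100, size=100_000), dtype=np.int64)
print(s.memory_usage())                  # about 800 KB: 8 bytes per value
print(s.astype(np.int8).memory_usage())  # about 100 KB: 1 byte per value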
The feature column in the train/test sets is a single comma-separated string, so it needs to be parsed into floating-point values:
# Parse the training-set features
train_list = []
for items in train.values:
    train_list.append([items[0]] + [float(i) for i in items[1].split(',')] + [items[2]])
train = pd.DataFrame(np.array(train_list))
train.columns = ['id'] + ['s_' + str(i) for i in range(len(train_list[0]) - 2)] + ['label']
train = reduce_mem_usage(train)
# Parse the test-set features
test_list = []
for items in test.values:
    test_list.append([items[0]] + [float(i) for i in items[1].split(',')])
test = pd.DataFrame(np.array(test_list))
test.columns = ['id'] + ['s_' + str(i) for i in range(len(test_list[0]) - 1)]
test = reduce_mem_usage(test)
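As an aside, the row-wise Python loop above can be slow on large data. A vectorized alternative is sketched below; it assumes the raw CSV columns are named id, heartbeat_signals, and label, which this post does not show, so verify the names against your data:

# Vectorized parsing sketch (assumed column names; check the raw CSV)
raw = pd.read_csv('data/train.csv')
sig = raw['heartbeat_signals'].str.split(',', expand=True).astype(np.float32)
sig.columns = ['s_' + str(i) for i in range(sig.shape[1])]
train_fast = pd.concat([raw[['id']], sig, raw[['label']]], axis=1)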
Extract the training features and labels, plus the test features:
x_train = train.drop(['id','label'], axis=1)
y_train = train['label']
x_test = test.drop(['id'], axis=1)
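A quick sanity check on shapes and class balance never hurts before training:

print(x_train.shape, y_train.shape, x_test.shape)
print(y_train.value_counts())  # distribution of the 4 classes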
2. Building a Baseline with LightGBM
First, bluntly import a pile of packages that might come in handy (if any are missing, install them via pip/conda):
import os
import gc
import math
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
Define the metric function:
def abs_sum(y_pre, y_tru):
    # y_pre: predicted probability matrix (n_samples x n_classes)
    # y_tru: one-hot matrix of the true labels
    y_pre = np.array(y_pre)
    y_tru = np.array(y_tru)
    loss = np.abs(y_pre - y_tru).sum()
    return loss
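A tiny worked example of the metric (the numbers are illustrative only):

y_tru = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]])
y_pre = np.array([[0.9, 0.05, 0.03, 0.02],
                  [0.2, 0.6, 0.1, 0.1]])
print(abs_sum(y_pre, y_tru))  # 0.2 + 0.8 = 1.0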
Train with cross-validation and collect predictions on the test set. (Note: to keep the CPU from being fully saturated, adjust the nthread parameter to match your CPU core count.)
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    test = np.zeros((test_x.shape[0], 4))
    cv_scores = []
    # Note: scikit-learn >= 1.2 renamed this argument to sparse_output=False
    onehot_encoder = OneHotEncoder(sparse=False)
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'num_class': 4,
                'num_leaves': 2 ** 5,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': seed,
                'nthread': 10,
                'verbose': -1,
            }
            # Note: LightGBM >= 4.0 removed verbose_eval/early_stopping_rounds
            # from train(); use callbacks=[lgb.log_evaluation(100), lgb.early_stopping(200)] there
            model = clf.train(params,
                              train_set=train_matrix,
                              valid_sets=valid_matrix,
                              num_boost_round=2000,
                              verbose_eval=100,
                              early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        # One-hot encode the validation labels to match the probability matrix
        val_y = np.array(val_y).reshape(-1, 1)
        val_y = onehot_encoder.fit_transform(val_y)
        print('Predicted probability matrix:')
        print(test_pred)
        test += test_pred
        score = abs_sum(val_y, val_pred)
        cv_scores.append(score)
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    # Average the accumulated test predictions over the K folds
    test = test / kf.n_splits
    return test
Wrap the training call in a thin helper function (strictly speaking, optional):
def lgb_model(x_train, y_train, x_test):
    lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_test
Start training (in my environment, the server CPU has 80 cores, so I set nthread to 50; the actual run took a little over two minutes):
lgb_test = lgb_model(x_train, y_train, x_test)
At this point, the lgb_test variable holds the prediction results for the test set. Save them to a file for submission (the os.makedirs call below creates the output folder if it does not already exist):
result = pd.read_csv('data/sample_submit.csv')
result['label_0'] = lgb_test[:, 0]
result['label_1'] = lgb_test[:, 1]
result['label_2'] = lgb_test[:, 2]
result['label_3'] = lgb_test[:, 3]
os.makedirs('output', exist_ok=True)  # create the output folder if missing
result.to_csv(f'output/submit-{time.strftime("%m%d_%H%M")}.csv', index=False)
Submit the submit-xxx.csv file; the final score is around 559. (Nowhere near the leaderboard anyway o(╥﹏╥)o)
3. References
- ECG Heartbeat Signal Classification - GitHub