1. The Task
1.1 Data
The data consist of a single column of heartbeat signal sequences; every sample is sampled at the same frequency and has the same length. To keep the competition fair, 100,000 records are drawn as the training set, 20,000 as test set A, and 20,000 as test set B, and the heartbeat class (label) information is anonymized.
Competition link: https://tianchi.aliyun.com/competition/entrance/531883/information
The fields are:
- id: unique identifier assigned to each heartbeat signal
- heartbeat_signals: the heartbeat signal sequence
- label: anonymized heartbeat class (0, 1, 2, 3)
1.2 Evaluation Metric
Contestants submit predicted probabilities for the four heartbeat classes. The submission is compared against the true heartbeat classes, and the score is the sum of absolute differences between the predicted probabilities and the true values (smaller is better).
The exact formula is as follows:
For a given signal, if the true label is $[y_1, y_2, y_3, y_4]$ and the model's predicted probabilities are $[a_1, a_2, a_3, a_4]$, then the model's overall metric abs-sum is

$$abs\text{-}sum = \sum_{j=1}^{n} \sum_{i=1}^{4} \left| y_i - a_i \right|$$
For example, a heartbeat class of 1 is one-hot encoded as $[0, 1, 0, 0]$. If the predicted probabilities for the four classes are $[0.1, 0.7, 0.1, 0.1]$, then the abs-sum of this prediction is:
$$abs\text{-}sum = \left| 0.1 - 0 \right| + \left| 0.7 - 1 \right| + \left| 0.1 - 0 \right| + \left| 0.1 - 0 \right| = 0.6$$
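The worked example above can be reproduced in a few lines of NumPy (a minimal sketch; the array values are exactly those from the example):

```python
import numpy as np

# One-hot encoding of true class 1, and the example predicted probabilities
y_true = np.array([0, 1, 0, 0])
y_pred = np.array([0.1, 0.7, 0.1, 0.1])

# abs-sum: sum of absolute differences between prediction and one-hot label
score = np.abs(y_true - y_pred).sum()
print(score)  # ≈ 0.6 (up to floating-point rounding)
```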
1.3 Submission
Before submitting, make sure the prediction file matches the format of sample_submit.csv and that the file has a .csv extension.
The format is as follows:
id,label_0,label_1,label_2,label_3
100000,0,0,0,0
100001,0,0,0,0
100002,0,0,0,0
100003,0,0,0,0
2. Baseline
2.1 Import packages
import os
import gc
import math
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')
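The baseline below assumes `x_train`, `y_train`, and `x_test` already exist as numeric DataFrames/Series; how they are built from the raw files is not shown in the original. One common sketch, using toy data in place of the real train.csv, splits each heartbeat_signals string into one float column per sampling point (the column names `s_0`, `s_1`, ... are an assumption for illustration):

```python
import pandas as pd

# Toy frame mimicking the competition format: each heartbeat_signals entry
# is a comma-separated string of samples (the real series is much longer)
df = pd.DataFrame({
    'id': [0, 1],
    'heartbeat_signals': ['0.99,0.94,0.76', '0.97,0.93,0.71'],
    'label': [0.0, 2.0],
})

# Expand the string column into one float feature per sampling point
x_train = df['heartbeat_signals'].str.split(',', expand=True).astype(float)
x_train.columns = ['s_' + str(i) for i in range(x_train.shape[1])]
y_train = df['label']
print(x_train.shape)  # (2, 3)
```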
2.2 Model training
def abs_sum(y_pre, y_tru):
    # Competition metric: total absolute difference between the
    # predicted probabilities and the one-hot true labels
    y_pre = np.array(y_pre)
    y_tru = np.array(y_tru)
    loss = sum(sum(abs(y_pre - y_tru)))
    return loss
def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    test = np.zeros((test_x.shape[0], 4))
    cv_scores = []
    onehot_encoder = OneHotEncoder(sparse=False)
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y = train_x.iloc[train_index], train_y[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y[valid_index]
        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)
            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'num_class': 4,
                'num_leaves': 2 ** 5,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': seed,
                'nthread': 28,
                'n_jobs': 24,
                'verbose': -1,
            }
            model = clf.train(params,
                              train_set=train_matrix,
                              valid_sets=valid_matrix,
                              num_boost_round=2000,
                              verbose_eval=100,
                              early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)
        # One-hot encode the validation labels so they match the 4-column
        # probability matrix when scoring with abs_sum
        val_y = np.array(val_y).reshape(-1, 1)
        val_y = onehot_encoder.fit_transform(val_y)
        print('Predicted probability matrix:')
        print(test_pred)
        test += test_pred
        score = abs_sum(val_y, val_pred)
        cv_scores.append(score)
        print(cv_scores)
    print("%s_score_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    # Average the accumulated test predictions over the folds
    test = test / kf.n_splits
    return test
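Inside cv_model, the validation labels are one-hot encoded before scoring so their shape matches the 4-column probability matrix. A minimal, version-agnostic sketch of that step (using `.toarray()` instead of `sparse=False`, which newer scikit-learn versions rename to `sparse_output`):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Column vector of class labels 0..3, as produced by reshape(-1, 1)
labels = np.array([1, 0, 3, 2]).reshape(-1, 1)

# fit_transform returns a sparse matrix by default; densify it for abs_sum
onehot = OneHotEncoder().fit_transform(labels).toarray()
print(onehot.shape)  # (4, 4)
```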
def lgb_model(x_train, y_train, x_test):
    lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_test

lgb_test = lgb_model(x_train, y_train, x_test)

temp = pd.DataFrame(lgb_test)
result = pd.read_csv('sample_submit.csv')
result['label_0'] = temp[0]
result['label_1'] = temp[1]
result['label_2'] = temp[2]
result['label_3'] = temp[3]
result.to_csv('submit.csv', index=False)