2020-03-16 DW Data Mining: ECG Baseline

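This competition focuses on using machine learning algorithms to predict the class of ECG heartbeat signals. The dataset contains over 200,000 ECG signal records, all of the same length. Participants train models with algorithms such as XGBoost and LightGBM, and prediction accuracy is scored by the sum of absolute errors.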

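Competition overview
  • Participants are asked to build a model on the given dataset that predicts the type of each heartbeat signal.
  • The task is to predict the class of ECG heartbeat signals. The data comes from the ECG records of a platform and totals more than 200,000 records.
  • The data consists mainly of a single column of heartbeat signal sequences; every sample is sampled at the same frequency and has the same length.
  • From this data, 100,000 records are drawn as the training set, 20,000 as test set A, and 20,000 as test set B. The heartbeat class (label) information is anonymized.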
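Data overview
  1. train.csv
  • id: unique identifier assigned to the heartbeat signal
  • heartbeat_signals: the heartbeat signal sequence (values separated by ",")
  • label: heartbeat signal class (0, 1, 2, 3)
  2. testA.csv
  • id: unique identifier assigned to the heartbeat signal
  • heartbeat_signals: the heartbeat signal sequence (values separated by ",")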
Methods
  • xgb, lgb, catboost (only the lgb model is trained in the code below)
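Scoring
  • The absolute differences between the predictions and the true values are summed; a lower score means a more accurate prediction.
    Participants submit the predicted probability of each of the 4 heartbeat classes. The submission is compared with the actual heartbeat class by taking the absolute value of the difference between each predicted probability and the true value.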

The formula is as follows.

There are n cases in total. For a given signal, if the true value is [y1,y2,y3,y4] and the model's predicted probabilities are [a1,a2,a3,a4], then the evaluation metric abs-sum of the model is
$$\text{abs-sum} = \sum_{j=1}^{n} \sum_{i=1}^{4} \left| y_i - a_i \right|$$
where the outer sum runs over the n signals and y_i, a_i are the values for the j-th signal.
For example, a heartbeat signal of class 1 is encoded as [0,1,0,0]. If the predicted probabilities of the heartbeat classes are [0.1,0.7,0.1,0.1], the abs-sum of this prediction is
$$\text{abs-sum} = \left| 0.1-0 \right| + \left| 0.7-1 \right| + \left| 0.1-0 \right| + \left| 0.1-0 \right| = 0.6$$
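The worked example above can be checked numerically. The following is a minimal sketch that assumes only NumPy; the helper names `one_hot` and `abs_sum_metric` are illustrative and are not part of the competition code.

```python
import numpy as np

def one_hot(labels, num_classes=4):
    """Encode integer class labels as one-hot rows."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes)[labels]

def abs_sum_metric(y_true_onehot, y_prob):
    """Sum of |y_i - a_i| over all classes and all signals."""
    diff = np.asarray(y_true_onehot, dtype=float) - np.asarray(y_prob, dtype=float)
    return float(np.abs(diff).sum())

# One signal of class 1 with the predicted probabilities from the example.
y_true = one_hot([1])
y_prob = np.array([[0.1, 0.7, 0.1, 0.1]])
score = abs_sum_metric(y_true, y_prob)
print(round(score, 6))  # 0.6

# A fully confident wrong prediction gives the worst per-signal score.
worst = abs_sum_metric(one_hot([1]), [[1.0, 0.0, 0.0, 0.0]])
print(worst)  # 2.0
```

Each signal therefore contributes between 0 (perfect) and 2 (fully confident and wrong) to the total.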

import os
import gc
import math

import pandas as pd
import numpy as np

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
from sklearn.linear_model import SGDRegressor, LinearRegression, Ridge
from sklearn.preprocessing import MinMaxScaler


from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

from tqdm import tqdm
import matplotlib.pyplot as plt
import time
import warnings
warnings.filterwarnings('ignore')

Loading the data

train = pd.read_csv('train.csv')
test=pd.read_csv('testA.csv')
train.head()
   id                                  heartbeat_signals  label
0   0  0.9912297987616655,0.9435330436439665,0.764677...    0.0
1   1  0.9714822034884503,0.9289687459588268,0.572932...    0.0
2   2  1.0,0.9591487564065292,0.7013782792997189,0.23...    2.0
3   3  0.9757952826275774,0.9340884687738161,0.659636...    0.0
4   4  0.0,0.055816398940721094,0.26129357194994196,0...    2.0
test.head()
       id                                  heartbeat_signals
0  100000  0.9915713654170097,1.0,0.6318163407681274,0.13...
1  100001  0.6075533139615096,0.5417083883163654,0.340694...
2  100002  0.9752726292239277,0.6710965234906665,0.686758...
3  100003  0.9956348033996116,0.9170249621481004,0.521096...
4  100004  1.0,0.8879490481178918,0.745564725322326,0.531...

Data preprocessing

def reduce_mem_usage(df):
    start_mem = df.memory_usage().sum() / 1024**2 
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))
    
    for col in df.columns:
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype('category')

    end_mem = df.memory_usage().sum() / 1024**2 
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    
    return df
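One consequence of the downcasting above is reduced precision: a column whose values fit in the float16 range is stored as float16, which keeps only about three significant decimal digits. The sketch below assumes only NumPy. It shows the effect on the first signal value of the first training sample (0.9912297987616655 in train.csv) and why id columns with values above the float16 maximum fall through to float32.

```python
import numpy as np

raw = 0.9912297987616655           # first value of the first training signal
as_f16 = np.float16(raw)           # what the float16 branch stores
print(float(as_f16))               # 0.9912109375

f16_max = float(np.finfo(np.float16).max)
print(f16_max)                     # 65504.0

# Training ids go up to 99999, which exceeds the float16 range,
# so the range check sends the id column to the float32 branch.
print(99999 < f16_max)             # False
```

The rounding is small relative to the signal range of 0 to 1, but it is a lossy step; skipping the float16 branch is an option if memory allows.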
train.columns
Index(['id', 's_0', 's_1', 's_2', 's_3', 's_4', 's_5', 's_6', 's_7', 's_8',
       ...
       's_196', 's_197', 's_198', 's_199', 's_200', 's_201', 's_202', 's_203',
       's_204', 'label'],
      dtype='object', length=207)
# Simple preprocessing: split each comma-separated signal string into one float column per sample point
train_list = []

for items in train.values:
    train_list.append([items[0]] + [float(i) for i in items[1].split(',')] + [items[2]])

train = pd.DataFrame(np.array(train_list))
train.columns = ['id'] + ['s_'+str(i) for i in range(len(train_list[0])-2)] + ['label']
train = reduce_mem_usage(train)

test_list=[]
for items in test.values:
    test_list.append([items[0]] + [float(i) for i in items[1].split(',')])

test = pd.DataFrame(np.array(test_list))
test.columns = ['id'] + ['s_'+str(i) for i in range(len(test_list[0])-1)]
test = reduce_mem_usage(test)

Memory usage of dataframe is 157.93 MB
Memory usage after optimization is: 39.67 MB
Decreased by 74.9%
Memory usage of dataframe is 31.43 MB
Memory usage after optimization is: 7.90 MB
Decreased by 74.9%
train.head()
    id       s_0       s_1       s_2       s_3       s_4       s_5       s_6       s_7       s_8  ...  s_196  s_197  s_198  s_199  s_200  s_201  s_202  s_203  s_204  label
0  0.0  0.991211  0.943359  0.764648  0.618652  0.379639  0.190796  0.040222  0.026001  0.031708  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
1  1.0  0.971680  0.929199  0.572754  0.178467  0.122986  0.132324  0.094421  0.089600  0.030487  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
2  2.0  1.000000  0.958984  0.701172  0.231812  0.000000  0.080688  0.128418  0.187500  0.280762  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    2.0
3  3.0  0.975586  0.934082  0.659668  0.249878  0.237061  0.281494  0.249878  0.249878  0.241455  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
4  4.0  0.000000  0.055817  0.261230  0.359863  0.433105  0.453613  0.499023  0.542969  0.616699  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    2.0

5 rows × 207 columns

test.head()
         id       s_0       s_1       s_2       s_3       s_4       s_5       s_6       s_7       s_8  ...     s_195     s_196     s_197     s_198     s_199     s_200     s_201     s_202     s_203    s_204
0  100000.0  0.991699  1.000000  0.631836  0.136230  0.041412  0.102722  0.120850  0.123413  0.107910  ...  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000
1  100001.0  0.607422  0.541504  0.340576  0.000000  0.090698  0.164917  0.195068  0.168823  0.198853  ...  0.389893  0.386963  0.367188  0.364014  0.360596  0.357178  0.350586  0.350586  0.350586  0.36377
2  100002.0  0.975098  0.670898  0.686523  0.708496  0.718750  0.716797  0.720703  0.701660  0.596680  ...  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000
3  100003.0  0.995605  0.916992  0.520996  0.000000  0.221802  0.404053  0.490479  0.527344  0.518066  ...  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000
4  100004.0  1.000000  0.888184  0.745605  0.531738  0.380371  0.224609  0.091125  0.057648  0.003914  ...  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.00000

5 rows × 206 columns
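The row-by-row loop used above to split the signal strings can also be written with pandas string operations. The following is a sketch on a tiny made-up frame (the two signals are invented for illustration and are not taken from train.csv). It assumes every signal string holds the same number of values, as the competition description states.

```python
import pandas as pd

# Toy stand-in for train.csv with three sample points per signal.
toy = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.5,0.25,0.0", "1.0,0.75,0.5"],
    "label": [0.0, 2.0],
})

# Split every string on "," into separate columns and convert to float.
signals = toy["heartbeat_signals"].str.split(",", expand=True).astype(float)
signals.columns = ["s_" + str(i) for i in range(signals.shape[1])]

# Reassemble in the same column order as the loop-based version.
wide = pd.concat([toy[["id"]], signals, toy[["label"]]], axis=1)
print(wide.columns.tolist())  # ['id', 's_0', 's_1', 's_2', 'label']
```

This keeps `id` and `label` in their original dtypes instead of passing everything through a single NumPy array.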

Preparing the training and test data

x_train = train.drop(['id','label'], axis=1)
y_train = train['label']
x_test=test.drop(['id'], axis=1)
x_train
            s_0       s_1       s_2       s_3       s_4       s_5       s_6       s_7       s_8       s_9  ...  s_195  s_196  s_197  s_198  s_199  s_200  s_201  s_202  s_203  s_204
0      0.991211  0.943359  0.764648  0.618652  0.379639  0.190796  0.040222  0.026001  0.031708  0.065552  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
1      0.971680  0.929199  0.572754  0.178467  0.122986  0.132324  0.094421  0.089600  0.030487  0.040497  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
2      1.000000  0.958984  0.701172  0.231812  0.000000  0.080688  0.128418  0.187500  0.280762  0.328369  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
3      0.975586  0.934082  0.659668  0.249878  0.237061  0.281494  0.249878  0.249878  0.241455  0.230713  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
4      0.000000  0.055817  0.261230  0.359863  0.433105  0.453613  0.499023  0.542969  0.616699  0.676758  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
...         ...       ...       ...       ...       ...       ...       ...       ...       ...       ...  ...    ...    ...    ...    ...    ...    ...    ...    ...    ...    ...
99995  1.000000  0.677734  0.222412  0.257080  0.204712  0.054657  0.026154  0.118164  0.244873  0.328857  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
99996  0.926758  0.906250  0.637207  0.415039  0.374756  0.382568  0.358887  0.341309  0.336426  0.317139  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
99997  0.925781  0.587402  0.633301  0.632324  0.639160  0.614258  0.599121  0.517578  0.403809  0.253174  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
99998  1.000000  0.994629  0.829590  0.458252  0.264160  0.240234  0.213745  0.189331  0.203857  0.210815  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
99999  0.925781  0.916504  0.404297  0.000000  0.262939  0.385498  0.361084  0.332764  0.339844  0.350586  ...    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0

100000 rows × 205 columns

y_train
0        0.0
1        0.0
2        2.0
3        0.0
4        2.0
        ... 
99995    0.0
99996    2.0
99997    3.0
99998    2.0
99999    0.0
Name: label, Length: 100000, dtype: float16

Model training

# Evaluation metric: sum of absolute differences between predicted
# probabilities and one-hot encoded true labels
def abs_sum(y_pre, y_tru):
    y_pre = np.array(y_pre)
    y_tru = np.array(y_tru)
    loss = np.abs(y_pre - y_tru).sum()
    return loss

def cv_model(clf, train_x, train_y, test_x, clf_name):
    folds = 5
    seed = 2021
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    test = np.zeros((test_x.shape[0], 4))  # accumulated test-set probabilities

    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i+1)))
        # Positional indexing, so this also works if the index is not 0..n-1
        trn_x, trn_y = train_x.iloc[train_index], train_y.iloc[train_index]
        val_x, val_y = train_x.iloc[valid_index], train_y.iloc[valid_index]

        if clf_name == "lgb":
            train_matrix = clf.Dataset(trn_x, label=trn_y)
            valid_matrix = clf.Dataset(val_x, label=val_y)

            params = {
                'boosting_type': 'gbdt',
                'objective': 'multiclass',
                'num_class': 4,
                'num_leaves': 2 ** 5,
                'feature_fraction': 0.8,
                'bagging_fraction': 0.8,
                'bagging_freq': 4,
                'learning_rate': 0.1,
                'seed': seed,
                'nthread': 28,  # alias of num_threads; overridden by n_jobs (see the warning in the log)
                'n_jobs': 24,
                'verbose': -1,
            }

            # verbose_eval and early_stopping_rounds are accepted by LightGBM 3.x;
            # LightGBM 4.x removed them in favor of callbacks.
            model = clf.train(params,
                      train_set=train_matrix,
                      valid_sets=valid_matrix,
                      num_boost_round=2000,
                      verbose_eval=100,
                      early_stopping_rounds=200)
            val_pred = model.predict(val_x, num_iteration=model.best_iteration)
            test_pred = model.predict(test_x, num_iteration=model.best_iteration)

        # One-hot encode the validation labels (classes 0..3) so they match
        # the shape of the predicted probability matrix
        val_y = np.eye(4)[np.asarray(val_y).astype(int)]
        print('Predicted probability matrix:')
        print(test_pred)
        test += test_pred
        score = abs_sum(val_pred, val_y)
        cv_scores.append(score)
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    # Average the test-set predictions over the folds
    test = test / kf.n_splits

    return test
def lgb_model(x_train, y_train, x_test):
    lgb_test = cv_model(lgb, x_train, y_train, x_test, "lgb")
    return lgb_test
lgb_test = lgb_model(x_train, y_train, x_test)
************************************ 1 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0525735
[200]	valid_0's multi_logloss: 0.0422444
[300]	valid_0's multi_logloss: 0.0407076
[400]	valid_0's multi_logloss: 0.0420398
Early stopping, best iteration is:
[289]	valid_0's multi_logloss: 0.0405457
Predicted probability matrix:
[[9.99969791e-01 2.85197261e-05 1.00341946e-06 6.85357631e-07]
 [7.93287264e-05 7.69060914e-04 9.99151590e-01 2.00810971e-08]
 [5.75356884e-07 5.04051497e-08 3.15322414e-07 9.99999059e-01]
 ...
 [6.79267940e-02 4.30206297e-04 9.31640185e-01 2.81516302e-06]
 [9.99960477e-01 3.94098074e-05 8.34030725e-08 2.94638661e-08]
 [9.88705846e-01 2.14081630e-03 6.67418381e-03 2.47915423e-03]]
[607.0736049372185]
************************************ 2 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0566626
[200]	valid_0's multi_logloss: 0.0450852
[300]	valid_0's multi_logloss: 0.044078
[400]	valid_0's multi_logloss: 0.0455546
Early stopping, best iteration is:
[275]	valid_0's multi_logloss: 0.0437793
Predicted probability matrix:
[[9.99991401e-01 7.69109547e-06 6.65504756e-07 2.42084688e-07]
 [5.72380482e-05 1.32812809e-03 9.98614607e-01 2.66534396e-08]
 [2.82123411e-06 4.13195205e-07 1.34026965e-06 9.99995425e-01]
 ...
 [6.96398024e-02 6.52459907e-04 9.29685742e-01 2.19960932e-05]
 [9.99972366e-01 2.75069005e-05 7.68142933e-08 5.07415018e-08]
 [9.67263676e-01 7.26154408e-03 2.41533542e-02 1.32142531e-03]]
[607.0736049372185, 623.4313863731124]
************************************ 3 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0498722
[200]	valid_0's multi_logloss: 0.038028
[300]	valid_0's multi_logloss: 0.0358066
[400]	valid_0's multi_logloss: 0.0361478
[500]	valid_0's multi_logloss: 0.0379597
Early stopping, best iteration is:
[340]	valid_0's multi_logloss: 0.0354344
Predicted probability matrix:
[[9.99972032e-01 2.62406774e-05 1.17282152e-06 5.54230651e-07]
 [1.05242811e-05 6.50215805e-05 9.99924453e-01 6.93812546e-10]
 [1.93240868e-06 1.10384984e-07 3.76773426e-07 9.99997580e-01]
 ...
 [1.34894410e-02 3.84569683e-05 9.86471555e-01 5.46564350e-07]
 [9.99987431e-01 1.25532882e-05 1.03902298e-08 5.46727770e-09]
 [9.78722948e-01 1.06329839e-02 6.94192038e-03 3.70214810e-03]]
[607.0736049372185, 623.4313863731124, 508.02381607269535]
************************************ 4 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0564768
[200]	valid_0's multi_logloss: 0.0448698
[300]	valid_0's multi_logloss: 0.0446719
[400]	valid_0's multi_logloss: 0.0470399
Early stopping, best iteration is:
[250]	valid_0's multi_logloss: 0.0438853
Predicted probability matrix:
[[9.99979692e-01 1.70821979e-05 1.27048476e-06 1.95571841e-06]
 [5.66207785e-05 4.02275314e-04 9.99541086e-01 1.82828519e-08]
 [2.62267451e-06 3.58613522e-07 4.78645006e-06 9.99992232e-01]
 ...
 [4.56636552e-02 5.69497433e-04 9.53758468e-01 8.37980573e-06]
 [9.99896785e-01 1.02796802e-04 2.46636563e-07 1.72061021e-07]
 [8.70911669e-01 1.73790185e-02 1.04478175e-01 7.23113697e-03]]
[607.0736049372185, 623.4313863731124, 508.02381607269535, 660.4867407547266]
************************************ 5 ************************************
[LightGBM] [Warning] num_threads is set with nthread=28, will be overridden by n_jobs=24. Current value: num_threads=24
Training until validation scores don't improve for 200 rounds
[100]	valid_0's multi_logloss: 0.0506398
[200]	valid_0's multi_logloss: 0.0396422
[300]	valid_0's multi_logloss: 0.0381065
[400]	valid_0's multi_logloss: 0.0390162
[500]	valid_0's multi_logloss: 0.0414986
Early stopping, best iteration is:
[324]	valid_0's multi_logloss: 0.0379497
Predicted probability matrix:
[[9.99993352e-01 6.02902202e-06 1.13002685e-07 5.06277302e-07]
 [1.03959552e-05 5.03778956e-04 9.99485820e-01 5.07638601e-09]
 [1.92568065e-07 5.07155306e-08 4.94690856e-08 9.99999707e-01]
 ...
 [8.83103121e-03 2.51969353e-05 9.91142776e-01 9.96143937e-07]
 [9.99984791e-01 1.51997858e-05 5.62426491e-09 3.80450197e-09]
 [9.86084001e-01 8.75968498e-04 1.09742304e-02 2.06580027e-03]]
[607.0736049372185, 623.4313863731124, 508.02381607269535, 660.4867407547266, 539.2160054696064]
lgb_scotrainre_list: [607.0736049372185, 623.4313863731124, 508.02381607269535, 660.4867407547266, 539.2160054696064]
lgb_score_mean: 587.6463107214719
lgb_score_std: 55.944536405714565
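Each fold score above is a sum over that fold's validation samples, so it scales with the fold size. Dividing by the number of validation samples gives a per-signal figure that is easier to interpret. The sketch below recomputes the summary from the five fold scores printed above, assuming 20,000 validation samples per fold (100,000 training rows split into 5 folds).

```python
import numpy as np

# Fold scores copied from the training log.
fold_scores = [
    607.0736049372185,
    623.4313863731124,
    508.02381607269535,
    660.4867407547266,
    539.2160054696064,
]
samples_per_fold = 100_000 // 5   # assumption: equal-sized validation folds

mean_score = float(np.mean(fold_scores))
std_score = float(np.std(fold_scores))
per_signal = mean_score / samples_per_fold

print(round(mean_score, 2), round(std_score, 2), round(per_signal, 4))
# 587.65 55.94 0.0294
```

Since a single signal contributes between 0 and 2 to abs-sum, an average of about 0.029 per signal means the predicted probabilities are close to the one-hot labels for most validation samples.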
lgb_test
array([[9.99981254e-01, 1.71125438e-05, 8.45046636e-07, 7.88733736e-07],
       [4.28215579e-05, 6.13652971e-04, 9.99343511e-01, 1.41575174e-08],
       [1.62884845e-06, 1.96662878e-07, 1.37365693e-06, 9.99996801e-01],
       ...,
       [4.11101448e-02, 3.43163508e-04, 9.58539745e-01, 6.94675406e-06],
       [9.99960370e-01, 3.94933168e-05, 8.45736848e-08, 5.23076338e-08],
       [9.58337628e-01, 7.65806626e-03, 3.06443728e-02, 3.35993298e-03]])
temp=pd.DataFrame(lgb_test)
temp
              0             1             2             3
0      0.999981  1.711254e-05  8.450466e-07  7.887337e-07
1      0.000043  6.136530e-04  9.993435e-01  1.415752e-08
2      0.000002  1.966629e-07  1.373657e-06  9.999968e-01
3      0.999970  1.909713e-05  1.097002e-05  3.576703e-08
4      0.999983  1.769712e-06  1.482817e-05  1.966254e-07
...         ...           ...           ...           ...
19995  0.998096  3.060176e-04  1.085313e-04  1.489757e-03
19996  0.999846  1.436305e-04  1.074898e-05  8.837766e-08
19997  0.041110  3.431635e-04  9.585397e-01  6.946754e-06
19998  0.999960  3.949332e-05  8.457368e-08  5.230763e-08
19999  0.958338  7.658066e-03  3.064437e-02  3.359933e-03

20000 rows × 4 columns

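result=pd.read_csv('sample_submit.csv',encoding='gbk')
# Note: the id column is NaN in the output below, so the ids are apparently
# not populated from sample_submit.csv; fill result['id'] from the test-set
# ids before submitting.
result['label_0']=temp[0]
result['label_1']=temp[1]
result['label_2']=temp[2]
result['label_3']=temp[3]
result.to_csv('submit.csv',index=False)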
submit_data = pd.read_csv('submit.csv')
submit_data
        id   label_0       label_1       label_2       label_3
0      NaN  0.999981  1.711254e-05  8.450466e-07  7.887337e-07
1      NaN  0.000043  6.136530e-04  9.993435e-01  1.415752e-08
2      NaN  0.000002  1.966629e-07  1.373657e-06  9.999968e-01
3      NaN  0.999970  1.909713e-05  1.097002e-05  3.576703e-08
4      NaN  0.999983  1.769712e-06  1.482817e-05  1.966254e-07
...    ...       ...           ...           ...           ...
19995  NaN  0.998096  3.060176e-04  1.085313e-04  1.489757e-03
19996  NaN  0.999846  1.436305e-04  1.074898e-05  8.837766e-08
19997  NaN  0.041110  3.431635e-04  9.585397e-01  6.946754e-06
19998  NaN  0.999960  3.949332e-05  8.457368e-08  5.230763e-08
19999  NaN  0.958338  7.658066e-03  3.064437e-02  3.359933e-03

20000 rows × 5 columns
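The id column in the output above is NaN, which suggests the ids were not carried over from the test set. One way to avoid this is to build the submission frame directly from the test ids and the probability matrix. The sketch below assumes the submission needs the columns id, label_0, ..., label_3 in test-set row order. The ids and probabilities are small made-up stand-ins for `test['id']` and `lgb_test`.

```python
import numpy as np
import pandas as pd

# Stand-ins: ids as stored after reduce_mem_usage (float32) and a
# 3 x 4 probability matrix in the same row order.
test_ids = np.array([100000.0, 100001.0, 100002.0], dtype=np.float32)
probs = np.array([
    [0.97, 0.01, 0.01, 0.01],
    [0.02, 0.03, 0.94, 0.01],
    [0.00, 0.00, 0.00, 1.00],
])

# One label_k column per class, then the ids restored to integers in front.
submit = pd.DataFrame(probs, columns=["label_" + str(k) for k in range(probs.shape[1])])
submit.insert(0, "id", test_ids.astype(np.int64))

# Basic sanity checks before writing the file.
assert not submit["id"].isna().any()
assert np.allclose(submit.filter(like="label_").sum(axis=1), 1.0)
print(submit)
```

With the real data, `submit.to_csv('submit.csv', index=False)` would then write the file in the same layout as before, with the ids filled in.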


