New User Prediction Challenge: Annotated Baseline
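Background: the data consists of about 620,000 training rows and 200,000 test rows, with 13 fields in total. uuid is the unique sample identifier, and eid is the access-behavior ID. udmap holds the behavior attributes; its keys key1 through key9 represent different attributes, such as project name and project id. common_ts is the time at which the application access record occurred (a millisecond timestamp). The remaining fields x1 through x8 are anonymized user attributes. The target field is the prediction target: whether the user is a new user.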
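The field descriptions above suggest two obvious preprocessing steps: expanding udmap into one column per key, and turning common_ts into a datetime. The following is a minimal sketch of that idea, not part of the original baseline. It assumes udmap is either a JSON object string or the literal `unknown` (the two forms visible in the sample rows), and the helper name `parse_udmap` and the column name `common_ts_dt` are hypothetical.

```python
import json

import pandas as pd

# Two hand-copied sample values in the documented formats (not the full dataset).
sample = pd.DataFrame({
    "udmap": ['{"key3":"67804","key2":"650"}', "unknown"],
    "common_ts": [1689673468244, 1689407393040],
})


def parse_udmap(raw):
    """Return key1..key9 as a dict; absent keys and unparsable values map to None."""
    try:
        parsed = json.loads(raw)
    except (TypeError, ValueError):
        # 'unknown' is not valid JSON, so it ends up here.
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    return {f"key{i}": parsed.get(f"key{i}") for i in range(1, 10)}


# One column per key, aligned with the original rows.
udmap_cols = pd.DataFrame(
    [parse_udmap(value) for value in sample["udmap"]], index=sample.index
)

# common_ts is a millisecond timestamp, so unit="ms" is required.
sample["common_ts_dt"] = pd.to_datetime(sample["common_ts"], unit="ms")

features = pd.concat([sample, udmap_cols], axis=1)
print(features[["common_ts_dt", "key2", "key3"]])
```

Keeping absent keys as missing values, rather than filling them with 0, leaves the choice of imputation to a later step; a tree model may also benefit from an explicit "udmap was unknown" flag.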
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'
1. Import the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
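2. Function for reading the CSV data files
def ReadData(path):
    # path is concatenated directly, so it must end with a path separator
    train_data = pd.read_csv(path + 'train.csv')
    test_data = pd.read_csv(path + 'test.csv')
    return train_data, test_data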
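Because `ReadData` builds file names by string concatenation, it only works when the directory argument ends with a separator. A possible variant, sketched here under the assumption that both files sit directly in one directory, joins paths with `pathlib` instead. The name `read_data` and the throwaway demo files are hypothetical and only illustrate the behavior.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_data(data_dir):
    """Load train.csv and test.csv from data_dir; a trailing slash is optional."""
    data_dir = Path(data_dir)
    train_data = pd.read_csv(data_dir / "train.csv")
    test_data = pd.read_csv(data_dir / "test.csv")
    return train_data, test_data


# Demo on temporary files so the sketch runs without the competition data.
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"uuid": [0, 1], "target": [0, 1]}).to_csv(
        Path(tmp) / "train.csv", index=False
    )
    pd.DataFrame({"uuid": [2]}).to_csv(Path(tmp) / "test.csv", index=False)
    demo_train, demo_test = read_data(tmp)

print(demo_train.shape, demo_test.shape)
```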
2.1 Initial look at the data
train_data, test_data = ReadData('用户新增预测挑战赛公开数据/')
train_data.head()
train_data.info()
train_data.describe()
|   | uuid | eid | udmap | common_ts | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | target |
|---|------|-----|-------|-----------|----|----|----|----|----|----|----|----|--------|
| 0 | 0 | 26 | {"key3":"67804","key2":"650"} | 1689673468244 | 4 | 0 | 41 | 107 | 206 | 1 | 0 | 1 | 0 |
| 1 | 1 | 26 | {"key3":"67804","key2":"484"} | 1689082941469 | 4 | 0 | 41 | 24 | 283 | 4 | 8 | 1 | 0 |
| 2 | 2 | 8 | unknown | 1689407393040 | 4 | 0 | 41 | 71 | 288 | 4 | 7 | 1 | 0 |
| 3 | 3 | 11 | unknown | 1689467815688 | 1 | 3 | 41 | 17 | 366 | 1 | 6 | 1 | 0 |
| 4 | 4 | 26 | {"key3":"67804","key2":"650"} | 1689491751442 | 0 | 3 | 41 | 92 | 383 | 4 | 8 | 1 | 0 |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 620356 entries, 0 to 620355
Data columns (total 13 columns):
 #   Column     Non-Null Count   Dtype
---  ------     --------------   -----
 0   uuid       620356 non-null  int64
 1   eid        620356 non-null  int64
 2   udmap      620356 non-null  object
 3   common_ts  620356 non-null  int64
 4   x1         620356 non-null  int64
 5   x2         620356 non-null  int64
 6   x3         620356 non-null  int64
 7   x4         620356 non-null  int64
 8   x5         620356 non-null  int64
 9   x6         620356 non-null  int64
 10  x7         620356 non-null  int64
 11  x8         620356 non-null  int64
 12  target     620356 non-null  int64
dtypes: int64(12), object(1)
memory usage: 61.5+ MB
|   | uuid | eid | common_ts | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | target |
|---|------|-----|-----------|----|----|----|----|----|----|----|----|--------|
| count | 620356.000000 | 620356.000000 | 6.203560e+05 | 620356.000000 | 620356.000000 | 620356.000000 | 620356.000000 | 620356.000000 | 620356.000000 | 620356.000000 | 620356.000000 | 620356.000000 |
| mean | 310177.500000 | 22.148287 | 1.689317e+12 | 2.675723 | 1.106350 | 40.974499 | 82.860080 | 224.909096 | 2.901681 | 5.863720 | 0.855459 |
|