Competition Overview
Data provided to contestants
Item attribute table: the data covers 2,840,536 items. For most of them we are given the item's category id, store id, and an encrypted price, where the encryption function f(x) is monotonically increasing.
Training data: purchase records of users from country xx and of part A of the users from country yy. Overall statistics:

Country | Records | Buyers
---|---|---
xx | 10635642 | 670631
yy | 2232867 | 138678
Test data: purchase records of part B of the users from country yy, with each user's last record removed. Overall statistics:

Country | Records | Buyers
---|---|---
yy | 166832 | 11398

The item attribute table, training data, and test data correspond to the files item_attr, train, and test.
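Since the price encryption f is monotonically increasing, the ordering of prices — and therefore any comparison- or rank-based feature — is unchanged by encryption. A toy illustration with a made-up monotone f (the real one is not disclosed):

```python
# Toy illustration: a strictly increasing f preserves price ordering, so
# rank-based features computed on encrypted prices equal those on raw prices.
raw_prices = [19.9, 5.0, 120.0, 42.5]

def f(x):
    # hypothetical monotone increasing "encryption"; the real f is not disclosed
    return 3 * x + 7

encrypted = [f(x) for x in raw_prices]

def ranks(xs):
    # rank of each element within the list, 0 = smallest
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

print(ranks(raw_prices) == ranks(encrypted))  # -> True
```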
Data format:
Both the training data and the test data have the following format:

buyer_country_id | buyer_admin_id | item_id | create_order_time | irank
---|---|---|---|---
xx | 817731 | 4033525 | 2018-06-12 07:12:58 | 1
xx | 817731 | 98120 | 2018-06-11 07:12:58 | 2
The fields are:
buyer_country_id: buyer's country id; only two values, 'xx' and 'yy'
buyer_admin_id: buyer id
item_id: item id
create_order_time: order creation time
irank: reverse chronological rank of all records belonging to a buyer (1 = most recent)
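The irank column can be reproduced from the timestamps alone; a sketch on toy data using the column names above (values are invented):

```python
import pandas as pd

# Toy purchases for one buyer; irank = 1 for the most recent order.
toy = pd.DataFrame({
    'buyer_admin_id': [817731, 817731, 817731],
    'create_order_time': pd.to_datetime(
        ['2018-06-12 07:12:58', '2018-06-11 07:12:58', '2018-06-10 09:00:00']),
})
# Rank timestamps descending within each buyer: the newest record gets irank 1.
toy['irank'] = (toy.groupby('buyer_admin_id')['create_order_time']
                   .rank(ascending=False, method='first').astype(int))
print(toy['irank'].tolist())  # -> [1, 2, 3]
```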
Dataset characteristics:
1) every user has at least 7 purchase records;
2) the item in each test user's last purchase record is guaranteed to appear in the training data;
3) a small number of users have purchase records in both countries; these records are ignored during evaluation.
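Property (2) can be sanity-checked with `isin`; a toy sketch with made-up ids:

```python
import pandas as pd

train_items = pd.Series([101, 202, 303], name='item_id')  # items seen in train
test_last = pd.DataFrame({'buyer_admin_id': [1, 2],
                          'item_id': [202, 303]})          # each user's last item
# Every last-purchase item should already occur in the training data.
all_seen = test_last['item_id'].isin(train_items).all()
print(bool(all_seen))  # -> True
```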
Required submission
For each user in part B of country yy, the Top 30 predictions for that user's last purchase.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc
%matplotlib inline
# Disable scientific notation in output
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# Load the competition data
item = pd.read_csv('./Antai_AE_round1_item_attr_20190626.csv')
train = pd.read_csv('./Antai_AE_round1_train_20190626.csv')
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
submit = pd.read_csv('./Antai_AE_round1_submit_20190715.csv')
Data preprocessing
- Concatenate the train and test files
- Extract date parts (day, hour, etc.)
- Join item price, category, and store information
- Downcast each column to the smallest dtype that can hold it, to reduce memory
- Save to HDF5 for faster reloading
# Concatenate the train and test files
df = pd.concat([train.assign(is_train=1), test.assign(is_train=0)])
# Extract date parts
df['create_order_time'] = pd.to_datetime(df['create_order_time'])
df['date'] = df['create_order_time'].dt.date
df['day'] = df['create_order_time'].dt.day
df['hour'] = df['create_order_time'].dt.hour
df = pd.merge(df, item, how='left', on='item_id')
memory = df.memory_usage().sum() / 1024**2
print('Before memory usage of properties dataframe is :', memory, " MB")
# Downcast each column to the smallest dtype that can hold its values, to save memory
dtype_dict = {'buyer_admin_id': 'int32',
              'item_id': 'int32',
              'store_id': pd.Int32Dtype(),
              'irank': 'int32',  # irank reaches 42751 for the heaviest user, so int16 would overflow
              'item_price': pd.Int16Dtype(),
              'cate_id': pd.Int16Dtype(),
              'is_train': 'int8',
              'day': 'int8',
              'hour': 'int8',
              }
df = df.astype(dtype_dict)
memory = df.memory_usage().sum() / 1024**2
print('After memory usage of properties dataframe is :', memory, " MB")
del train,test; gc.collect()
Before memory usage of properties dataframe is : 1093.9694747924805 MB
After memory usage of properties dataframe is : 596.7106781005859 MB
0
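As an alternative to the hand-written dtype dict above, pandas can also pick the smallest integer dtype per column automatically; a hedged sketch (a generic helper, not part of the original notebook):

```python
import pandas as pd

def downcast_ints(frame: pd.DataFrame) -> pd.DataFrame:
    # Downcast each integer column to the smallest dtype that holds its values.
    out = frame.copy()
    for col in out.select_dtypes(include='integer').columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    return out

demo_int = pd.DataFrame({'irank': range(1, 1001), 'item_id': [2_000_000] * 1000})
small = downcast_ints(demo_int)
print(small.dtypes.astype(str).tolist())  # -> ['int16', 'int32']
```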
# Save to HDF5 for faster reloading; first convert the nullable Int columns back
# through float (ids and prices start at 1, so 0 is a safe NaN sentinel)
for col in ['store_id', 'item_price', 'cate_id']:
    df[col] = df[col].fillna(0).astype(np.int32).replace(0, np.nan)
df.to_hdf('./train_test.h5', '1.0')
/opt/anaconda3/lib/python3.8/site-packages/tables/path.py:137: NaturalNameWarning: object name is not a valid Python identifier: '1.0'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
check_attribute_name(name)
/var/folders/hh/1l1hnwqj0nz_8_nfwygkd3mr0000gn/T/ipykernel_55965/1022970888.py:4: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block5_values] [items->Index(['buyer_country_id', 'date'], dtype='object')]
df.to_hdf('./train_test.h5', '1.0')
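The `PerformanceWarning` above is triggered by the object-dtype columns it names (`buyer_country_id` and the `date` produced by `.dt.date`), which PyTables can only store by pickling. One workaround, sketched on toy data, is to map them to c-typed equivalents first (the 0/1 country encoding is my own choice, not part of the notebook):

```python
import pandas as pd

demo = pd.DataFrame({
    'buyer_country_id': ['xx', 'yy', 'xx'],
    'date': pd.to_datetime(['2018-08-10', '2018-08-03', '2018-08-27']).date,  # object dtype
})
# Map object columns to c-typed equivalents so PyTables need not pickle them:
demo['buyer_country_id'] = (demo['buyer_country_id'] == 'yy').astype('int8')  # xx -> 0, yy -> 1
demo['date'] = pd.to_datetime(demo['date'])                                   # datetime64[ns]
print(demo.dtypes.astype(str).tolist())  # -> ['int8', 'datetime64[ns]']
```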
# With HDF5, reloading drops from ~9 s (CSV) to ~5 s
%%time
df = pd.read_hdf('./train_test.h5', '1.0')
CPU times: user 3.84 s, sys: 1.49 s, total: 5.33 s
Wall time: 5.39 s
%%time
train = pd.read_csv('./Antai_AE_round1_train_20190626.csv')
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
item = pd.read_csv('./Antai_AE_round1_item_attr_20190626.csv')
del train, test; gc.collect()
CPU times: user 8.65 s, sys: 716 ms, total: 9.36 s
Wall time: 9.51 s
3024
Data exploration
- Users, items, stores, categories, and even item prices are numbered with integers starting from 1
- Order timestamps use the format YYYY-mm-dd HH:mm:ss
- The raw files contain no nulls, but some items do not appear in the item attribute table, so their price, category, and store information is missing
df.head()
buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | xx | 8362078 | 1 | 2018-08-10 23:49:44 | 12 | 1 | 2018-08-10 | 10 | 23 | 2324.00 | 10013.00 | 4501.00 |
1 | xx | 9694304 | 2 | 2018-08-03 23:55:07 | 9 | 1 | 2018-08-03 | 3 | 23 | 3882.00 | 4485.00 | 2751.00 |
2 | yy | 101887 | 3 | 2018-08-27 08:31:26 | 3 | 1 | 2018-08-27 | 27 | 8 | 155.00 | 8341.00 | 656.00 |
3 | xx | 8131786 | 3 | 2018-08-31 06:00:19 | 9 | 1 | 2018-08-31 | 31 | 6 | 155.00 | 8341.00 | 656.00 |
4 | xx | 9778613 | 5 | 2018-08-21 06:01:56 | 14 | 1 | 2018-08-21 | 21 | 6 | 1191.00 | 1949.00 | 1689.00 |
# Count nulls per column
for pdf in [df, item]:
for col in pdf.columns:
print(col, pdf[col].isnull().sum())
buyer_country_id 0
buyer_admin_id 0
item_id 0
create_order_time 0
irank 0
is_train 0
date 0
day 0
hour 0
cate_id 26119
store_id 26119
item_price 26119
item_id 0
cate_id 0
store_id 0
item_price 0
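The 26119 nulls in cate_id, store_id, and item_price all stem from order rows whose item_id is absent from item_attr. A toy sketch showing how `merge(..., indicator=True)` exposes such rows (values are made up):

```python
import pandas as pd

orders = pd.DataFrame({'item_id': [1, 2, 3, 4]})
item = pd.DataFrame({'item_id': [1, 2, 3], 'item_price': [10, 20, 30]})
merged = orders.merge(item, how='left', on='item_id', indicator=True)
# Rows whose item is missing from the attribute table:
n_missing = (merged['_merge'] == 'left_only').sum()
print(int(n_missing))  # -> 1
```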
df.describe()
buyer_admin_id | item_id | create_order_time | irank | is_train | day | hour | cate_id | store_id | item_price | |
---|---|---|---|---|---|---|---|---|---|---|
count | 13035341.00 | 13035341.00 | 13035341 | 13035341.00 | 13035341.00 | 13035341.00 | 13035341.00 | 13009222.00 | 13009222.00 | 13009222.00 |
mean | 6527293.86 | 6522519.78 | 2018-08-18 23:20:45.258000384 | 143.62 | 0.99 | 18.62 | 9.06 | 1498.53 | 40575.67 | 1099.75 |
min | 1.00 | 1.00 | 2018-07-13 05:54:54 | -32768.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 |
25% | 3269515.00 | 3261386.00 | 2018-08-10 19:40:33 | 4.00 | 1.00 | 10.00 | 4.00 | 616.00 | 20648.00 | 123.00 |
50% | 6528429.00 | 6522878.00 | 2018-08-19 13:55:45 | 8.00 | 1.00 | 19.00 | 8.00 | 1505.00 | 39368.00 | 246.00 |
75% | 9787265.00 | 9784900.00 | 2018-08-27 11:57:00 | 16.00 | 1.00 | 27.00 | 13.00 | 2010.00 | 59273.00 | 700.00 |
max | 13046721.00 | 13046734.00 | 2018-08-31 23:59:57 | 32767.00 | 1.00 | 31.00 | 23.00 | 4243.00 | 95105.00 | 20230.00 |
std | 3764280.24 | 3765432.09 | NaN | 1573.84 | 0.11 | 9.21 | 6.56 | 903.26 | 24284.46 | 2880.00 |

Note: the irank min/max of -32768/32767 are an int16 overflow artifact from this run — the heaviest user has 42751 records, so irank needs at least int32.
item.describe()
item_id | cate_id | store_id | item_price | |
---|---|---|---|---|
count | 2832669.00 | 2832669.00 | 2832669.00 | 2832669.00 |
mean | 6429138.00 | 1481.10 | 40256.46 | 1124.00 |
std | 3725431.44 | 923.09 | 24370.92 | 2110.62 |
min | 1.00 | 1.00 | 1.00 | 1.00 |
25% | 3224114.00 | 600.00 | 19850.00 | 180.00 |
50% | 6391845.00 | 1499.00 | 38954.00 | 400.00 |
75% | 9636216.00 | 2050.00 | 58406.00 | 1200.00 |
max | 13046734.00 | 4243.00 | 95105.00 | 20230.00 |
Data inspection
Training set vs. test set
# boolean masks selecting training / test rows
train = df['is_train']==1
test = df['is_train']==0
train_count = len(df[train])
print('Training set size:', train_count)
test_count = len(df[test])
print('Test set size:', test_count)
print('Train/test ratio:', train_count/test_count)
Training set size: 12868509
Test set size: 166832
Train/test ratio: 77.13453653975256
# buyer_country_id: sample counts by country
def groupby_cnt_ratio(df, col):
    if isinstance(col, str):
        col = [col]
    key = ['is_train', 'buyer_country_id'] + col
    # count rows per key, and each count's share within its (is_train, country) group
    cnt_stat = df.groupby(key).size().to_frame('count')
    ratio_stat = (cnt_stat / cnt_stat.groupby(['is_train', 'buyer_country_id']).sum()).rename(columns={'count': 'count_ratio'})
    return pd.merge(cnt_stat, ratio_stat, on=key, how='outer').sort_values(by=['count'], ascending=False)
groupby_cnt_ratio(df, [])
count | count_ratio | ||
---|---|---|---|
is_train | buyer_country_id | ||
1 | xx | 10635642 | 1.00 |
yy | 2232867 | 1.00 | |
0 | yy | 166832 | 1.00 |
plt.figure(figsize=(8,6))
sns.countplot(x='is_train', data = df, palette=['red', 'blue'], hue='buyer_country_id', order=[1, 0])
plt.xticks(np.arange(2), ('training set', 'testing set'))
plt.xlabel('data')
plt.title('country id');

- The training set covers two countries: country xx has 10635642 samples (~83%), country yy only 2232867 (~17%)
- The test set contains 166832 records from country yy; the training set holds about 13 times as many yy samples as the test set. As the task intends, the goal is to use plentiful data from a mature country to predict the purchases of users in a less mature one.
buyer_admin_id: user ids
print('Users in the training set:', len(df[train]['buyer_admin_id'].unique()))
print('Users in the test set:', len(df[test]['buyer_admin_id'].unique()))
Users in the training set: 809213
Users in the test set: 11398
union = list(set(df[train]['buyer_admin_id'].unique()).intersection(set(df[test]['buyer_admin_id'].unique())))
print('6 users appear in both the training and test sets, ids:', union)
6 users appear in both the training and test sets, ids: [12647969, 13000419, 3106927, 12858772, 12929117, 12368445]
# combine the masks with & in a single indexer to avoid the boolean-reindexing warning
df[train & df['buyer_admin_id'].isin(union)].sort_values(by=['buyer_admin_id','irank']).head(10)
buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7546704 | xx | 3106927 | 7645546 | 2018-08-30 02:49:22 | 1 | 1 | 2018-08-30 | 30 | 2 | 1164.00 | 73781.00 | 770.00 |
4582539 | xx | 3106927 | 4639151 | 2018-08-30 02:49:22 | 2 | 1 | 2018-08-30 | 30 | 2 | 2214.00 | 53190.00 | 1669.00 |
11953258 | xx | 3106927 | 12122118 | 2018-08-30 02:49:22 | 3 | 1 | 2018-08-30 | 30 | 2 | 236.00 | 73781.00 | 884.00 |
255625 | xx | 3106927 | 258860 | 2018-08-30 02:49:22 | 4 | 1 | 2018-08-30 | 30 | 2 | 189.00 | 24221.00 | 900.00 |
7402817 | xx | 3106927 | 7499372 | 2018-08-30 02:49:22 | 5 | 1 | 2018-08-30 | 30 | 2 | 2214.00 | 32535.00 | 2714.00 |
9483312 | xx | 3106927 | 9613063 | 2018-08-30 02:49:22 | 6 | 1 | 2018-08-30 | 30 | 2 | 3069.00 | 73781.00 | 110.00 |
2740080 | xx | 3106927 | 2773189 | 2018-08-27 08:18:23 | 10 | 1 | 2018-08-27 | 27 | 8 | 1865.00 | 49499.00 | 20067.00 |
12152249 | xx | 3106927 | 12324030 | 2018-08-27 07:15:05 | 11 | 1 | 2018-08-27 | 27 | 7 | 880.00 | 92968.00 | 1764.00 |
2201292 | xx | 3106927 | 2227720 | 2018-08-19 02:36:36 | 12 | 1 | 2018-08-19 | 19 | 2 | 1164.00 | 6404.00 | 1900.00 |
6717641 | xx | 3106927 | 6804187 | 2018-08-19 02:33:39 | 13 | 1 | 2018-08-19 | 19 | 2 | 1164.00 | 52421.00 | 230.00 |
df[test & df['buyer_admin_id'].isin(union)].sort_values(by=['buyer_admin_id','irank']).head(3)
buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
13016145 | yy | 3106927 | 202354 | 2018-08-30 02:48:40 | 7 | 0 | 2018-08-30 | 30 | 2 | 642.00 | 24221.00 | 989.00 |
13008981 | yy | 3106927 | 6994414 | 2018-08-29 05:48:06 | 8 | 0 | 2018-08-29 | 29 | 5 | 7.00 | 37411.00 | 1521.00 |
13008982 | yy | 3106927 | 6994414 | 2018-08-29 05:48:06 | 9 | 0 | 2018-08-29 | 29 | 5 | 7.00 | 37411.00 | 1521.00 |
# NB: ids passed as strings never match the int32 buyer_admin_id column, hence the empty result
df[(train) & (df['irank']==1) & (df['buyer_admin_id'].isin(['12858772','3106927','12368445']))]
Empty DataFrame
Columns: [buyer_country_id, buyer_admin_id, item_id, create_order_time, irank, is_train, date, day, hour, cate_id, store_id, item_price]
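One pitfall worth demonstrating: `buyer_admin_id` is stored as an integer, so `isin` with string ids matches nothing, which is why the frame above is empty. A minimal sketch:

```python
import pandas as pd

ids = pd.Series([12858772, 3106927, 12368445], dtype='int32')
print(ids.isin(['3106927']).any())  # -> False: a string never equals an int
print(ids.isin([3106927]).any())    # -> True
```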
Distribution of records per user
admin_cnt = groupby_cnt_ratio(df, 'buyer_admin_id')
admin_cnt.groupby(['is_train','buyer_country_id']).head(3)
count | count_ratio | |||
---|---|---|---|---|
is_train | buyer_country_id | buyer_admin_id | ||
1 | xx | 10828801 | 42751 | 0.00 |
10951390 | 23569 | 0.00 | ||
11223615 | 19933 | 0.00 | ||
yy | 2381782 | 3480 | 0.00 | |
2333316 | 1944 | 0.00 | ||
2365356 | 1686 | 0.00 | ||
0 | yy | 2041038 | 1386 | 0.01 |
2070430 | 399 | 0.00 | ||
1144848 | 286 | 0.00 |
# Purchase records per user: max, min, median
admin_cnt.groupby(['is_train','buyer_country_id'])['count'].agg(['max','min','median'])
max | min | median | ||
---|---|---|---|---|
is_train | buyer_country_id | |||
0 | yy | 1386 | 7 | 11.00 |
1 | xx | 42751 | 8 | 11.00 |
yy | 3480 | 8 | 12.00 |
- The training set covers 809213 users; user 10828801 tops the list with 42751 purchase records, and every training user has at least 8 records
- The test set covers 11398 users; user 2041038 leads with 1386 purchase records, and every test user has at least 7 records
Note: test users can have as few as 7 records because each user's last record was removed
Most users have fewer than 50 records, while a few exceed 10000; the next step explores the distribution in more detail
fig, ax = plt.subplots(1, 2, figsize=(16,6))
ax[0].set(xlabel='records per user')
sns.kdeplot(admin_cnt.loc[(1, 'xx')]['count'].values, ax=ax[0]).set_title('training set: records per user (xx)')
ax[1].set(xlabel='records per user')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].loc[(1, 'yy')]['count'].values, ax=ax[1]).set_title('records per user (yy)')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].loc[(0, 'yy')]['count'].values, ax=ax[1])
# add the legend after both curves exist so the labels attach correctly
ax[1].legend(labels=['training set', 'test set'], loc='upper right');
# Chinese display labels: 记录数 = records per user, 占比 = share,
# 人数 = number of users, 人数占比 = share of users
admin_cnt.columns = ['记录数', '占比']
admin_user_cnt = groupby_cnt_ratio(admin_cnt, '记录数')
admin_user_cnt.columns = ['人数', '人数占比']
admin_user_cnt.head()
人数 | 人数占比 | |||
---|---|---|---|---|
is_train | buyer_country_id | 记录数 | ||
1 | xx | 8 | 118155 | 0.18 |
9 | 91757 | 0.14 | ||
10 | 72936 | 0.11 | ||
11 | 57678 | 0.09 | ||
12 | 46534 | 0.07 |
# Country xx: number of users for each records-per-user value
admin_user_cnt.loc[(1,'xx')][['人数','人数占比']].T
/var/folders/hh/1l1hnwqj0nz_8_nfwygkd3mr0000gn/T/ipykernel_55965/313832335.py:2: PerformanceWarning: indexing past lexsort depth may impact performance.
admin_user_cnt.loc[(1,'xx')][['人数','人数占比']].T
记录数 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | ... | 521 | 556 | 526 | 528 | 529 | 537 | 545 | 549 | 550 | 554 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
人数 | 118155.00 | 91757.00 | 72936.00 | 57678.00 | 46534.00 | 38114.00 | 31432.00 | 26735.00 | 22352.00 | 18742.00 | ... | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
人数占比 | 0.18 | 0.14 | 0.11 | 0.09 | 0.07 | 0.06 | 0.05 | 0.04 | 0.03 | 0.03 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
2 rows × 506 columns
# Country yy: share of users by records-per-user value, train vs. test
admin_user_cnt.loc[([1,0], 'yy', slice(None))][['人数', '人数占比']].unstack(0).drop(columns='人数').head(10)
人数占比 | |||
---|---|---|---|
is_train | 0 | 1 | |
buyer_country_id | 记录数 | ||
yy | 8 | 0.13 | 0.16 |
9 | 0.10 | 0.13 | |
10 | 0.09 | 0.10 | |
11 | 0.08 | 0.08 | |
12 | 0.06 | 0.07 | |
13 | 0.05 | 0.06 | |
14 | 0.04 | 0.05 | |
15 | 0.04 | 0.04 | |
16 | 0.03 | 0.03 | |
17 | 0.03 | 0.03 |
A simple baseline
# Use each user's 30 most recent purchases as the prediction, most recent first;
# users with fewer than 30 purchases are padded with the best-selling item 5595070
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
tmp = test[test['irank']<=31].sort_values(by=['buyer_country_id', 'buyer_admin_id', 'irank'])[['buyer_admin_id','item_id','irank']]
sub = tmp.set_index(['buyer_admin_id', 'irank']).unstack(-1)
sub.fillna(5595070).astype(int).reset_index().to_csv('./sub.csv', index=False, header=False)
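The `set_index(...).unstack(-1)` step above pivots one row per (user, irank) into one wide row per user; a toy sketch of just that reshape (made-up values):

```python
import pandas as pd

tmp = pd.DataFrame({'buyer_admin_id': [152, 152, 282],
                    'irank': [2, 3, 2],
                    'item_id': [8410857, 7937154, 11721802]})
wide = tmp.set_index(['buyer_admin_id', 'irank']).unstack(-1)
# One row per buyer; columns become ('item_id', irank) pairs, NaN where missing.
print(wide.shape)  # -> (2, 2)
```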
# Final submission file format
sub = pd.read_csv('./sub.csv', header=None)
sub.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 152 | 8410857 | 7937154 | 8472223 | 4016066 | 9891513 | 8064216 | 8351840 | 5595070 | 5595070 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
1 | 282 | 11721802 | 7665423 | 7665423 | 10808393 | 11310708 | 623582 | 6547607 | 2605373 | 688799 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
2 | 321 | 1461800 | 7379845 | 9243286 | 7379845 | 627849 | 5000759 | 11774753 | 10932288 | 4813286 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
3 | 809 | 2347616 | 5707010 | 6339286 | 5492003 | 1207574 | 5707010 | 5492003 | 1207574 | 2262443 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
4 | 870 | 11382694 | 5999244 | 6611583 | 7412272 | 4343647 | 5546383 | 3432696 | 9589237 | 6163411 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
5 rows × 31 columns