Competition Overview
Data provided to contestants
Item attribute table: the data covers 2,840,536 items. For most of them we are given the item's category id, store id, and an encrypted price, where the encryption function f(x) is monotonically increasing.
Training data: purchase records of users from country xx and of part A of the users from country yy. Overall statistics:

Country | Records | Buyers
---|---|---
xx | 10635642 | 670631
yy | 2232867 | 138678
Test data: purchase records of part B of the users from country yy, with each user's last record removed. Overall statistics:

Country | Records | Buyers
---|---|---
yy | 166832 | 11398

The item attribute table, training data, and test data correspond to the files item_attr, train, and test.
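Since the price encryption f is monotonically increasing, the ordering of prices — and therefore any comparison- or rank-based feature — is unchanged by encryption. A toy illustration with a made-up monotone f (the real one is not disclosed):

```python
# Toy illustration: a strictly increasing f preserves price ordering, so
# rank-based features computed on encrypted prices equal those on raw prices.
raw_prices = [19.9, 5.0, 120.0, 42.5]

def f(x):
    # hypothetical monotone increasing "encryption"; the real f is not disclosed
    return 3 * x + 7

encrypted = [f(x) for x in raw_prices]

def ranks(xs):
    # rank of each element within the list, 0 = smallest
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

print(ranks(raw_prices) == ranks(encrypted))  # -> True
```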
Data format:
Both the training data and the test data have the following format:

buyer_country_id | buyer_admin_id | item_id | create_order_time | irank
---|---|---|---|---
xx | 817731 | 4033525 | 2018-06-12 07:12:58 | 1
xx | 817731 | 98120 | 2018-06-11 07:12:58 | 2
The fields are:
buyer_country_id: buyer's country id; only two values, 'xx' and 'yy'
buyer_admin_id: buyer id
item_id: item id
create_order_time: order creation time
irank: reverse chronological rank of all records belonging to a buyer (1 = most recent)
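The irank column can be reproduced from the timestamps alone; a sketch on toy data using the column names above (values are invented):

```python
import pandas as pd

# Toy purchases for one buyer; irank = 1 for the most recent order.
toy = pd.DataFrame({
    'buyer_admin_id': [817731, 817731, 817731],
    'create_order_time': pd.to_datetime(
        ['2018-06-12 07:12:58', '2018-06-11 07:12:58', '2018-06-10 09:00:00']),
})
# Rank timestamps descending within each buyer: the newest record gets irank 1.
toy['irank'] = (toy.groupby('buyer_admin_id')['create_order_time']
                   .rank(ascending=False, method='first').astype(int))
print(toy['irank'].tolist())  # -> [1, 2, 3]
```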
Dataset characteristics:
1) every user has at least 7 purchase records;
2) the item in each test user's last purchase record is guaranteed to appear in the training data;
3) a small number of users have purchase records in both countries; these records are ignored during evaluation.
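Property (2) can be sanity-checked with `isin`; a toy sketch with made-up ids:

```python
import pandas as pd

train_items = pd.Series([101, 202, 303], name='item_id')  # items seen in train
test_last = pd.DataFrame({'buyer_admin_id': [1, 2],
                          'item_id': [202, 303]})          # each user's last item
# Every last-purchase item should already occur in the training data.
all_seen = test_last['item_id'].isin(train_items).all()
print(bool(all_seen))  # -> True
```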
Required submission
For each user in part B of country yy, the Top 30 predictions for that user's last purchase.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gc
%matplotlib inline
# Disable scientific notation in output
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# Load the competition data
item = pd.read_csv('./Antai_AE_round1_item_attr_20190626.csv')
train = pd.read_csv('./Antai_AE_round1_train_20190626.csv')
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
submit = pd.read_csv('./Antai_AE_round1_submit_20190715.csv')
Data preprocessing
- Concatenate the train and test files
- Extract date parts (day, hour, etc.)
- Join item price, category, and store information
- Downcast each column to the smallest dtype that can hold it, to reduce memory
- Save to HDF5 for faster reloading
# Concatenate the train and test files
df = pd.concat([train.assign(is_train=1), test.assign(is_train=0)])
# Extract date parts
df['create_order_time'] = pd.to_datetime(df['create_order_time'])
df['date'] = df['create_order_time'].dt.date
df['day'] = df['create_order_time'].dt.day
df['hour'] = df['create_order_time'].dt.hour
df = pd.merge(df, item, how='left', on='item_id')
memory = df.memory_usage().sum() / 1024**2
print('Before memory usage of properties dataframe is :', memory, " MB")
# Downcast each column to the smallest dtype that can hold its values, to save memory
dtype_dict = {'buyer_admin_id': 'int32',
              'item_id': 'int32',
              'store_id': pd.Int32Dtype(),
              'irank': 'int32',  # irank reaches 42751 for the heaviest user, so int16 would overflow
              'item_price': pd.Int16Dtype(),
              'cate_id': pd.Int16Dtype(),
              'is_train': 'int8',
              'day': 'int8',
              'hour': 'int8',
              }
df = df.astype(dtype_dict)
memory = df.memory_usage().sum() / 1024**2
print('After memory usage of properties dataframe is :', memory, " MB")
del train,test; gc.collect()
Before memory usage of properties dataframe is : 1093.9694747924805 MB
After memory usage of properties dataframe is : 596.7106781005859 MB
0
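As an alternative to the hand-written dtype dict above, pandas can also pick the smallest integer dtype per column automatically; a hedged sketch (a generic helper, not part of the original notebook):

```python
import pandas as pd

def downcast_ints(frame: pd.DataFrame) -> pd.DataFrame:
    # Downcast each integer column to the smallest dtype that holds its values.
    out = frame.copy()
    for col in out.select_dtypes(include='integer').columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    return out

demo_int = pd.DataFrame({'irank': range(1, 1001), 'item_id': [2_000_000] * 1000})
small = downcast_ints(demo_int)
print(small.dtypes.astype(str).tolist())  # -> ['int16', 'int32']
```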
# Save to HDF5 for faster reloading; first convert the nullable Int columns back
# through float (ids and prices start at 1, so 0 is a safe NaN sentinel)
for col in ['store_id', 'item_price', 'cate_id']:
    df[col] = df[col].fillna(0).astype(np.int32).replace(0, np.nan)
df.to_hdf('./train_test.h5', '1.0')
/opt/anaconda3/lib/python3.8/site-packages/tables/path.py:137: NaturalNameWarning: object name is not a valid Python identifier: '1.0'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
check_attribute_name(name)
/var/folders/hh/1l1hnwqj0nz_8_nfwygkd3mr0000gn/T/ipykernel_55965/1022970888.py:4: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block5_values] [items->Index(['buyer_country_id', 'date'], dtype='object')]
df.to_hdf('./train_test.h5', '1.0')
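The `PerformanceWarning` above is triggered by the object-dtype columns it names (`buyer_country_id` and the `date` produced by `.dt.date`), which PyTables can only store by pickling. One workaround, sketched on toy data, is to map them to c-typed equivalents first (the 0/1 country encoding is my own choice, not part of the notebook):

```python
import pandas as pd

demo = pd.DataFrame({
    'buyer_country_id': ['xx', 'yy', 'xx'],
    'date': pd.to_datetime(['2018-08-10', '2018-08-03', '2018-08-27']).date,  # object dtype
})
# Map object columns to c-typed equivalents so PyTables need not pickle them:
demo['buyer_country_id'] = (demo['buyer_country_id'] == 'yy').astype('int8')  # xx -> 0, yy -> 1
demo['date'] = pd.to_datetime(demo['date'])                                   # datetime64[ns]
print(demo.dtypes.astype(str).tolist())  # -> ['int8', 'datetime64[ns]']
```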
# With HDF5, reloading drops from ~9 s (CSV) to ~5 s
%%time
df = pd.read_hdf('./train_test.h5', '1.0')
CPU times: user 3.84 s, sys: 1.49 s, total: 5.33 s
Wall time: 5.39 s
%%time
train = pd.read_csv('./Antai_AE_round1_train_20190626.csv')
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
item = pd.read_csv('./Antai_AE_round1_item_attr_20190626.csv')
del train, test; gc.collect()
CPU times: user 8.65 s, sys: 716 ms, total: 9.36 s
Wall time: 9.51 s
3024
Data exploration
- Users, items, stores, categories, and even item prices are numbered with integers starting from 1
- Order timestamps use the format YYYY-mm-dd HH:mm:ss
- The raw files contain no nulls, but some items do not appear in the item attribute table, so their price, category, and store information is missing
df.head()
buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | xx | 8362078 | 1 | 2018-08-10 23:49:44 | 12 | 1 | 2018-08-10 | 10 | 23 | 2324.00 | 10013.00 | 4501.00 |
1 | xx | 9694304 | 2 | 2018-08-03 23:55:07 | 9 | 1 | 2018-08-03 | 3 | 23 | 3882.00 | 4485.00 | 2751.00 |
2 | yy | 101887 | 3 | 2018-08-27 08:31:26 | 3 | 1 | 2018-08-27 | 27 | 8 | 155.00 | 8341.00 | 656.00 |
3 | xx | 8131786 | 3 | 2018-08-31 06:00:19 | 9 | 1 | 2018-08-31 | 31 | 6 | 155.00 | 8341.00 | 656.00 |
4 | xx | 9778613 | 5 | 2018-08-21 06:01:56 | 14 | 1 | 2018-08-21 | 21 | 6 | 1191.00 | 1949.00 | 1689.00 |
# Count nulls per column
for pdf in [df, item]:
for col in pdf.columns:
print(col, pdf[col].isnull().sum())
buyer_country_id 0
buyer_admin_id 0
item_id 0
create_order_time 0
irank 0
is_train 0
date 0
day 0
hour 0
cate_id 26119
store_id 26119
item_price 26119
item_id 0
cate_id 0
store_id 0
item_price 0
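The 26119 nulls in cate_id, store_id, and item_price all stem from order rows whose item_id is absent from item_attr. A toy sketch showing how `merge(..., indicator=True)` exposes such rows (values are made up):

```python
import pandas as pd

orders = pd.DataFrame({'item_id': [1, 2, 3, 4]})
item = pd.DataFrame({'item_id': [1, 2, 3], 'item_price': [10, 20, 30]})
merged = orders.merge(item, how='left', on='item_id', indicator=True)
# Rows whose item is missing from the attribute table:
n_missing = (merged['_merge'] == 'left_only').sum()
print(int(n_missing))  # -> 1
```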
df.describe()
buyer_admin_id | item_id | create_order_time | irank | is_train | day | hour | cate_id | store_id | item_price | |
---|---|---|---|---|---|---|---|---|---|---|
count | 13035341.00 | 13035341.00 | 13035341 | 13035341.00 | 13035341.00 | 13035341.00 | 13035341.00 | 13009222.00 | 13009222.00 | 13009222.00 |
mean | 6527293.86 | 6522519.78 | 2018-08-18 23:20:45.258000384 | 143.62 | 0.99 | 18.62 | 9.06 | 1498.53 | 40575.67 | 1099.75 |
min | 1.00 | 1.00 | 2018-07-13 05:54:54 | -32768.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 |
25% | 3269515.00 | 3261386.00 | 2018-08-10 19:40:33 | 4.00 | 1.00 | 10.00 | 4.00 | 616.00 | 20648.00 | 123.00 |
50% | 6528429.00 | 6522878.00 | 2018-08-19 13:55:45 | 8.00 | 1.00 | 19.00 | 8.00 | 1505.00 | 39368.00 | 246.00 |
75% | 9787265.00 | 9784900.00 | 2018-08-27 11:57:00 | 16.00 | 1.00 | 27.00 | 13.00 | 2010.00 | 59273.00 | 700.00 |
max | 13046721.00 | 13046734.00 | 2018-08-31 23:59:57 | 32767.00 | 1.00 | 31.00 | 23.00 | 4243.00 | 95105.00 | 20230.00 |
std | 3764280.24 | 3765432.09 | NaN | 1573.84 | 0.11 | 9.21 | 6.56 | 903.26 | 24284.46 | 2880.00 |

Note: the irank min/max of -32768/32767 are an int16 overflow artifact from this run — the heaviest user has 42751 records, so irank needs at least int32.
item.describe()
item_id | cate_id | store_id | item_price | |
---|---|---|---|---|
count | 2832669.00 | 2832669.00 | 2832669.00 | 2832669.00 |
mean | 6429138.00 | 1481.10 | 40256.46 | 1124.00 |
std | 3725431.44 | 923.09 | 24370.92 | 2110.62 |
min | 1.00 | 1.00 | 1.00 | 1.00 |
25% | 3224114.00 | 600.00 | 19850.00 | 180.00 |
50% | 6391845.00 | 1499.00 | 38954.00 | 400.00 |
75% | 9636216.00 | 2050.00 | 58406.00 | 1200.00 |
max | 13046734.00 | 4243.00 | 95105.00 | 20230.00 |
Data inspection
Training set vs. test set
# boolean masks selecting training / test rows
train = df['is_train']==1
test = df['is_train']==0
train_count = len(df[train])
print('Training set size:', train_count)
test_count = len(df[test])
print('Test set size:', test_count)
print('Train/test ratio:', train_count/test_count)
Training set size: 12868509
Test set size: 166832
Train/test ratio: 77.13453653975256
# buyer_country_id: sample counts by country
def groupby_cnt_ratio(df, col):
    if isinstance(col, str):
        col = [col]
    key = ['is_train', 'buyer_country_id'] + col
    # count rows per key, and each count's share within its (is_train, country) group
    cnt_stat = df.groupby(key).size().to_frame('count')
    ratio_stat = (cnt_stat / cnt_stat.groupby(['is_train', 'buyer_country_id']).sum()).rename(columns={'count': 'count_ratio'})
    return pd.merge(cnt_stat, ratio_stat, on=key, how='outer').sort_values(by=['count'], ascending=False)
groupby_cnt_ratio(df, [])
count | count_ratio | ||
---|---|---|---|
is_train | buyer_country_id | ||
1 | xx | 10635642 | 1.00 |
yy | 2232867 | 1.00 | |
0 | yy | 166832 | 1.00 |
plt.figure(figsize=(8,6))
sns.countplot(x='is_train', data = df, palette=['red', 'blue'], hue='buyer_country_id', order=[1, 0])
plt.xticks(np.arange(2), ('training set', 'testing set'))
plt.xlabel('data')
plt.title('country id');

- The training set covers two countries: country xx has 10635642 samples (~83%), country yy only 2232867 (~17%)
- The test set contains 166832 records from country yy; the training set holds about 13 times as many yy samples as the test set. As the task intends, the goal is to use plentiful data from a mature country to predict the purchases of users in a less mature one.
buyer_admin_id: user ids
print('Users in the training set:', len(df[train]['buyer_admin_id'].unique()))
print('Users in the test set:', len(df[test]['buyer_admin_id'].unique()))
Users in the training set: 809213
Users in the test set: 11398
union = list(set(df[train]['buyer_admin_id'].unique()).intersection(set(df[test]['buyer_admin_id'].unique())))
print('6 users appear in both the training and test sets, ids:', union)
6 users appear in both the training and test sets, ids: [12647969, 13000419, 3106927, 12858772, 12929117, 12368445]
# combine the masks with & in a single indexer to avoid the boolean-reindexing warning
df[train & df['buyer_admin_id'].isin(union)].sort_values(by=['buyer_admin_id','irank']).head(10)
buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7546704 | xx | 3106927 | 7645546 | 2018-08-30 02:49:22 | 1 | 1 | 2018-08-30 | 30 | 2 | 1164.00 | 73781.00 | 770.00 |
4582539 | xx | 3106927 | 4639151 | 2018-08-30 02:49:22 | 2 | 1 | 2018-08-30 | 30 | 2 | 2214.00 | 53190.00 | 1669.00 |
11953258 | xx | 3106927 | 12122118 | 2018-08-30 02:49:22 | 3 | 1 | 2018-08-30 | 30 | 2 | 236.00 | 73781.00 | 884.00 |
255625 | xx | 3106927 | 258860 | 2018-08-30 02:49:22 | 4 | 1 | 2018-08-30 | 30 | 2 | 189.00 | 24221.00 | 900.00 |
7402817 | xx | 3106927 | 7499372 | 2018-08-30 02:49:22 | 5 | 1 | 2018-08-30 | 30 | 2 | 2214.00 | 32535.00 | 2714.00 |
9483312 | xx | 3106927 | 9613063 | 2018-08-30 02:49:22 | 6 | 1 | 2018-08-30 | 30 | 2 | 3069.00 | 73781.00 | 110.00 |
2740080 | xx | 3106927 | 2773189 | 2018-08-27 08:18:23 | 10 | 1 | 2018-08-27 | 27 | 8 | 1865.00 | 49499.00 | 20067.00 |
12152249 | xx | 3106927 | 12324030 | 2018-08-27 07:15:05 | 11 | 1 | 2018-08-27 | 27 | 7 | 880.00 | 92968.00 | 1764.00 |
2201292 | xx | 3106927 | 2227720 | 2018-08-19 02:36:36 | 12 | 1 | 2018-08-19 | 19 | 2 | 1164.00 | 6404.00 | 1900.00 |
6717641 | xx | 3106927 | 6804187 | 2018-08-19 02:33:39 | 13 | 1 | 2018-08-19 | 19 | 2 | 1164.00 | 52421.00 | 230.00 |
df[test & df['buyer_admin_id'].isin(union)].sort_values(by=['buyer_admin_id','irank']).head(3)
buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
13016145 | yy | 3106927 | 202354 | 2018-08-30 02:48:40 | 7 | 0 | 2018-08-30 | 30 | 2 | 642.00 | 24221.00 | 989.00 |
13008981 | yy | 3106927 | 6994414 | 2018-08-29 05:48:06 | 8 | 0 | 2018-08-29 | 29 | 5 | 7.00 | 37411.00 | 1521.00 |
13008982 | yy | 3106927 | 6994414 | 2018-08-29 05:48:06 | 9 | 0 | 2018-08-29 | 29 | 5 | 7.00 | 37411.00 | 1521.00 |
# NB: ids passed as strings never match the int32 buyer_admin_id column, hence the empty result
df[(train) & (df['irank']==1) & (df['buyer_admin_id'].isin(['12858772','3106927','12368445']))]
Empty DataFrame
Columns: [buyer_country_id, buyer_admin_id, item_id, create_order_time, irank, is_train, date, day, hour, cate_id, store_id, item_price]
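One pitfall worth demonstrating: `buyer_admin_id` is stored as an integer, so `isin` with string ids matches nothing, which is why the frame above is empty. A minimal sketch:

```python
import pandas as pd

ids = pd.Series([12858772, 3106927, 12368445], dtype='int32')
print(ids.isin(['3106927']).any())  # -> False: a string never equals an int
print(ids.isin([3106927]).any())    # -> True
```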
Distribution of records per user
admin_cnt = groupby_cnt_ratio(df, 'buyer_admin_id')
admin_cnt.groupby(['is_train','buyer_country_id']).head(3)
count | count_ratio | |||
---|---|---|---|---|
is_train | buyer_country_id | buyer_admin_id | ||
1 | xx | 10828801 | 42751 | 0.00 |
10951390 | 23569 | 0.00 | ||
11223615 | 19933 | 0.00 | ||
yy | 2381782 | 3480 | 0.00 | |
2333316 | 1944 | 0.00 | ||
2365356 | 1686 | 0.00 | ||
0 | yy | 2041038 | 1386 | 0.01 |
2070430 | 399 | 0.00 | ||
1144848 | 286 | 0.00 |
# Purchase records per user: max, min, median
admin_cnt.groupby(['is_train','buyer_country_id'])['count'].agg(['max','min','median'])
max | min | median | ||
---|---|---|---|---|
is_train | buyer_country_id | |||
0 | yy | 1386 | 7 | 11.00 |
1 | xx | 42751 | 8 | 11.00 |
yy | 3480 | 8 | 12.00 |
- The training set covers 809213 users; user 10828801 tops the list with 42751 purchase records, and every training user has at least 8 records
- The test set covers 11398 users; user 2041038 leads with 1386 purchase records, and every test user has at least 7 records
Note: test users can have as few as 7 records because each user's last record was removed
Most users have fewer than 50 records, while a few exceed 10000; the next step explores the distribution in more detail
fig, ax = plt.subplots(1, 2, figsize=(16,6))
ax[0].set(xlabel='records per user')
sns.kdeplot(admin_cnt.loc[(1, 'xx')]['count'].values, ax=ax[0]).set_title('training set: records per user (xx)')
ax[1].set(xlabel='records per user')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].loc[(1, 'yy')]['count'].values, ax=ax[1]).set_title('records per user (yy)')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].loc[(0, 'yy')]['count'].values, ax=ax[1])
# add the legend after both curves exist so the labels attach correctly
ax[1].legend(labels=['training set', 'test set'], loc='upper right');
# Chinese display labels: 记录数 = records per user, 占比 = share,
# 人数 = number of users, 人数占比 = share of users
admin_cnt.columns = ['记录数', '占比']
admin_user_cnt = groupby_cnt_ratio(admin_cnt, '记录数')
admin_user_cnt.columns = ['人数', '人数占比']
admin_user_cnt.head()
人数 | 人数占比 | |||
---|---|---|---|---|
is_train | buyer_country_id | 记录数 | ||
1 | xx | 8 | 118155 | 0.18 |
9 | 91757 | 0.14 | ||
10 | 72936 | 0.11 | ||
11 | 57678 | 0.09 | ||
12 | 46534 | 0.07 |
# Country xx: number of users for each records-per-user value
admin_user_cnt.loc[(1,'xx')][['人数','人数占比']].T
/var/folders/hh/1l1hnwqj0nz_8_nfwygkd3mr0000gn/T/ipykernel_55965/313832335.py:2: PerformanceWarning: indexing past lexsort depth may impact performance.
admin_user_cnt.loc[(1,'xx')][['人数','人数占比']].T
记录数 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | ... | 521 | 556 | 526 | 528 | 529 | 537 | 545 | 549 | 550 | 554 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
人数 | 118155.00 | 91757.00 | 72936.00 | 57678.00 | 46534.00 | 38114.00 | 31432.00 | 26735.00 | 22352.00 | 18742.00 | ... | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
人数占比 | 0.18 | 0.14 | 0.11 | 0.09 | 0.07 | 0.06 | 0.05 | 0.04 | 0.03 | 0.03 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
2 rows × 506 columns
# Country yy: share of users by records-per-user value, train vs. test
admin_user_cnt.loc[([1,0], 'yy', slice(None))][['人数', '人数占比']].unstack(0).drop(columns='人数').head(10)
人数占比 | |||
---|---|---|---|
is_train | 0 | 1 | |
buyer_country_id | 记录数 | ||
yy | 8 | 0.13 | 0.16 |
9 | 0.10 | 0.13 | |
10 | 0.09 | 0.10 | |
11 | 0.08 | 0.08 | |
12 | 0.06 | 0.07 | |
13 | 0.05 | 0.06 | |
14 | 0.04 | 0.05 | |
15 | 0.04 | 0.04 | |
16 | 0.03 | 0.03 | |
17 | 0.03 | 0.03 |
A simple baseline
# Use each user's 30 most recent purchases as the prediction, most recent first;
# users with fewer than 30 purchases are padded with the best-selling item 5595070
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
tmp = test[test['irank']<=31].sort_values(by=['buyer_country_id', 'buyer_admin_id', 'irank'])[['buyer_admin_id','item_id','irank']]
sub = tmp.set_index(['buyer_admin_id', 'irank']).unstack(-1)
sub.fillna(5595070).astype(int).reset_index().to_csv('./sub.csv', index=False, header=False)
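The `set_index(...).unstack(-1)` step above pivots one row per (user, irank) into one wide row per user; a toy sketch of just that reshape (made-up values):

```python
import pandas as pd

tmp = pd.DataFrame({'buyer_admin_id': [152, 152, 282],
                    'irank': [2, 3, 2],
                    'item_id': [8410857, 7937154, 11721802]})
wide = tmp.set_index(['buyer_admin_id', 'irank']).unstack(-1)
# One row per buyer; columns become ('item_id', irank) pairs, NaN where missing.
print(wide.shape)  # -> (2, 2)
```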
# Final submission file format
sub = pd.read_csv('./sub.csv', header=None)
sub.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 152 | 8410857 | 7937154 | 8472223 | 4016066 | 9891513 | 8064216 | 8351840 | 5595070 | 5595070 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
1 | 282 | 11721802 | 7665423 | 7665423 | 10808393 | 11310708 | 623582 | 6547607 | 2605373 | 688799 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
2 | 321 | 1461800 | 7379845 | 9243286 | 7379845 | 627849 | 5000759 | 11774753 | 10932288 | 4813286 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
3 | 809 | 2347616 | 5707010 | 6339286 | 5492003 | 1207574 | 5707010 | 5492003 | 1207574 | 2262443 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
4 | 870 | 11382694 | 5999244 | 6611583 | 7412272 | 4343647 | 5546383 | 3432696 | 9589237 | 6163411 | ... | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 | 5595070 |
5 rows × 31 columns