A Simple Baseline for the Ali Cross-Border E-Commerce Intelligent Algorithm Competition

Competition Overview

Data provided to contestants

Item attribute table: the data covers 2,840,536 items. For most of them, the category id, store id, and an encrypted price are provided; the price-encryption function f(x) is a monotonically increasing function.
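Because f(x) is monotonically increasing, the ordering of encrypted prices matches the ordering of the true prices, so rank- or comparison-based price features remain meaningful. A minimal sketch of such a feature, assuming the item_attr columns listed below (the price_rank_in_cate column is only an illustration, not part of the baseline):

import pandas as pd

item = pd.read_csv('./Antai_AE_round1_item_attr_20190626.csv')
# f(x) is monotone increasing, so ranking the encrypted prices is equivalent to
# ranking the original prices, e.g. a within-category price rank:
item['price_rank_in_cate'] = item.groupby('cate_id')['item_price'].rank(method='dense')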

Training data: purchase records of users from country xx, plus purchase records of part A of the users from country yy. Overall statistics:

| Country | Records | Buyers |
| --- | --- | --- |
| xx | 10635642 | 670631 |
| yy | 2232867 | 138678 |

Test data: purchase records of part B of the users from country yy, with each user's last record removed. Overall statistics:

| Country | Records | Buyers |
| --- | --- | --- |
| yy | 166832 | 11398 |

The item attribute table, training data, and test data correspond to the files item_attr, train, and test.

Data format:
Both the training data and the test data share the following format:

| buyer_country_id | buyer_admin_id | item_id | create_order_time | irank |
| --- | --- | --- | --- | --- |
| xx | 817731 | 4033525 | 2018-06-12 07:12:58 | 1 |
| xx | 817731 | 98120 | 2018-06-11 07:12:58 | 2 |

The fields are as follows:
buyer_country_id: buyer country id; the only possible values are 'xx' and 'yy'

buyer_admin_id: buyer id

item_id: item id

create_order_time: order creation time

irank: for each buyer, the rank of that buyer's records in reverse chronological order
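In other words, irank=1 is a buyer's most recent purchase and larger irank values are older orders. A rough sketch of how irank could be recomputed from the timestamps, assuming a DataFrame df with the columns above (ties on identical timestamps are broken arbitrarily, so tied orders may come out in a different order than the provided irank):

# Rank each buyer's orders from newest (irank 1) to oldest
df['irank_check'] = (df.groupby('buyer_admin_id')['create_order_time']
                       .rank(method='first', ascending=False)
                       .astype(int))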

Dataset characteristics:
1) Every user has at least 7 purchase records;
2) For every user in the test data, the item in that user's last purchase record is guaranteed to appear in the training data;
3) A small number of users have purchase records in both countries; these records are ignored during evaluation;

Required submission
For each user in part B of country yy, a Top-30 prediction of that user's final purchase.
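The official evaluation details are not restated here, but a simple offline check is to hold out the irank=1 record of some training users and measure where the true last item lands in the predicted Top-30. A minimal sketch of such a hit-rate / reciprocal-rank style check (truth and pred_items are hypothetical dicts built by the reader; this is an assumed validation scheme, not the official scoring code):

def offline_score(truth, pred_items, k=30):
    # truth: buyer_admin_id -> the held-out last item_id
    # pred_items: buyer_admin_id -> ranked list of predicted item_ids
    hits, rr = 0, 0.0
    for admin_id, true_item in truth.items():
        preds = pred_items.get(admin_id, [])[:k]
        if true_item in preds:
            hits += 1
            rr += 1.0 / (preds.index(true_item) + 1)
    n = len(truth)
    return hits / n, rr / n  # hit rate and mean reciprocal rank within Top-k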

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import gc
%matplotlib inline
# Disable scientific notation in displayed output
pd.set_option('display.float_format',lambda x : '%.2f' % x)
# Load the data
item = pd.read_csv('./Antai_AE_round1_item_attr_20190626.csv')
train = pd.read_csv('./Antai_AE_round1_train_20190626.csv')
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
submit = pd.read_csv('./Antai_AE_round1_submit_20190715.csv')

Data Preprocessing

  • Concatenate the train and test files
  • Extract date features (date, day, hour)
  • Join in each item's price, category, and store
  • Downcast every column to the smallest dtype that can hold it, to reduce memory usage
  • Save to an HDF5 file to speed up reloading
# Concatenate the train and test files
df = pd.concat([train.assign(is_train=1), test.assign(is_train=0)])

# Extract date features (date, day, hour)
df['create_order_time'] = pd.to_datetime(df['create_order_time'])
df['date'] = df['create_order_time'].dt.date
df['day'] = df['create_order_time'].dt.day
df['hour'] = df['create_order_time'].dt.hour

df = pd.merge(df, item, how='left', on='item_id')

memory = df.memory_usage().sum() / 1024**2 
print('Before memory usage of properties dataframe is :', memory, " MB")

# Downcast each column to the smallest dtype that can hold it, to reduce memory usage
dtype_dict = {'buyer_admin_id' : 'int32', 
              'item_id' : 'int32', 
              'store_id' : pd.Int32Dtype(),
              'irank' : 'int32',   # int16 would overflow: irank reaches 42751 for the heaviest user
              'item_price' : pd.Int16Dtype(),
              'cate_id' : pd.Int16Dtype(),
              'is_train' : 'int8',
              'day' : 'int8',
              'hour' : 'int8',
             }

df = df.astype(dtype_dict)
memory = df.memory_usage().sum() / 1024**2 
print('After memory usage of properties dataframe is :', memory, " MB")
del train,test; gc.collect()
Before memory usage of properties dataframe is : 1093.9694747924805  MB
After memory usage of properties dataframe is : 596.7106781005859  MB

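As an alternative to spelling out dtype_dict by hand, a generic helper can downcast every numeric column automatically with pd.to_numeric. This is only a sketch of that alternative, not part of the pipeline above (columns containing missing values stay as floats here, whereas the manual dict uses pandas' nullable integer dtypes):

def downcast_numeric(frame):
    # Downcast int and float columns to the smallest dtype that can hold their values
    for col in frame.select_dtypes(include=['int64', 'int32', 'int16']).columns:
        frame[col] = pd.to_numeric(frame[col], downcast='integer')
    for col in frame.select_dtypes(include=['float64']).columns:
        frame[col] = pd.to_numeric(frame[col], downcast='float')
    return frame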
# Save to an HDF5 file to speed up reloading
for col in ['store_id', 'item_price', 'cate_id']:
    # PyTables cannot store pandas' nullable Int dtypes, so fill the missing values,
    # cast to plain int32, then restore NaN (0 never occurs as a real id or price)
    df[col] = df[col].fillna(0).astype(np.int32).replace(0, np.nan)
df.to_hdf('./train_test.h5', '1.0')
/opt/anaconda3/lib/python3.8/site-packages/tables/path.py:137: NaturalNameWarning: object name is not a valid Python identifier: '1.0'; it does not match the pattern ``^[a-zA-Z_][a-zA-Z0-9_]*$``; you will not be able to use natural naming to access this object; using ``getattr()`` will still work, though
  check_attribute_name(name)
/var/folders/hh/1l1hnwqj0nz_8_nfwygkd3mr0000gn/T/ipykernel_55965/1022970888.py:4: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block5_values] [items->Index(['buyer_country_id', 'date'], dtype='object')]

  df.to_hdf('./train_test.h5', '1.0')
%%time
# Stored as HDF5, loading takes about 5 s instead of about 9 s from CSV
df = pd.read_hdf('./train_test.h5', '1.0')
CPU times: user 3.84 s, sys: 1.49 s, total: 5.33 s
Wall time: 5.39 s
%%time
train = pd.read_csv('./Antai_AE_round1_train_20190626.csv')
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
item = pd.read_csv('./Antai_AE_round1_item_attr_20190626.csv')
del train, test; gc.collect()
CPU times: user 8.65 s, sys: 716 ms, total: 9.36 s
Wall time: 9.51 s
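Another option, sketched here under the same file names, is to pass compact dtypes directly to read_csv so the large CSVs never materialize with int64 columns in the first place:

# Specify compact dtypes at read time instead of converting afterwards
csv_dtypes = {'buyer_admin_id': 'int32', 'item_id': 'int32', 'irank': 'int32'}
train = pd.read_csv('./Antai_AE_round1_train_20190626.csv', dtype=csv_dtypes)
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv', dtype=csv_dtypes)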

Data Exploration

  • Users, items, stores, categories, and even item prices are encoded as integers starting from 1
  • Order timestamps use the format YYYY-mm-dd HH:mm:ss
  • The raw data contains no null values, but some items are missing from the item attribute table, so their price, category, and store information is unavailable.
df.head()
| | buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | xx | 8362078 | 1 | 2018-08-10 23:49:44 | 12 | 1 | 2018-08-10 | 10 | 23 | 2324.00 | 10013.00 | 4501.00 |
| 1 | xx | 9694304 | 2 | 2018-08-03 23:55:07 | 9 | 1 | 2018-08-03 | 3 | 23 | 3882.00 | 4485.00 | 2751.00 |
| 2 | yy | 101887 | 3 | 2018-08-27 08:31:26 | 3 | 1 | 2018-08-27 | 27 | 8 | 155.00 | 8341.00 | 656.00 |
| 3 | xx | 8131786 | 3 | 2018-08-31 06:00:19 | 9 | 1 | 2018-08-31 | 31 | 6 | 155.00 | 8341.00 | 656.00 |
| 4 | xx | 9778613 | 5 | 2018-08-21 06:01:56 | 14 | 1 | 2018-08-21 | 21 | 6 | 1191.00 | 1949.00 | 1689.00 |
# Count null values in each column
for pdf in [df, item]:
    for col in pdf.columns:
        print(col, pdf[col].isnull().sum())
buyer_country_id 0
buyer_admin_id 0
item_id 0
create_order_time 0
irank 0
is_train 0
date 0
day 0
hour 0
cate_id 26119
store_id 26119
item_price 26119
item_id 0
cate_id 0
store_id 0
item_price 0
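The 26,119 rows with missing cate_id, store_id, and item_price are exactly the order rows whose item does not appear in the item attribute table; a quick sketch to list those items:

# item_ids that occur in orders but are absent from the item attribute table
missing_items = df.loc[df['cate_id'].isnull(), 'item_id'].unique()
print('items without attributes:', len(missing_items))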
df.describe()
| | buyer_admin_id | item_id | create_order_time | irank | is_train | day | hour | cate_id | store_id | item_price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 13035341.00 | 13035341.00 | 13035341 | 13035341.00 | 13035341.00 | 13035341.00 | 13035341.00 | 13009222.00 | 13009222.00 | 13009222.00 |
| mean | 6527293.86 | 6522519.78 | 2018-08-18 23:20:45.258000384 | 143.62 | 0.99 | 18.62 | 9.06 | 1498.53 | 40575.67 | 1099.75 |
| min | 1.00 | 1.00 | 2018-07-13 05:54:54 | -32768.00 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 |
| 25% | 3269515.00 | 3261386.00 | 2018-08-10 19:40:33 | 4.00 | 1.00 | 10.00 | 4.00 | 616.00 | 20648.00 | 123.00 |
| 50% | 6528429.00 | 6522878.00 | 2018-08-19 13:55:45 | 8.00 | 1.00 | 19.00 | 8.00 | 1505.00 | 39368.00 | 246.00 |
| 75% | 9787265.00 | 9784900.00 | 2018-08-27 11:57:00 | 16.00 | 1.00 | 27.00 | 13.00 | 2010.00 | 59273.00 | 700.00 |
| max | 13046721.00 | 13046734.00 | 2018-08-31 23:59:57 | 32767.00 | 1.00 | 31.00 | 23.00 | 4243.00 | 95105.00 | 20230.00 |
| std | 3764280.24 | 3765432.09 | NaN | 1573.84 | 0.11 | 9.21 | 6.56 | 903.26 | 24284.46 | 2880.00 |
item.describe()
| | item_id | cate_id | store_id | item_price |
| --- | --- | --- | --- | --- |
| count | 2832669.00 | 2832669.00 | 2832669.00 | 2832669.00 |
| mean | 6429138.00 | 1481.10 | 40256.46 | 1124.00 |
| std | 3725431.44 | 923.09 | 24370.92 | 2110.62 |
| min | 1.00 | 1.00 | 1.00 | 1.00 |
| 25% | 3224114.00 | 600.00 | 19850.00 | 180.00 |
| 50% | 6391845.00 | 1499.00 | 38954.00 | 400.00 |
| 75% | 9636216.00 | 2050.00 | 58406.00 | 1200.00 |
| max | 13046734.00 | 4243.00 | 95105.00 | 20230.00 |

Data Inspection

Training Set vs. Test Set

train = df['is_train']==1
test = df['is_train']==0
train_count = len(df[train])
print('Training set size:', train_count)
test_count = len(df[test])
print('Test set size:', test_count)
print('Train/test ratio:', train_count/test_count)
Training set size: 12868509
Test set size: 166832
Train/test ratio: 77.13453653975256
# buyer_country_id: records per country
def groupby_cnt_ratio(df, col):
    if isinstance(col, str):
        col = [col]
    key = ['is_train', 'buyer_country_id'] + col
    
    # groupby function
    cnt_stat = df.groupby(key).size().to_frame('count')
    ratio_stat = (cnt_stat / cnt_stat.groupby(['is_train', 'buyer_country_id']).sum()).rename(columns={'count':'count_ratio'})
    return pd.merge(cnt_stat, ratio_stat, on=key, how='outer').sort_values(by=['count'], ascending=False)
groupby_cnt_ratio(df, [])
| is_train | buyer_country_id | count | count_ratio |
| --- | --- | --- | --- |
| 1 | xx | 10635642 | 1.00 |
| 1 | yy | 2232867 | 1.00 |
| 0 | yy | 166832 | 1.00 |
plt.figure(figsize=(8,6))
sns.countplot(x='is_train', data = df, palette=['red', 'blue'], hue='buyer_country_id', order=[1, 0])
plt.xticks(np.arange(2), ('training set', 'testing set'))
plt.xlabel('data')
plt.title('country id');
![在这里插入图片描述](https://img-blog.csdnimg.cn/direct/1fdc22c494a744fd95fbccfd3f8b2032.png#pic_center)
  • The training set contains data from two countries: country xx contributes 10,635,642 records (about 83%), country yy only 2,232,867 (about 17%)
  • The test set contains 166,832 records from country yy; the training set has roughly 13 times as many yy records as the test set. As the competition intends, the goal is to use plentiful data from a mature market to predict the purchases of users in a less mature one.

buyer_admin_id: Buyer IDs

print('Number of users in the training set:', len(df[train]['buyer_admin_id'].unique()))
print('Number of users in the test set:', len(df[test]['buyer_admin_id'].unique()))
Number of users in the training set: 809213
Number of users in the test set: 11398
union = list(set(df[train]['buyer_admin_id'].unique()).intersection(set(df[test]['buyer_admin_id'].unique())))  # intersection of the two user sets
print('The following 6 users appear in both the training and test sets:', union)
The following 6 users appear in both the training and test sets: [12647969, 13000419, 3106927, 12858772, 12929117, 12368445]
df[train][df['buyer_admin_id'].isin(union)].sort_values(by=['buyer_admin_id','irank']).head(10)
/var/folders/hh/1l1hnwqj0nz_8_nfwygkd3mr0000gn/T/ipykernel_55965/3035833051.py:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  df[train][df['buyer_admin_id'].isin(union)].sort_values(by=['buyer_admin_id','irank']).head(10)
| | buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7546704 | xx | 3106927 | 7645546 | 2018-08-30 02:49:22 | 1 | 1 | 2018-08-30 | 30 | 2 | 1164.00 | 73781.00 | 770.00 |
| 4582539 | xx | 3106927 | 4639151 | 2018-08-30 02:49:22 | 2 | 1 | 2018-08-30 | 30 | 2 | 2214.00 | 53190.00 | 1669.00 |
| 11953258 | xx | 3106927 | 12122118 | 2018-08-30 02:49:22 | 3 | 1 | 2018-08-30 | 30 | 2 | 236.00 | 73781.00 | 884.00 |
| 255625 | xx | 3106927 | 258860 | 2018-08-30 02:49:22 | 4 | 1 | 2018-08-30 | 30 | 2 | 189.00 | 24221.00 | 900.00 |
| 7402817 | xx | 3106927 | 7499372 | 2018-08-30 02:49:22 | 5 | 1 | 2018-08-30 | 30 | 2 | 2214.00 | 32535.00 | 2714.00 |
| 9483312 | xx | 3106927 | 9613063 | 2018-08-30 02:49:22 | 6 | 1 | 2018-08-30 | 30 | 2 | 3069.00 | 73781.00 | 110.00 |
| 2740080 | xx | 3106927 | 2773189 | 2018-08-27 08:18:23 | 10 | 1 | 2018-08-27 | 27 | 8 | 1865.00 | 49499.00 | 20067.00 |
| 12152249 | xx | 3106927 | 12324030 | 2018-08-27 07:15:05 | 11 | 1 | 2018-08-27 | 27 | 7 | 880.00 | 92968.00 | 1764.00 |
| 2201292 | xx | 3106927 | 2227720 | 2018-08-19 02:36:36 | 12 | 1 | 2018-08-19 | 19 | 2 | 1164.00 | 6404.00 | 1900.00 |
| 6717641 | xx | 3106927 | 6804187 | 2018-08-19 02:33:39 | 13 | 1 | 2018-08-19 | 19 | 2 | 1164.00 | 52421.00 | 230.00 |
df[test][df['buyer_admin_id'].isin(union)].sort_values(by=['buyer_admin_id','irank']).head(3)
/var/folders/hh/1l1hnwqj0nz_8_nfwygkd3mr0000gn/T/ipykernel_55965/3087106983.py:1: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  df[test][df['buyer_admin_id'].isin(union)].sort_values(by=['buyer_admin_id','irank']).head(3)
| | buyer_country_id | buyer_admin_id | item_id | create_order_time | irank | is_train | date | day | hour | cate_id | store_id | item_price |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 13016145 | yy | 3106927 | 202354 | 2018-08-30 02:48:40 | 7 | 0 | 2018-08-30 | 30 | 2 | 642.00 | 24221.00 | 989.00 |
| 13008981 | yy | 3106927 | 6994414 | 2018-08-29 05:48:06 | 8 | 0 | 2018-08-29 | 29 | 5 | 7.00 | 37411.00 | 1521.00 |
| 13008982 | yy | 3106927 | 6994414 | 2018-08-29 05:48:06 | 9 | 0 | 2018-08-29 | 29 | 5 | 7.00 | 37411.00 | 1521.00 |
df[(train) & (df['irank']==1) & (df['buyer_admin_id'].isin([12858772, 3106927, 12368445]))]  # buyer_admin_id is numeric, so isin needs ints, not strings

Distribution of Records per User

admin_cnt = groupby_cnt_ratio(df, 'buyer_admin_id')
admin_cnt.groupby(['is_train','buyer_country_id']).head(3)
| is_train | buyer_country_id | buyer_admin_id | count | count_ratio |
| --- | --- | --- | --- | --- |
| 1 | xx | 10828801 | 42751 | 0.00 |
| 1 | xx | 10951390 | 23569 | 0.00 |
| 1 | xx | 11223615 | 19933 | 0.00 |
| 1 | yy | 2381782 | 3480 | 0.00 |
| 1 | yy | 2333316 | 1944 | 0.00 |
| 1 | yy | 2365356 | 1686 | 0.00 |
| 0 | yy | 2041038 | 1386 | 0.01 |
| 0 | yy | 2070430 | 399 | 0.00 |
| 0 | yy | 1144848 | 286 | 0.00 |
# Per-user purchase counts: max, min, median
admin_cnt.groupby(['is_train','buyer_country_id'])['count'].agg(['max','min','median'])
| is_train | buyer_country_id | max | min | median |
| --- | --- | --- | --- | --- |
| 0 | yy | 1386 | 7 | 11.00 |
| 1 | xx | 42751 | 8 | 11.00 |
| 1 | yy | 3480 | 8 | 12.00 |
  • The training set covers 809,213 users; user 10828801 tops the list with 42,751 purchase records, and every training user has at least 8 records
  • The test set covers 11,398 users; user 2041038 leads with 1,386 purchase records, and every test user has at least 7 records
    Note: test users can have as few as 7 records because each user's last record has been removed

Most users have fewer than 50 records, while a small number have more than 10,000; the next step is to look at the distribution of per-user record counts in more detail.

fig, ax = plt.subplots(1, 2, figsize=(16,6))
ax[0].set(xlabel='records per user')
sns.kdeplot(admin_cnt.loc[(1, 'xx')]['count'].values, ax=ax[0]).set_title('Training set: records per user, country xx')

ax[1].set(xlabel='records per user')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].loc[(1, 'yy')]['count'].values, ax=ax[1]).set_title('Records per user, country yy')
sns.kdeplot(admin_cnt[admin_cnt['count']<50].loc[(0, 'yy')]['count'].values, ax=ax[1])
ax[1].legend(labels=['train', 'test'], loc="upper right");

(Figure: KDE plots of per-user record counts; left: country xx training set, right: country yy training vs. test sets, truncated at 50 records.)

admin_cnt.columns = ['n_records', 'ratio']
admin_user_cnt = groupby_cnt_ratio(admin_cnt, 'n_records')
admin_user_cnt.columns = ['n_users', 'user_ratio']
admin_user_cnt.head()

| is_train | buyer_country_id | n_records | n_users | user_ratio |
| --- | --- | --- | --- | --- |
| 1 | xx | 8 | 118155 | 0.18 |
| 1 | xx | 9 | 91757 | 0.14 |
| 1 | xx | 10 | 72936 | 0.11 |
| 1 | xx | 11 | 57678 | 0.09 |
| 1 | xx | 12 | 46534 | 0.07 |


# Country xx: number of users at each record count
admin_user_cnt.loc[(1,'xx')][['n_users','user_ratio']].T
/var/folders/hh/1l1hnwqj0nz_8_nfwygkd3mr0000gn/T/ipykernel_55965/313832335.py:2: PerformanceWarning: indexing past lexsort depth may impact performance.
  admin_user_cnt.loc[(1,'xx')][['n_users','user_ratio']].T

| n_records | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | ... | 521 | 556 | 526 | 528 | 529 | 537 | 545 | 549 | 550 | 554 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| n_users | 118155.00 | 91757.00 | 72936.00 | 57678.00 | 46534.00 | 38114.00 | 31432.00 | 26735.00 | 22352.00 | 18742.00 | ... | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| user_ratio | 0.18 | 0.14 | 0.11 | 0.09 | 0.07 | 0.06 | 0.05 | 0.04 | 0.03 | 0.03 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |

2 rows × 506 columns

# Country yy: share of users at each record count
admin_user_cnt.loc[([1,0], 'yy', slice(None))][['n_users', 'user_ratio']].unstack(0).drop(columns='n_users').head(10)

| buyer_country_id | n_records | user_ratio (is_train=0) | user_ratio (is_train=1) |
| --- | --- | --- | --- |
| yy | 8 | 0.13 | 0.16 |
| yy | 9 | 0.10 | 0.13 |
| yy | 10 | 0.09 | 0.10 |
| yy | 11 | 0.08 | 0.08 |
| yy | 12 | 0.06 | 0.07 |
| yy | 13 | 0.05 | 0.06 |
| yy | 14 | 0.04 | 0.05 |
| yy | 15 | 0.04 | 0.04 |
| yy | 16 | 0.03 | 0.03 |
| yy | 17 | 0.03 | 0.03 |

A Simple Baseline

# Use each user's 30 most recent purchases as the predictions, placing more recently bought items
# in earlier columns; users with fewer than 30 records are padded with the best-selling item 5595070
test = pd.read_csv('./Antai_AE_round1_test_20190626.csv')
tmp = test[test['irank']<=31].sort_values(by=['buyer_country_id', 'buyer_admin_id', 'irank'])[['buyer_admin_id','item_id','irank']]
sub = tmp.set_index(['buyer_admin_id', 'irank']).unstack(-1)
sub.fillna(5595070).astype(int).reset_index().to_csv('./sub.csv', index=False, header=None)

# Format of the final submission file
sub = pd.read_csv('./sub.csv', header = None)
sub.head()
Each row of sub.csv contains a buyer_admin_id followed by 30 predicted item_ids (most recent purchases first), padded out to 30 columns with item 5595070.

5 rows × 31 columns
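One small variation that could be tried on top of this baseline, sketched below, is to drop duplicate items within each user's recent history and to pad short lists with the best-selling items from the yy training data instead of repeating a single item. The helper is hypothetical, assumes the preprocessed df from above, and its effect on the score is not tested here.

# Hypothetical variation: dedupe a user's recent items, then pad with yy best sellers
yy_hot = (df[(df['is_train'] == 1) & (df['buyer_country_id'] == 'yy')]['item_id']
          .value_counts().index.tolist())

def build_top30(user_items, k=30):
    # user_items: a user's item_ids ordered by irank (most recent first)
    row, seen = [], set()
    for it in user_items:
        if it not in seen:
            seen.add(it)
            row.append(it)
        if len(row) == k:
            return row
    for it in yy_hot:          # pad with popular items the user has not bought
        if it in seen:
            continue
        row.append(it)
        if len(row) == k:
            break
    return row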
