【Kaggle】练习赛《预测贴纸的销量》(下)

前言

上篇利用各地区的GDP数据还填充目标标签的缺失值;中篇顺着这个思路,利用这个原理来预测未来的销量,具体方法思路:先一一对国家、产品和商店进行汇总,然后对未来三年的每日销售额进行预测,然后再进行分解,得到每个国家、产品和商店的销售额;本篇利用机器学习的常规思路进行预测未来的销量。
分别采用了不同的方案,对时间序列的特点,根据上篇EDA的特性,增加了周期性的内容,如何正弦,余弦特征列等。

加载库

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

加载数据

train = pd.read_csv("/kaggle/input/playground-series-s5e1/train.csv")
test = pd.read_csv("/kaggle/input/playground-series-s5e1/test.csv")

方案一

# 复制一份原始数据
train_data=train.copy()
test_data=test.copy()
# 查看缺失值情况
train_data.isna().sum().sort_values(ascending=False)

num_sold 8871
id 0
date 0
country 0
store 0
product 0
dtype: int64

# 直接删除
train_data = train_data.dropna()
EDA
train_data['date'] = pd.to_datetime(train_data['date'])
test_data['date'] = pd.to_datetime(test_data['date'])

train_data['Year'] = train_data['date'].dt.year
train_data['Month'] = train_data['date'].dt.month
train_data['Day'] = train_data['date'].dt.day

test_data['Year'] = test_data['date'].dt.year
test_data['Month'] = test_data['date'].dt.month
test_data['Day'] = test_data['date'].dt.day

简单的处理一下年月日,根据上篇EDA基本特征。

# 删除 date
train_data.drop('date',axis=1,inplace=True)
test_data.drop('date',axis=1,inplace=True)
分离出分类和数据特征
train_data = train_data.drop('id', axis = 1)
num_cols = list(train_data.select_dtypes(exclude=['object']).columns.difference(['num_sold']))
cat_cols = list(train_data.select_dtypes(include=['object']).columns)

test_data = test_data.drop('id', axis = 1)
num_cols_test = list(test_data.select_dtypes(exclude=['object']).columns.difference(['id']))
cat_cols_test = list(test_data.select_dtypes(include=['object']).columns)
对分类特征进行编码
from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
label_encoders = {col: LabelEncoder() for col in cat_cols}

# Apply LabelEncoder to each categorical column
for col in cat_cols:
    train_data[col] = label_encoders[col].fit_transform(train_data[col])
    test_data[col] = label_encoders[col].transform(test_data[col])
对分离X、y ,并分隔训练集和验证集
from sklearn.model_selection import train_test_split
X = train_data.drop(['num_sold'], axis=1)
y = train_data['num_sold']


# Split datainto training set and test set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
引入xgbboost的回归算法库,定义评估函数
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
# Define MAPE metric
def mape(y_true, y_pred):
    return mean_absolute_percentage_error(y_true, y_pred)
def mse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred)
用xgbboost的回归算法建模
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
预测和评估
# 预测和评估
y_pred = xgb.predict(X_valid)

score = mape(y_valid, y_pred)
me=mse(y_valid, y_pred)
score,me

(0.30150839272752505, 9599.83371487707)

方案二

查看标签的偏度和峰度
train_data['num_sold'].skew(),train_data['num_sold'].kurtosis()

(1.4153734524983919, 2.61233506292136)

发现数据分布偏度较大

画图查看数据分布
sns.histplot(data=y, kde=True, stat="density");

 0

从图形看出,呈现明显偏态情况,因此对数据进行对数处理,再看图形

sns.histplot(data=np.log(y), kde=True, stat="density");

01

经过对数处理后,偏态情况明显改善,因此以对数后的值作为标签进行建模

对数处理后
y=np.log1p(y)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

不直接用log函数,而采用np.log1p函数,即加1后求对数 ,更好避免 0 求对数,同理,评估和预测时,是需要返回的,即使用np.expm1的方式。

建模、预测、评估
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
 # 预测和评估
y_pred = xgb.predict(X_valid)

# score = mape(y_valid, y_pred) # 注意这里不对的,一定要返回原来的值,=否则不能比较了
# me=mse(y_valid, y_pred)
# (0.012581281054301561, 0.00724846629002101)

score = mape(np.expm1(y_valid), np.expm1(y_pred))
me=mse(np.expm1(y_valid), np.expm1(y_pred))
score,me

(0.06777639649910845, 8359.441860251725)

效果提升明显

评测试结果

方案MAPE 得分处理方式
方案一0.3015年月日分离
方案二0.0678方案一基础上将标签对数处理

方案三

train_data=train.copy()
test_data=test.copy()
train_data = train_data.dropna()
常规年月日处理
train_data['date'] = pd.to_datetime(train_data['date'])
test_data['date'] = pd.to_datetime(test_data['date'])

train_data['year'] = train_data['date'].dt.year
train_data['month'] = train_data['date'].dt.month
train_data['day'] = train_data['date'].dt.day

test_data['year'] = test_data['date'].dt.year
test_data['month'] = test_data['date'].dt.month
test_data['day'] = test_data['date'].dt.day
增加GDP权重比率

这里增加的方式与前面的方式不一样,直接用API调用的方式,灵活方便

# Function to fetch GDP per capita
import requests
def get_gdp_per_capita(country, year):
    alpha3 = {
        'Canada': 'CAN', 'Finland': 'FIN', 'Italy': 'ITA',
        'Kenya': 'KEN', 'Norway': 'NOR', 'Singapore': 'SGP'
    }
    url = f"https://api.worldbank.org/v2/country/{alpha3[country]}/indicator/NY.GDP.PCAP.CD?date={year}&format=json"
    response = requests.get(url).json()
    try:
        return response[1][0]['value']
    except (IndexError, TypeError):
        return None

countries = ['Canada', 'Finland', 'Italy', 'Kenya', 'Norway', 'Singapore']
years = range(2010, 2020)
gdp_data = {}

for country in countries:
    for year in years:
        gdp_data[(country, year)] = get_gdp_per_capita(country, year)

# Add GDP feature to train and test DataFrames
def add_gdp_feature(df):
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year  # Extract year from the date
    df['gdp'] = df.apply(lambda row: gdp_data.get((row['country'], row['year']), None), axis=1)
    return df

# Apply to train and test datasets
train_data = add_gdp_feature(train_data)
test_data = add_gdp_feature(test_data)

填充训练集和测试GDP数据

from sklearn.preprocessing import LabelEncoder
# Initialize LabelEncoder
label_encoders = {col: LabelEncoder() for col in cat_cols}

# Apply LabelEncoder to each categorical column
for col in cat_cols:
    train_data[col] = label_encoders[col].fit_transform(train_data[col])
    test_data[col] = label_encoders[col].transform(test_data[col])
train_date=train_data.pop('date')
test_date=test_data.pop('date')
train_data = train_data.drop('id', axis = 1)
num_cols = list(train_data.select_dtypes(exclude=['object']).columns.difference(['num_sold']))
cat_cols = list(train_data.select_dtypes(include=['object']).columns)

test_data = test_data.drop('id', axis = 1)
num_cols_test = list(test_data.select_dtypes(exclude=['object']).columns.difference(['id']))
cat_cols_test = list(test_data.select_dtypes(include=['object']).columns)
查看数据
train_data.head()
-countrystoreproductnum_soldyearmonthdaygdp
1001973.020101147560.666601
2002906.020101147560.666601
3003423.020101147560.666601
4004491.020101147560.666601
5020300.020101147560.666601
X = train_data.drop(['num_sold'], axis=1)
y = np.log1p(train_data['num_sold'])
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
 # 预测和评估
y_pred = xgb.predict(X_valid)
score = mape(np.expm1(y_valid), np.expm1(y_pred))
me=mse(np.expm1(y_valid), np.expm1(y_pred))
score,me

(0.06931576130020592, 8393.297674737376)

test_pred = xgb.predict(test_data)
submission = pd.DataFrame({'id': test['id'], 'num_sold': np.expm1(test_pred)})
print(submission.head())
submission.to_csv('submission_xgb_log.csv', index=False)
-idnum_sold
0230130134.386810
1230131759.479980
2230132680.182922
3230133350.549652
4230134413.237976

方案四

增加星期几的信息
train_data['weekday']=train_date.dt.weekday
test_data['weekday'] = test_date.dt.weekday
X = train_data.drop(['num_sold'], axis=1)
y = np.log1p(train_data['num_sold'])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
 # 预测和评估
y_pred = xgb.predict(X_valid)
score = mape(np.expm1(y_valid), np.expm1(y_pred))

(0.045671587590535107, 3451.7557869198386)

test_pred = xgb.predict(test_data)
submission = pd.DataFrame({'id': test['id'], 'num_sold': np.expm1(test_pred)})
print(submission.head())
submission.to_csv('submission_xgb_log11.csv', index=False)
-idnum_sold
0230130162.664078
1230131892.149536
2230132819.040649
3230133412.940216
4230134496.703644

方案五

train_data=train.copy()
test_data=test.copy()
train_data = train_data.dropna()
def eda(df):
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year
    df['day'] = df['date'].dt.day
    df['month'] = df['date'].dt.month
    df['quarter'] = df['date'].dt.quarter
    df['month_name'] = df['date'].dt.month_name()
    df['day_of_week'] = df['date'].dt.day_name()
    df['week'] = df['date'].dt.isocalendar().week
    df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12) 
    df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
    df['quarter_sin'] = np.sin(2 * np.pi * df['quarter'] / 4)
    df['quarter_cos'] = np.cos(2 * np.pi * df['quarter'] / 4)
    df['day_sin'] = np.sin(2 * np.pi * df['day'] / 31)  
    df['day_cos'] = np.cos(2 * np.pi * df['day'] / 31)
    df['group'] = (df['year'] - 2020) * 48 + df['month'] * 4 + df['day'] // 7
    df.drop('date', axis=1, inplace=True)
    df['cos_year'] = np.cos(df['year'] * (2 * np.pi) / 100)
    df['sin_year'] = np.sin(df['year'] * (2 * np.pi) / 100)
    # why using sin/cos? to tell model that after Dec we have Jan, of we dont do this it will
    # consider 1 to 12 and then 12 to 1 wont be considered. same applies on week day also
    # this is universal funnction whenever we have date.
    dummy_prefixes = ['country', 'store', 'product','month_name','day_of_week']
    df = pd.get_dummies(df, columns=dummy_prefixes, drop_first=True)

    return df

train_data = eda(train_data)
test_data = eda(test_data)
train_data = train_data.drop('id',axis=1)
test_data = test_data.drop('id',axis=1)
X = train_data.drop(['num_sold'], axis=1)
y = np.log1p(train_data['num_sold'])

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
xgb = XGBRegressor(random_state=42)
xgb.fit(X_train, y_train)
# 预测和评估
y_pred = xgb.predict(X_valid)
score = mape(np.expm1(y_valid), np.expm1(y_pred))
me=mse(np.expm1(y_valid), np.expm1(y_pred))
score,me

(0.04549245301889767, 3233.53332460358)

test_pred = xgb.predict(test_data)
submission = pd.DataFrame({'id': test['id'], 'num_sold': np.expm1(test_pred)})
print(submission.head())
submission.to_csv('submission_xgb_log_up.csv', index=False)
-idnum_sold
0230130144.742188
1230131869.433594
2230132759.550293
3230133403.059479
4230134456.856323

这个0.0454924提交后的成绩如下,成绩远不如中篇的不用模型进行比率计算的结果。
在这里插入图片描述

评测总结果

方案MAPE 得分处理方式
方案一0.3015年月日分离
方案二0.06777方案一基础上将标签对数处理
方案三0.06931方案二基础上增加GDP数据
方案四0.04567方案三基础上增加星期几内容
方案五0.04549重新特征工程

总结

一、对时间次序数据的特征处理,有以下

  1. 第一步一般采用年月日分离处理;
  2. 有些国家,特别在销售产品时,节假日(Holiday)也有一定的效果;
  3. GDP数据直接会影响结果,在方案三处理后的效果不明显;
  4. 在销量上周末的特征表现明显;
  5. 通上篇EDA,发现销量是有规律的周期性变化的,因此可以采用三角函数处理方式,效果明显;
    二、本文只对特征工程展开研究和讨论,并没有在建模和调参上做一些说明,因此在模型上还可以选择 RandomforestLightGBMcatboost等回归算法; 在调参方面可以用之前讲过的Optuna 来优化; 通过这些处理,成绩肯定会有所提升;
    三、详测结果只能代表在验证集的效果,最终的成绩只提交看到,因此,两者还存在着较大的差距的。
评论 1
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值