Kaggle_Predict Future Sales_Prac 1 (Forecasting Product Sales with Time Series)

Article link: https://blog.youkuaiyun.com/weixin_44216391/article/details/89678031

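Competition goal: use a time series model to predict, for the coming month, the total sales (units sold) of every product in every shop of a Russian retail group.

So what if we can forecast sales? It is very useful: for example, keeping healthy stock levels to improve inventory turnover, spotting abnormal sales so they can be addressed, budgeting costs, and designing marketing strategies suited to the time and place.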

# Dataset size: the largest data file (sales_train.csv) contains about 2.94 million rows.
# Competition link: https://www.kaggle.com/c/competitive-data-science-predict-future-sales

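# This attempt:
# First I followed a kernel shared by a Kaggle expert: https://www.kaggle.com/jagangupta/time-series-basics-exploring-traditional-ts
# Then I could not get Facebook's open-source tool Prophet to install, and I was not yet familiar with time series models, so I switched to Chapter 18, "Time Series Modelling", of the book 《Python数据科学·技术详解与商业实践》.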

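# So the article is structured as follows:
# 1. A first exploration of the Kaggle product sales data.
# 2. A review of time series models from 《Python数据科学》, with definitions and worked examples.
# 3. Back to forecasting the Kaggle sales data, applying what was learned in part 2 to the final prediction.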

# Basic packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import random as rd # generating random numbers
import datetime # manipulating date formats
# Viz
import matplotlib.pyplot as plt # basic plotting
import seaborn as sns # for prettier plots


# TIME SERIES
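# statsmodels 0.13 removed statsmodels.tsa.arima_model; the replacement class lives in
# statsmodels.tsa.arima.model (the two classes are not fully API-compatible).
try:
    from statsmodels.tsa.arima.model import ARIMA  # statsmodels >= 0.12
except ImportError:
    from statsmodels.tsa.arima_model import ARIMA  # older releases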
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.stattools import adfuller, acf, pacf,arma_order_select_ic
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs


# settings
import warnings
warnings.filterwarnings("ignore")

# Import all of them 
sales=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Predict Future Sales - data science/data-sources/sales_train.csv")

item_cat=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Predict Future Sales - data science/data-sources/item_categories.csv")
item=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Predict Future Sales - data science/data-sources/items.csv")
shops=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Predict Future Sales - data science/data-sources/shops.csv")
test=pd.read_csv("D:/2018_BigData/Python/Kaggle_learning/Predict Future Sales - data science/data-sources/test.csv")

# Source data file descriptions

# sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
# test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
# sample_submission.csv - a sample submission file in the correct format.
# items.csv - supplemental information about the items/products.
# item_categories.csv  - supplemental information about the items categories.
# shops.csv- supplemental information about the shops.


# Data field descriptions

# ID - an Id that represents a (Shop, Item) tuple within the test set
# shop_id - unique identifier of a shop
# item_id - unique identifier of a product
# item_category_id - unique identifier of item category
# item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
# item_price - current price of an item
# date - date in format dd/mm/yyyy
# date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
# item_name - name of item
# shop_name - name of shop
# item_category_name - name of item category
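
The field list implies how the tables relate: the sales records and items.csv share item_id, and items.csv carries the item_category_id of each item. A minimal sketch of attaching the category to each sales row, using invented toy rows rather than the competition files (the names toy_sales, toy_items and toy_merged are made up for this illustration):

```python
import pandas as pd

# Toy stand-ins for sales_train.csv and items.csv (all values are invented).
toy_sales = pd.DataFrame({
    "shop_id": [59, 25, 25],
    "item_id": [1, 2, 2],
    "item_cnt_day": [1.0, 1.0, -1.0],
})
toy_items = pd.DataFrame({
    "item_id": [1, 2],
    "item_category_id": [40, 76],
})

# A left join keeps every sales row and adds the category of its item.
toy_merged = toy_sales.merge(toy_items, on="item_id", how="left")
print(toy_merged)
```

A left join is used so that no sales row is dropped even if an item were missing from the lookup table.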

sales.head()

	date	shop_id	item_id	item_price	item_cnt_day
0	02.01.2013	59	22154	999.00	1.0
1	03.01.2013	25	2552	899.00	1.0
2	05.01.2013	25	2552	899.00	-1.0
3	06.01.2013	25	2554	1709.05	1.0
4	15.01.2013	25	2555	1099.00	1.0

item_cat.head()

	item_category_name	item_category_id
0	PC - Гарнитуры/Наушники	0
1	Аксессуары - PS2	1
2	Аксессуары - PS3	2
3	Аксессуары - PS4	3
4	Аксессуары - PSP	4

item.head()

	item_name	item_id	item_category_id
0	! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D	0	40
1	!ABBYY FineReader 12 Professional Edition Full...	1	76
2	***В ЛУЧАХ СЛАВЫ (UNV) D	2	40
3	***ГОЛУБАЯ ВОЛНА (Univ) D	3	40
4	***КОРОБКА (СТЕКЛО) D	4	40

shops.head(3)

	shop_name	shop_id
0	!Якутск Орджоникидзе, 56 фран	0
1	!Якутск ТЦ "Центральный" фран	1
2	Адыгея ТЦ "Мега"	2

test.head(3)

	ID	shop_id	item_id
0	0	5	5037
1	1	5	5320
2	2	5	5233

#formatting the date column correctly
sales.date=sales.date.apply(lambda x:datetime.datetime.strptime(x, '%d.%m.%Y'))
# check
print(sales.info())
print(sales.head(2))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2935849 entries, 0 to 2935848
Data columns (total 6 columns):
date              datetime64[ns]
date_block_num    int64
shop_id           int64
item_id           int64
item_price        float64
item_cnt_day      float64
dtypes: datetime64[ns](1), float64(2), int64(3)
memory usage: 134.4 MB
None
        date  date_block_num  shop_id  item_id  item_price  item_cnt_day
0 2013-01-02               0       59    22154       999.0           1.0
1 2013-01-03               0       25     2552       899.0           1.0
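
The same conversion can also be done without a per-row lambda. A small sketch on invented date strings, assuming the same dd.mm.yyyy layout as the date column (toy_dates and parsed are illustrative names):

```python
import pandas as pd

# Invented strings in the same day.month.year layout as the date column.
toy_dates = pd.Series(["02.01.2013", "03.01.2013", "15.01.2013"])

# Passing the format explicitly makes the day-first reading unambiguous.
parsed = pd.to_datetime(toy_dates, format="%d.%m.%Y")
print(parsed.dt.month.tolist())  # [1, 1, 1]
```

Stating the format matters here because a string such as 02.01.2013 is ambiguous between 2 January and 1 February when no format is given.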

# Aggregate to monthly level the required metrics

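# Select several columns with a list (double brackets); the bare-tuple form fails on recent pandas.
monthly_sales=sales.groupby(["date_block_num","shop_id","item_id"])[
    ["date","item_price","item_cnt_day"]].agg({
   "date":["min",'max'],"item_price":"mean","item_cnt_day":"sum"})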

## Let's break down the line of code here:
# aggregate by date-block(month),shop_id and item_id
# select the columns date,item_price and item_cnt(sales)
# Provide a dictionary which says what aggregation to perform on which column
# min and max on the date
# average of the item_price
# sum of the sales

monthly_sales.head(8)

			date		item_price	item_cnt_day
			min	max	mean	sum
date_block_num	shop_id	item_id
0	0	32	2013-01-03	2013-01-31	221.0	6.0
		33	2013-01-03	2013-01-28	347.0	3.0
		35	2013-01-31	2013-01-31	247.0	1.0
		43	2013-01-31	2013-01-31	221.0	1.0
		51	2013-01-13	2013-01-31	128.5	2.0
		61	2013-01-10	2013-01-10	195.0	1.0
		75	2013-01-17	2013-01-17	76.0	1.0
		88	2013-01-16	2013-01-16	76.0	1.0
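
For the modelling target only the monthly count is needed, and a flat column layout is often easier to work with than the two-level header shown in this output. A sketch using pandas named aggregation on invented rows (toy, toy_monthly, avg_price and item_cnt_month are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Invented daily rows: two sales of item 32 and one of item 33 in month 0,
# plus one sale of item 32 in month 1.
toy = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [0, 0, 0, 0],
    "item_id": [32, 32, 33, 32],
    "item_price": [221.0, 221.0, 347.0, 200.0],
    "item_cnt_day": [4.0, 2.0, 3.0, 1.0],
})

# Named aggregation: output_column=(input_column, function).
toy_monthly = (
    toy.groupby(["date_block_num", "shop_id", "item_id"])
       .agg(avg_price=("item_price", "mean"),
            item_cnt_month=("item_cnt_day", "sum"))
       .reset_index()
)
print(toy_monthly)
```

Each output row is one (month, shop, item) combination, which matches the granularity the competition asks for.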

# number of items per cat 
x=item.groupby(['item_category_id']).count()
x=x.sort_values(by='item_id',ascending=False)
x=x.iloc[0:10].reset_index()
x

	item_category_id	item_name	item_id
0	40	5035	5035
1	55	2365	2365
2	37	1780	1780
3	31	1125	1125
4	58	790	790
5	30	756	756
6	72	666	666
7	19	628	628
8	61	598	598
9	23	501	501

# plot
plt.figure(figsize=(8,4))
ax= sns.barplot(x=x.item_category_id, y=x.item_id, alpha=0.8)  # recent seaborn requires x= and y= as keywords
plt.title("Items per Category")
plt.ylabel('# of items', fontsize=12)
plt.xlabel('Category', fontsize=12)
plt.show()

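# Why keep only the top ten in x above? Ten values suit a bar chart, whereas all 84 categories would be too many to plot.
# seaborn's barplot looks noticeably better than matplotlib's bar chart.
# Another observation: seaborn's barplot draws the bars in ascending item_category_id order, presumably its default behaviour.
# Back to the analysis: this chart counts items per category, not sales. Category 40 contains the most items, 5035, followed by category 55 with 2365 and category 37 with 1780. Category 40 is far ahead of the rest.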
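
The count, sort and slice steps used to build the top ten can be collapsed into one call. A sketch on invented rows (toy_item and top2 are illustrative names):

```python
import pandas as pd

# Invented items table: category 40 has 4 items, 55 has 3, 37 has 1.
toy_item = pd.DataFrame({
    "item_id": range(8),
    "item_category_id": [40, 40, 40, 55, 55, 37, 40, 55],
})

# value_counts() counts rows per category and sorts in descending order,
# so head(n) gives the n largest categories directly.
top2 = toy_item["item_category_id"].value_counts().head(2)
print(top2)
```

If the bars should follow this descending order rather than ascending category id, barplot's `order` argument is the place to look; I have not checked how it behaves across seaborn versions, so treat that as a pointer rather than a recipe.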


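# The original Kaggle kernel notes that our goal, predicting next month's sales of every item in every shop, is a time series forecasting problem.
# So let's first practise on a simpler time series task: forecasting next month's total sales across all items and all shops.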

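ts=sales.groupby(["date_block_num"])["item_cnt_day"].sum()  # total sales (units sold) per month
ts=ts.astype('float') # astype returns a new Series, so assign it back; floats are convenient for the later calculations
plt.figure(figsize=(16,8))  # draw the figure
plt.title('Total Sales of the company')
plt.xlabel('Time')
plt.ylabel('Sales')
plt.plot(ts);  # plt.plot draws the line chart. Replacing this statement with plt.show() alone gives a blank canvas with only axes and a title, because nothing has been plotted.
# The trailing semicolon ";" is not a substitute for plt.show(): in a Jupyter notebook the inline backend renders the figure automatically, and the semicolon only suppresses the text echo of the last expression. In a plain script plt.show() is still needed.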

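# Back to the analysis: over time, total sales show two peaks, at the 12th and 24th months, exactly one year apart. Could this be a regular pattern? It most likely is a yearly cycle.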
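
To see which calendar months those peaks fall in, date_block_num can be mapped back to real months. A small sketch that relies only on the field description (0 is January 2013, 33 is October 2015); block_to_month is an illustrative name:

```python
import pandas as pd

# 34 consecutive months starting at January 2013, one per date_block_num 0..33.
months = pd.period_range(start="2013-01", periods=34, freq="M")
block_to_month = pd.Series(months, index=range(34))

# The 12th and 24th months are date_block_num 11 and 23.
print(block_to_month[11], block_to_month[23])  # 2013-12 2014-12
```

Under this mapping the 12th and 24th months are both Decembers.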


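# Look at how the rolling statistics change, using a 12-month window.

# Time series analysis in Python: the rolling function in pandas computes statistics over a moving time window. More detailed usage can be found by searching online.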

plt.figure(figsize=(16,6))
plt.plot(ts.rolling(window=12,center=False).mean(),label='Rolling Mean');
plt.plot(ts.rolling(window=12,center=False).std(),label='Rolling sd');
plt.legend();   # legend() displays the labels; without it the 'Rolling Mean' and 'Rolling sd' labels are not shown.

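# The 12-month rolling mean of total sales clearly declines month after month. Is the economy worsening month by month, or have the company's strategy and market acceptance changed for the worse?
# A data analyst who finds such an alarming trend needs to dig further into the cause, and ideally find countermeasures too.

# As for the rolling standard deviation, it faintly shows a 12-month cycle, rising slowly within each cycle, while across cycles the overall trend is downward.
# One possible reading: the decline in sales is small early in the year, accelerates through mid-year and towards year-end, and only steadies again early the following year.

# A tentative guess:
# (1) If the company's operations and product selection are fine, then overall market demand has changed and is falling year by year. In addition, demand for these goods is relatively high around the turn of the year and lower in the middle of the year.
# (2) If overall market demand is stable with no downward trend, then the problem lies in operations, product selection, or marketing, and needs thorough investigation.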
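
When reading the rolling curves it helps to remember what a 12-month window does: every point averages one full year, so a yearly pattern largely cancels out and mostly the trend remains, and the first 11 positions have no value at all. A sketch on an invented series (toy_ts and roll_mean are illustrative names):

```python
import numpy as np
import pandas as pd

# 24 invented monthly totals: 1, 2, ..., 24.
toy_ts = pd.Series(np.arange(1.0, 25.0))

roll_mean = toy_ts.rolling(window=12).mean()

# The window needs 12 observations, so the first 11 results are NaN.
print(int(roll_mean.isna().sum()))  # 11
# The first defined value is the mean of 1..12.
print(roll_mean.iloc[11])  # 6.5
```

With the 34 months available in this data set, that leaves 23 usable rolling values, so any cycle seen in the rolling standard deviation rests on fewer than two full years of points.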


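# Next, decompose the series further into long-term trend, seasonality, and random residuals.

# A quick aside: "decomposition" as a step in handling stationarity
# Decomposition separates time series data into distinct components. The seasonal_decompose function in statsmodels uses a classical moving-average decomposition, which splits the series into a long-term trend, a seasonal component, and a random component.
# Like other statistical software, statsmodels supports two kinds of decomposition model: additive (series = trend + seasonal + residual) and multiplicative (series = trend * seasonal * residual). Set the model parameter to "additive" or "multiplicative" to choose between them.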
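import statsmodels.api as sm  # statistical modelling module
# multiplicative
res = sm.tsa.seasonal_decompose(ts.values,period=12,model="multiplicative")
# A note on .tsa.seasonal_decompose(): in my trials, passing ts.values labels the x-axis "Time", while passing ts labels it date_block_num. Nothing else changes.
# period is the number of observations per seasonal cycle, here 12 months. (statsmodels before 0.11 called this argument freq.)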

# plt.figure(figsize=(16,12))
fig = res.plot()
# fig.show()  # optional; the result is the same with or without it.

# With the components separated, time series models can next be fitted to each of them.


res = sm.tsa.seasonal_decompose(ts.values,freq=12,model