Kaggle线上零售 CRM分析(RFM+BG-NBD+生存分析+PySpark)

数据集地址:数据集地址
我的NoteBook地址:NoteBook地址
这个此在线零售数据集包含2009年12月1日至2011年12月9日期间的在线零售的所有交易。该公司主要销售独特的各种场合礼品。这家公司的许多客户都是批发商。本文将通过pyspark对数据进行导入与预处理,进行可视化分析并使用RFM、生存分析与BG-NBD模型进行对购买客户的各项分析。

1、数据集导入与清洗预处理

这一部分我们将数据集使用PySpark导入、转换为我们要的数据类型,并过滤掉非正常商品、负单子、处理缺失值。代码贴在这里单纯是为了更加熟悉pyspark的各个算子。

from pyspark import SparkContext
from pyspark.sql import functions as F, SparkSession, Column,types
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")
spark = SparkSession.builder.appName("CRMAnalysis").getOrCreate()
data = spark.read.csv("/kaggle/input/online-retail-ii-uci/online_retail_II.csv",header=True,sep=",")
## 数据类型 Convert Data 
data = data.withColumn("Quantity",data["Quantity"].cast("int"))
data = data.withColumn("Customer ID",data["Customer ID"].cast("int"))
data = data.withColumn("InvoiceDate",F.to_date(data["InvoiceDate"],"yyyy-MM-dd HH:mm:ss"))
data = data.withColumn("Price",data["Price"].cast("decimal"))
## 数据类型 Convert Data 
data = data.withColumn("Quantity",data["Quantity"].cast("int"))
data = data.withColumn("Customer ID",data["Customer ID"].cast("int"))
data = data.withColumn("InvoiceDate",F.to_date(data["InvoiceDate"],"yyyy-MM-dd HH:mm:ss"))
data = data.withColumn("Price",data["Price"].cast("decimal"))
## 缺失值 Deal Missing Values
data_null_agg = data.agg(*[F.count(F.when(F.isnull(c),c)).alias(c) for c in data.columns])
data_null_agg.show()
data = data.fillna({
   
   "Customer ID":99999})
## <0的条目数 C打头的Invoice删去 Quantity与Price为负数的删去 Filter out Invoices that begin with "C"(cancel) and Quantity/Price < 0
data_filtered = data.where(F.udf(lambda x:x[0].upper()!="C",types.BooleanType())(F.col("Invoice")))
data_filtered = data_filtered.filter("Quantity>0 and Price>0")
## 增加Total Price
data_filtered = data_filtered.withColumn("Total Price",F.col("Quantity")*F.col("Price"))

## 时间变量 Convert Date columns
data_filtered = data_filtered.withColumn("Year",F.year(data_filtered["InvoiceDate"]))
data_filtered = data_filtered.withColumn("Month",F.month(data_filtered["InvoiceDate"]))
data_filtered = data_filtered.withColumn("DayOfWeek",F.dayofweek(data_filtered["InvoiceDate"])-1)
data_filtered = data_filtered.withColumn("DayOfWeek",F.udf(lambda x:x if x!=0 else 7,types.IntegerType())(data_filtered["DayOfWeek"]))
## 为准备prepare DataFrames for EDA
data_monthly = data_filtered.groupby(["Year","Month"]).agg(F.sum(F.col("Total Price")).alias("Total Price")
                                                      ,F.sum(F.col("Quantity")).alias("Quantity")
                                                      ,F.countDistinct(F.col("Invoice")).alias("Nr_Of_Transaction")).toPandas()

data_day_of_Week = data_filtered.groupby(["DayOfWeek"]).agg(F.sum(F.col("Total Price")).alias("Total Price")
                                                      ,F.sum(F.col("Quantity")).alias("Quantity")
                                                      ,F.countDistinct(F.col("Invoice")).alias("Nr_Of_Transaction")).toPandas()

data_all_stock = data_filtered.groupby(["StockCode","Description","Year","Month"]).agg(F.sum(F.col("Total Price")).alias("Total Price")
                                                      ,F.sum(F.col("Quantity")).alias("Quantity")
                                                    ,F.countDistinct(F.col("Invoice")).alias("Nr_Of_Transaction")).toPandas()

data_all_customer = data_filtered.groupby(["Country","Customer ID","Year","Month"]).agg(F.sum(F.col("Total Price")).alias("Total Price")
                                                      ,F.sum(F.col("Quantity")).alias("Quantity")
                                                    ,F.countDistinct(F.col("Invoice")).alias("Nr_Of_Transaction")).toPandas()

2 探索性分析

首先让我们对总体的销售数据做一个分析

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
import plotly.graph_objects as go
data_monthly = data_filtered.groupby(["Year","Month"]).agg(F.sum(F.col("Total Price")).alias("Total Price")
                                                      ,F.sum(F.col("Quantity")).alias("Quantity")
                                                      ,F.countDistinct(F.col("Invoice")).alias("Nr_Of_Transaction")).toPandas()
data_monthly = data_monthly.sort_values(["Year","Month"]).reset_index(drop=True)
plt.figure(figsize=(20,10))
plt.plot(data_monthly["Total Price"])
plt.xticks(data_monthly.index,[str(data_monthly.loc[i,"Year"])+"\n"+str(data_monthly.loc[i,"Month"]) for i in data_monthly.index])
plt.title("Trend of All Sales")
plt.show()

在这里插入图片描述
从整体趋势来看,销售在每年的11月都会有一个明显的顶峰,到了12月又急速落回。

data_day_of_Week = data_day_of_Week.sort_values(["DayOfWeek"]).reset_index(drop=True)
go.Figure(go.Bar(x=data_day_of_Week["DayOfWeek"],y=data_day_of_Week["Total Price"]),layout={
   
   "title":"DayOfWeek Total Sales"})

在这里插入图片描述
周六没有任何的销售,周日的销售明显低于其他工作日的销售。

接下来将数据粒度放在每个客户上,看看客户的销售状况。

from collections import Counter
data_all_customer=data_all_customer[data_all_customer["Customer ID"]!=99999].reset_index(drop=True)
all_countrys = data_all_customer.groupby("Country")["Customer ID"].nunique().sort_values(ascending=False)
all_countrys["Others"] = sum(all_countrys[all_countrys.index[3:]])
go.Figure(go.Pie(labels=all_countrys.index[0:3].tolist()+["Others"],values=all_countrys[all_countrys.index[0:3].tolist()+["Others"]]),
         layout={
   
   
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值