本分析数据集采用Kaggle的E-commerce data。
数据分析思路:Kaggle-Customer Segmentation
数据导入详见:数据分析初阶——数据导入
数据初步处理详见:数据分析初阶——数据初步处理
1. 各变量初步探索
在进行数据分析前,需要先对各个字段做简单的了解。
InvoiceNo:订单号。如果在单号前有加C,意思是这个订单是一个取消订单。
StockCode:商品代码。如果是同一商品则是唯一的。
Description:商品描述。
Quantity:订单商品数量。
InvoiceDate:订单日期。
UnitPrice:商品单位价格。
CustomerID:客户ID。
Country:客户所在国家。
1.1 国家
SELECT Country,count(*) FROM ecom.data
GROUP BY Country
对Country计数,结果如下:
表1 订单国籍分布计数(部分)
Country | United Kingdom | Germany | France | EIRE | Spain | Netherlands | Belgium | Switzerland | ... |
Count | 361878 | 9495 | 8491 | 7485 | 2533 | 2371 | 2069 | 1877 | ... |
Percentage | 88.95% | 2.33% | 2.09% | 1.84% | 0.62% | 0.58% | 0.51% | 0.46% | ... |
由上可知,英国的订单占比最高,其次是德国。
1.2 顾客和产品
根据字段的描述,我们可以通过计数来计算产品、顾客和总订单数。
SELECT
count(DISTINCT StockCode) AS products ,
count(DISTINCT InvoiceNo) AS transactions ,
count(DISTINCT CustomerID) AS customers
FROM ecom.data
products | transactions | customers | |
count | 3684 | 22190 | 4372 |
由上可知,本数据集包括产品3684项,顾客4372名,一共22190个订单。
接下来我们以顾客和订单号进行分组并排序。
SELECT
CustomerID,
InvoiceNo,
count(InvoiceDate) AS 'Number of products'
FROM ecom.data
GROUP BY CustomerID , InvoiceNo
ORDER BY CustomerID
我们能得出以下结果。
CustomerID | InvoiceNo | Number of products |
12346 | 541431 | 1 |
12346 | C541433 | 1 |
12347 | 537626 | 31 |
12347 | 542237 | 29 |
12347 | 549222 | 24 |
... | ... | ... |
从上面我们能看出:
(1)有顾客是只买一次商品的;
(2)其他顾客一般都买超过10次的商品。
1.2.1 关于取消的订单
我们先计算出取消订单的基本情况。
SELECT
CustomerID,
InvoiceNo,
count(InvoiceDate) AS 'Number of products' ,
CASE WHEN InvoiceNo LIKE 'C%'
THEN count(InvoiceDate)
ELSE 0
END AS 'Cancel Number of products'
FROM ecom.data
GROUP BY CustomerID , InvoiceNo
ORDER BY CustomerID
结果如下:
CustomerID | InvoiceNo | Number of products | Cancel Number of products |
12346 | 541431 | 1 | 0 |
12346 | C541433 | 1 | 1 |
12347 | 537626 | 31 | 0 |
12347 | 542237 | 29 | 0 |
12347 | 549222 | 24 | 0 |
12352 | C547388 | 7 | 7 |
... | ... | ... | ... |
然后我们对其进行汇总计算:
SELECT
count(CNOP >= 1 OR null) AS 'Count Cancel Number of Invoice',
count(NOP) AS 'Count Number of Invoice',
(count(CNOP >= 1 OR null)/count(NOP))*100 AS Percentage
FROM(
SELECT
CustomerID,
InvoiceNo,
count(InvoiceDate) AS 'NOP' ,
CASE WHEN InvoiceNo LIKE 'C%'
THEN count(InvoiceDate)
ELSE 0
END AS 'CNOP'
FROM ecom.data
GROUP BY CustomerID , InvoiceNo
ORDER BY CustomerID) t
可得:
Count Number of Invoice | Count Cancel Number of Invoice | Percentage |
3654 | 22190 | 16.4669 |
由此可知,大约有16.5%的订单是被取消掉的。
关于如何使用有条件的COUNT
如果要使用有条件的count函数,我们一般采取以下三种方法:(原理详见:Mysql中使用count加条件统计 ——AlbertS)
select count(字段 条件 or null) from 表格;
select
count(if(字段 条件, 1, null))
from 表格;
select
count(case when 字段 条件
then 1
end)
from 表格