Market Basket Analysis Using Association Rules in R

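This article describes how to use the arules package in R for association rule learning, covering data collection, data exploration and preparation, model training, evaluation, and improvement, and shows how to measure the usefulness of rules with support, confidence, and lift.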


These notes cover association rules in R, following the book Machine Learning with R.
- Basic idea:
Apriori property: all subsets of a frequent itemset must also be frequent. This property is used to reduce the association rule search space.
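The pruning effect of the Apriori property can be sketched in a few lines of base R. The transactions, the 0.5 threshold, and the helper name `support` below are made up for illustration; this is not how arules implements the search internally.

```r
# Toy transactions (hypothetical data, not the Groceries set)
transactions <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "eggs"),
  c("bread", "butter"),
  c("milk", "bread", "eggs")
)
min_support <- 0.5

# Fraction of transactions that contain every item in `itemset`
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level 1: keep only the single items that meet the threshold
items <- sort(unique(unlist(transactions)))
item_support <- vapply(items, support, numeric(1))
frequent_items <- items[item_support >= min_support]

# Level 2: build candidate pairs from frequent items only.
# A pair containing an infrequent item cannot be frequent,
# so it is never generated or counted.
candidates <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- vapply(candidates, support, numeric(1))
frequent_pairs <- candidates[pair_support >= min_support]

print(frequent_items)
print(frequent_pairs)
```

With four distinct items there are six possible pairs, but only one candidate pair is counted here because "butter" and "eggs" fall below the threshold at the first level.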

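Whether or not an association rule is deemed interesting is determined by two statistical measures: support and confidence.

Measuring rule interest: support and confidence
The support of an itemset or rule measures how frequently it occurs in the data:
  support(X) = count(X) / N
where N is the number of transactions and count(X) is the number of transactions containing itemset X.
A rule's confidence is a measurement of its predictive power or accuracy:
  confidence(X → Y) = support(X, Y) / support(X)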
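As a small worked example, support and confidence can be computed by hand in base R. The five transactions and the helper names `support` and `confidence` are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
# Toy transactions (hypothetical data, not the Groceries set)
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread", "butter"),
  c("milk", "eggs")
)

# support(X): share of transactions containing every item in X
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# confidence(X -> Y) = support(X, Y) / support(X)
confidence <- function(lhs, rhs) {
  support(c(lhs, rhs)) / support(lhs)
}

support(c("milk", "bread"))   # 2 of the 5 transactions
confidence("milk", "bread")   # 2 of the 4 transactions containing milk
confidence("bread", "milk")   # 2 of the 3 transactions containing bread
```

The last two calls give different values, which illustrates that confidence depends on the direction of the rule while support does not.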

- Practice:
Step 1 – collecting data
Install the "arules" R package and use the "Groceries" data set that ships with it.

Step 2 – exploring and preparing the data
a) Summarize the Groceries data:
library(arules)
library(Matrix)
data(Groceries)
summary(Groceries)
Outcome:
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146
most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055
element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14   14
  20   21   22   23   24   26   27   28   29   32
   9   11    4    6    1    1    1    1    3    1
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000
includes extended item information - examples:
       labels  level2           level1
1 frankfurter sausage meat and sausage
2     sausage sausage meat and sausage
3  liver loaf sausage meat and sausage
b) Inspect the first five transactions:
inspect(Groceries[1:5])
outcome:
items                                                               
[1] {citrus fruit,semi-finished bread,margarine,ready soups}            
[2] {tropical fruit,yogurt,coffee}                                      
[3] {whole milk}                                                        
[4] {pip fruit,yogurt,cream cheese ,meat spreads}                       
[5] {other vegetables,whole milk,condensed milk,long life bakery product}
c) Examine the support of particular items (here, the first three columns):
itemFrequency(Groceries[, 1:3])
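A short sketch of two other ways to use itemFrequency(), assuming the arules package is installed: sorting the proportions to find the most common items, and requesting absolute counts with type = "absolute". The variable names are illustrative only.

```r
library(arules)
data("Groceries")

# Relative frequency (support) of every item, highest first
item_support <- sort(itemFrequency(Groceries), decreasing = TRUE)
head(item_support, 5)

# Absolute purchase counts instead of proportions
item_counts <- itemFrequency(Groceries, type = "absolute")
item_counts["whole milk"]
```

The absolute count for whole milk should agree with the "most frequent items" line of summary(Groceries).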
d) Plot functions
1) Plot the items with support >= 0.10:
itemFrequencyPlot(Groceries, support = 0.1)

2) Plot the 20 most frequent items:
itemFrequencyPlot(Groceries, topN = 20)
e) Plot the sparse matrix:
image(Groceries[1:5])
image(sample(Groceries,100))

Step 3 – training a model on the data
a) Train the model:
Groceriesrules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.25, minlen = 2))
inspect(Groceriesrules[1:3])

b) Lower the minimum support to 0.006, keeping confidence at 0.25:
Groceriesrules <- apriori(Groceries, parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

See the rules:
Groceriesrules
Outcome:
set of 463 rules

Step 4 – evaluating model performance
a) Summarize the rules:
summary(Groceriesrules)

Lift measures how much more likely one item or itemset is to be purchased relative to its typical rate of purchase, given that another item or itemset has been purchased:
  lift(X → Y) = confidence(X → Y) / support(Y)
For example, if lift(milk → bread) is greater than one, the two items are found together more often than one would expect by chance.
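Lift can be checked by hand with a few lines of base R. The transactions and helper names below are hypothetical; the point is only to show a lift above one and a lift below one on the same data.

```r
# Toy transactions (hypothetical data, not the Groceries set)
transactions <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread", "butter"),
  c("milk", "eggs")
)

# Share of transactions containing every item in `itemset`
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# lift(X -> Y) = confidence(X -> Y) / support(Y)
lift <- function(lhs, rhs) {
  confidence <- support(c(lhs, rhs)) / support(lhs)
  confidence / support(rhs)
}

lift("bread", "butter")  # above 1: bought together more often than by chance
lift("milk", "bread")    # below 1: bought together less often than by chance
```

Unlike confidence, lift is symmetric: lift("butter", "bread") returns the same value as lift("bread", "butter").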
b) Inspect individual rules:
inspect(Groceriesrules[1:3])

The first rule can be read in plain language as, "if a customer buys potted plants, they will also buy whole milk." With support of 0.007 and confidence of 0.400, we can determine that this rule covers 0.7 percent of the transactions and is correct in 40 percent of purchases involving potted plants. The lift value tells us how much more likely a customer is to buy whole milk relative to the average customer, given that he or she bought a potted plant. Since we know that about 25.6 percent of the customers bought whole milk (support), while 40 percent of the customers buying a potted plant bought whole milk (confidence), we can compute the lift value as 0.40 / 0.256 = 1.56, which matches the value shown.

Step 5 – improving model performance
Sort the rules by different criteria:
a) Find the rules with high support, confidence, or lift:
inspect(sort(Groceriesrules, by = "lift")[1:5])

The first rule, with a lift of about 3.96, implies that people who buy herbs are nearly four times more likely to buy root vegetables than the typical customer.
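Sort order:
 sort() returns rules in descending order by default (largest values first). To reverse the order, add the parameter decreasing = FALSE.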
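Both sort directions can be sketched as follows, assuming the arules package is installed. The object names `rules`, `top_lift`, and `low_conf` are illustrative, and `control = list(verbose = FALSE)` is added only to keep the console quiet.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# Default order: highest lift first
top_lift <- sort(rules, by = "lift")

# Reversed order: lowest confidence first
low_conf <- sort(rules, by = "confidence", decreasing = FALSE)

inspect(top_lift[1:5])
inspect(low_conf[1:5])
```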
b) Taking subsets of association rules
Investigate whether berries are often purchased with other items:
berryrules <- subset(Groceriesrules, items %in% "berries")
inspect(berryrules)
  • To match "berries" only on the left-hand or right-hand side of a rule, replace items with lhs or rhs.
  • To select rules that contain either "berries" or "yogurt", use items %in% c("berries", "yogurt").
  • Additional operators are available for partial matching (%pin%), which matches on part of an item label (for example, items %pin% "fruit" matches both "citrus fruit" and "tropical fruit"), and complete matching (%ain%), which requires all of the listed items to be present.
  • Subsets can also be limited by support, confidence, or lift; for example, confidence > 0.50 keeps only the rules with confidence greater than 50 percent.
  • Matching criteria can be combined with the standard R logical operators such as and (&), or (|), and not (!).
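The matching options listed above can be sketched as follows, assuming the arules package is installed. The object names and the 0.30 confidence cutoff are illustrative choices, not values from the book.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# "berries" on the left-hand side only
berry_lhs <- subset(rules, lhs %in% "berries")

# Rules that mention "berries" or "yogurt" anywhere
berry_or_yogurt <- subset(rules, items %in% c("berries", "yogurt"))

# Partial match: any item whose label contains "fruit"
fruit_rules <- subset(rules, items %pin% "fruit")

# Complete match: both items must appear in the rule
berry_and_yogurt <- subset(rules, items %ain% c("berries", "yogurt"))

# Item matching combined with a quality threshold
strong_berry <- subset(rules, items %in% "berries" & confidence > 0.30)

inspect(strong_berry)
```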
Saving association rules to a file or data frame
a) Save rules as .csv format
setwd("C:\\myprojectR\\Groceries")
write(Groceriesrules, file = "Groceriesrules.csv",
       sep = ",", quote = TRUE, row.names = FALSE)
b) Save rules as an R data frame
groceryrules_df <- as(Groceriesrules, "data.frame")
Inspect the result:
str(groceryrules_df)
You might choose to do this if you want to perform additional processing on the rules or need to export them to another database.
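One possible form of such additional processing, assuming the arules package is installed: once the rules are in a data frame, ordinary base R filtering and ordering apply. The lift cutoff of 2 and the names `rules_df` and `strong` are arbitrary choices for illustration.

```r
library(arules)
data("Groceries")

rules <- apriori(Groceries,
                 parameter = list(support = 0.006, confidence = 0.25, minlen = 2),
                 control = list(verbose = FALSE))

# One row per rule; the quality measures become numeric columns
rules_df <- as(rules, "data.frame")

# Keep rules with lift above 2 and order them from highest to lowest lift
strong <- rules_df[rules_df$lift > 2, ]
strong <- strong[order(-strong$lift), ]

head(strong[, c("rules", "support", "confidence", "lift")])
```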

