- Affinity analysis:
Affinity analysis determines how closely related samples are, based on how similar they are to one another.
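- Typical applications:
  - offering website users a wider range of services, or serving them targeted ads;
  - recommending movies or products to users and selling them related items;
  - finding people who are related by blood from their genes…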
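Scenario for this example:
When a customer buys a product, the shop has a chance to learn what else they would like to buy, so that products most customers are willing to buy together can be sold side by side to raise sales. Once the shop has collected enough data, it can run an affinity analysis on it to decide which products are worth placing together.
Here we only consider the case where two products are bought together (rules involving more products are more complex).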
- We work with rules of the following form:
If a person buys product A, they are likely to also buy product B.
Let's start the analysis!
1. Load the data
import numpy as np
data = "affinity_dataset.txt"
X = np.loadtxt(data)
print(X[:5])  # show the first five rows
Output:
[[0. 1. 0. 0. 0.]
[1. 1. 0. 0. 0.]
[0. 0. 1. 0. 1.]
[1. 1. 0. 0. 0.]
[0. 0. 1. 1. 1.]]
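The first row, [0, 1, 0, 0, 0], lists the products in the first transaction. Each column stands for one product; in this example the five products are bread, milk, cheese, apples, and bananas. A 1 means the customer bought at least one unit of that product, and a 0 means they bought none. So [0, 1, 0, 0, 0] is a customer who bought only milk.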
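To make the encoding concrete, here is a minimal sketch of how a list of shopping baskets could be turned into such a 0/1 matrix. The baskets and the helper name encode_baskets are invented for illustration; they are not part of affinity_dataset.txt or of the original code.

```python
import numpy as np

# Column order follows the post: bread, milk, cheese, apples, bananas.
features = ["bread", "milk", "cheese", "apples", "bananas"]

# Hypothetical baskets, made up purely for illustration.
baskets = [
    ["milk"],
    ["bread", "milk"],
    ["cheese", "bananas"],
]

def encode_baskets(baskets, features):
    """Turn a list of baskets into a 0/1 matrix with one column per product."""
    column = {name: i for i, name in enumerate(features)}
    X_demo = np.zeros((len(baskets), len(features)))
    for row, basket in enumerate(baskets):
        for item in basket:
            X_demo[row, column[item]] = 1
    return X_demo

X_demo = encode_baskets(baskets, features)
print(X_demo)
```

The first row of X_demo comes out as [0, 1, 0, 0, 0], the "milk only" transaction described above.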
n_samples, n_features = X.shape
print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]
Output:
This dataset has 100 samples and 5 features
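2. Judging rules by support and confidence
A blunt way to implement our rule is to find every pair of products that were bought together in the dataset. Once the rules are found, we still need to judge how good they are. The two usual measures are support and confidence.
Support: the number of times the rule holds in the dataset. (It can also be expressed as a proportion by dividing by the total number of transactions; this post uses the raw count.)
Confidence: how accurate the rule is, that is, among the transactions where the premise holds (the customer bought A), the fraction in which the conclusion also holds (they bought B too).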
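The two definitions can be checked by hand on a tiny example. The sketch below uses five made-up transactions with only two columns (apples, bananas); the function name support_and_confidence is hypothetical and only illustrates the definitions.

```python
# Five hypothetical transactions; columns are (apples, bananas).
transactions = [
    (1, 1),
    (1, 0),
    (0, 1),
    (1, 1),
    (0, 0),
]

def support_and_confidence(transactions, premise, conclusion):
    """Return (support as a raw count, confidence) for the rule premise -> conclusion."""
    premise_count = 0  # transactions where the premise holds
    both_count = 0     # transactions where premise and conclusion both hold
    for t in transactions:
        if t[premise] == 1:
            premise_count += 1
            if t[conclusion] == 1:
                both_count += 1
    # Guard against a premise that never occurs.
    confidence = both_count / premise_count if premise_count else 0.0
    return both_count, confidence

support_demo, confidence_demo = support_and_confidence(transactions, 0, 1)
print(support_demo, round(confidence_demo, 3))  # 2 0.667
```

Three of the five customers bought apples and two of those also bought bananas, so the support is 2 and the confidence is 2/3.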
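- Below we take "if a customer buys apples, they will also buy bananas" as the rule and compute its support and confidence.
Let sample stand for one transaction and iterate over X. For each transaction we check whether the apples value is 1, that is, whether sample[3] is 1.
** Count the people who bought apples:
num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        num_apple_purchases += 1
print("{0} people bought Apples".format(num_apple_purchases))
Output: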
43 people bought Apples
** Count the people who bought both apples and bananas (the cases where the rule holds)
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print("{0} cases of the rule are valid".format(rule_valid))
print("{0} cases of the rule are invalid".format(rule_invalid))
Output:
27 cases of the rule are valid
16 cases of the rule are invalid
27 people bought both apples and bananas.
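The same counts can also be obtained without an explicit loop by using NumPy boolean masks. This is only a sketch of an alternative, run on a small made-up matrix rather than on the real dataset; the names ending in _demo are invented so they do not clash with the variables above.

```python
import numpy as np

# Hypothetical 0/1 matrix; columns: bread, milk, cheese, apples, bananas.
X_demo = np.array([
    [0, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
])

bought_apples = X_demo[:, 3] == 1   # one True/False per transaction
bought_bananas = X_demo[:, 4] == 1

num_apple_purchases_demo = int(bought_apples.sum())
rule_valid_demo = int((bought_apples & bought_bananas).sum())
rule_invalid_demo = int((bought_apples & ~bought_bananas).sum())
print(num_apple_purchases_demo, rule_valid_demo, rule_invalid_demo)  # 3 2 1
```

As in the loop version, the valid and invalid counts add up to the number of apple purchases.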
** Compute the support and confidence
support = rule_valid
confidence = rule_valid / num_apple_purchases
print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
print("As a percentage, that is {0:.1f}%.".format(100 * confidence))
Output:
The support is 27 and the confidence is 0.628.
As a percentage, that is 62.8%.
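- Collecting the statistics for every rule in the dataset
Rule:
If a customer buys A, they will also buy B.
** Create three dictionaries (cases where the rule holds, cases where it does not, and the number of times each premise occurs) to store the results.
defaultdict: if a key does not exist, a default value is returned. With a plain dict, looking up a missing key raises a KeyError.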
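The difference is easy to see in a short sketch. The keys "apples" and "bananas" below are arbitrary illustration values.

```python
from collections import defaultdict

# A plain dict raises KeyError when we try to increment a missing key.
plain = {}
try:
    plain["apples"] += 1
except KeyError:
    missing_key_raised = True
else:
    missing_key_raised = False

# defaultdict(int) creates missing keys on first access with the value int() == 0.
counts = defaultdict(int)
counts["apples"] += 1
counts["apples"] += 1
print(missing_key_raised, counts["apples"], counts["bananas"])  # True 2 0
```

Note that merely reading counts["bananas"] inserts that key with the value 0, which is why defaultdict works well for counters.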
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
for sample in X:
    for A in range(n_features):
        if sample[A] == 0:
            continue
        num_occurences[A] += 1  # the premise holds (value is 1), so count one more occurrence of it
        for B in range(n_features):
            if A == B:
                continue  # skip the case A == B
            if sample[B] == 1:
                valid_rules[(A, B)] += 1
            else:
                invalid_rules[(A, B)] += 1
** Compute the support and confidence of every rule
support = valid_rules  # support (raw counts)
confidence = defaultdict(float)  # confidence
for A, B in valid_rules.keys():
    confidence[(A, B)] = valid_rules[(A, B)] / num_occurences[A]
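For a 0/1 matrix, the same pair counts can be obtained in one step from a co-occurrence matrix. The following is a sketch of that alternative on a small made-up matrix, not the approach used in this post; it assumes every product is bought at least once, so no division by zero occurs.

```python
import numpy as np

# Hypothetical 0/1 matrix; columns: bread, milk, cheese, apples, bananas.
X_demo = np.array([
    [0, 1, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
])

# co_counts[A, B] = number of transactions containing both A and B.
co_counts = X_demo.T @ X_demo
# The diagonal holds how often each product was bought on its own terms.
item_counts = np.diag(co_counts)

# confidence_matrix[A, B] = count(A and B) / count(A).
# Assumes every item_counts entry is non-zero (true for this demo data).
confidence_matrix = co_counts / item_counts[:, None]

print(co_counts[3, 4], round(confidence_matrix[3, 4], 3))  # 2 0.667
```

The diagonal entries of confidence_matrix are always 1 and correspond to the A == B case that the loop version skips.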
Print every rule with its support and confidence:
for A, B in confidence:
    A_name = features[A]
    B_name = features[B]
    print("Rule: If a person buys {0} they will also buy {1}".format(A_name, B_name))
    print(" - Confidence: {0:.3f}".format(confidence[(A, B)]))
    print(" - Support: {0}".format(support[(A, B)]))
    print("")
Output:
Rule: If a person buys bread they will also buy milk
- Confidence: 0.464
- Support: 13
Rule: If a person buys milk they will also buy bread
- Confidence: 0.250
- Support: 13
Rule: If a person buys cheese they will also buy bananas
- Confidence: 0.513
- Support: 20
Rule: If a person buys bananas they will also buy cheese
- Confidence: 0.351
- Support: 20
Rule: If a person buys cheese they will also buy apples
- Confidence: 0.564
- Support: 22
Rule: If a person buys apples they will also buy cheese
- Confidence: 0.512
- Support: 22
Rule: If a person buys apples they will also buy bananas
- Confidence: 0.628
- Support: 27
Rule: If a person buys bananas they will also buy apples
- Confidence: 0.474
- Support: 27
Rule: If a person buys milk they will also buy apples
- Confidence: 0.346
- Support: 18
Rule: If a person buys apples they will also buy milk
- Confidence: 0.419
- Support: 18
Rule: If a person buys milk they will also buy bananas
- Confidence: 0.519
- Support: 27
Rule: If a person buys bananas they will also buy milk
- Confidence: 0.474
- Support: 27
Rule: If a person buys bread they will also buy cheese
- Confidence: 0.179
- Support: 5
Rule: If a person buys cheese they will also buy bread
- Confidence: 0.128
- Support: 5
Rule: If a person buys bread they will also buy bananas
- Confidence: 0.571
- Support: 16
Rule: If a person buys bananas they will also buy bread
- Confidence: 0.281
- Support: 16
Rule: If a person buys milk they will also buy cheese
- Confidence: 0.212
- Support: 11
Rule: If a person buys cheese they will also buy milk
- Confidence: 0.282
- Support: 11
Rule: If a person buys bread they will also buy apples
- Confidence: 0.321
- Support: 9
Rule: If a person buys apples they will also buy bread
- Confidence: 0.209
- Support: 9
We can also wrap this in a function:
def print_rule(A, B, support, confidence, features):
    A_name = features[A]
    B_name = features[B]
    print("Rule: If a person buy {0} they will also buy {1}".format(A_name, B_name))
    print(" - Support: {0}".format(support[(A,B)]))
    # Note: the value printed under the label "Conclusion" is the rule's confidence.
    print(" - Conclusion: {0:.3f}".format(confidence[(A,B)]))
A = 1
B = 3
print_rule(A, B, support, confidence, features)
Output:
Rule: If a person buy milk they will also buy apples
- Support: 18
- Conclusion: 0.346
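- Sorting
- Sorting by support
The operator.itemgetter function
The itemgetter function from the operator module picks out items of an object by position; its arguments are the indices of the items to fetch.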
Example:
>>> import operator
>>> a = [1, 2, 3]
>>> b = operator.itemgetter(1)  # define a function b that returns item 1 of its argument
>>> b(a)
2
>>> b = operator.itemgetter(1, 0)  # define a function b that returns items 1 and 0
>>> b(a)
(2, 1)
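Note: operator.itemgetter does not return the values themselves. It returns a function, and the values are only obtained when that function is applied to an object.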
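Because itemgetter returns a function, it can be passed directly as the key argument of sorted. Here is a minimal sketch with made-up (key, value) pairs, shaped like the entries of a dictionary; the lambda version is shown only for comparison.

```python
from operator import itemgetter

# Hypothetical (rule, count) pairs, made up for illustration.
pairs = [((0, 1), 13), ((3, 4), 27), ((2, 3), 22)]

# itemgetter(1) picks the second element of each pair, i.e. the count.
by_count = sorted(pairs, key=itemgetter(1), reverse=True)

# An equivalent key written as a lambda.
by_count_lambda = sorted(pairs, key=lambda pair: pair[1], reverse=True)

print(by_count)  # [((3, 4), 27), ((2, 3), 22), ((0, 1), 13)]
```

Both calls give the same order; itemgetter is simply the more compact way to say "sort by element 1".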
A dictionary's items() method returns a view of all the (key, value) pairs in the dictionary.
# Inspect the entries of support
print(support.items())
# Output
dict_items([((0, 1), 13), ((1, 0), 13), ((2, 4), 20), ((4, 2), 20), ((2, 3), 22), ((3, 2), 22), ((3, 4), 27), ((4, 3), 27), ((1, 3), 18), ((3, 1), 18), ((1, 4), 27), ((4, 1), 27), ((0, 2), 5), ((2, 0), 5), ((0, 4), 16), ((4, 0), 16), ((1, 2), 11), ((2, 1), 11), ((0, 3), 9), ((3, 0), 9)])
Each entry is a ((A, B), support) pair, so passing itemgetter(1) as the key of sorted sorts the rules by support.
from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
# Print the five rules with the highest support
for index in range(5):
    print("Rule #{0}".format(index + 1))
    A, B = sorted_support[index][0]
    print_rule(A, B, support, confidence, features)
Output:
Rule #1
Rule: If a person buy apples they will also buy bananas
- Support: 27
- Conclusion: 0.628
Rule #2
Rule: If a person buy bananas they will also buy apples
- Support: 27
- Conclusion: 0.474
Rule #3
Rule: If a person buy milk they will also buy bananas
- Support: 27
- Conclusion: 0.519
Rule #4
Rule: If a person buy bananas they will also buy milk
- Support: 27
- Conclusion: 0.474
Rule #5
Rule: If a person buy cheese they will also buy apples
- Support: 22
- Conclusion: 0.564
- Sorting by confidence
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
# Print the five rules with the highest confidence
for index in range(5):
    print("Rule #{0}".format(index + 1))
    A, B = sorted_confidence[index][0]
    print_rule(A, B, support, confidence, features)
Output:
Rule #1
Rule: If a person buy apples they will also buy bananas
- Support: 27
- Conclusion: 0.628
Rule #2
Rule: If a person buy bread they will also buy bananas
- Support: 16
- Conclusion: 0.571
Rule #3
Rule: If a person buy cheese they will also buy apples
- Support: 22
- Conclusion: 0.564
Rule #4
Rule: If a person buy milk they will also buy bananas
- Support: 27
- Conclusion: 0.519
Rule #5
Rule: If a person buy cheese they will also buy bananas
- Support: 20
- Conclusion: 0.513
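Judging by the sorted results, the rule "if a customer buys apples, they will also buy bananas" has the highest confidence and is tied for the highest support. So the supermarket could place bananas and apples together. Apples could be put on promotion, but apples and bananas should not be promoted at the same time: nearly 63% of the customers who buy apples also buy bananas without any promotion, so promoting bananas as well would not raise sales by much.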
Phew, finally finished, and happy about it! On to the next one!