Affinity Analysis in Python

Latest recommended article published on 2022-07-07 15:34:51

甲鱼呀！

Article link: https://blog.youkuaiyun.com/qq_44344446/article/details/97273243

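This article introduces affinity analysis as a data mining technique and uses Python to show how support and confidence measure the quality of a rule, so that related products can be identified and sales increased. The worked example analyzes customer purchases and finds a strong association between apples and bananas.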

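Affinity analysis:
Affinity analysis determines how closely related samples are, based on the similarity between them.
- Typical application scenarios
- Offering varied services or targeted advertising to the users of a website;
- Selling users related items in order to recommend movies or products to them;
- Finding people who are related by blood based on their genes, and so on.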
  
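Scenario for this example:
When a customer buys a product, the store has a chance to learn what else they would like to buy, so that products most customers are willing to buy together can be sold side by side to increase sales. Once the store has collected enough data, it can run an affinity analysis on it to decide which products are best sold together.
Here we only consider rules involving two products bought in one transaction (rules over more products are more complex).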

We define the following rule:
If a person buys product A, they are likely to buy product B as well.
Let's start the analysis!

1. Loading the data

import numpy as np
data = "affinity_dataset.txt"
X = np.loadtxt(data)

print(X[:5])  # show the first five rows

Output:

[[0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1.]
 [1. 1. 0. 0. 0.]
 [0. 0. 1. 1. 1.]]

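The first row, [0, 1, 0, 0, 0], lists the products in the first transaction. Each column stands for one product; in this example the five products are bread, milk, cheese, apples, and bananas. A 1 means the customer bought at least one unit of that product, and a 0 means the customer did not buy it. So [0, 1, 0, 0, 0] means the customer bought only milk.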
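
To make the encoding concrete, here is a small sketch that turns a row back into product names. The helper name `decode_row` is hypothetical (it is not part of the original code), and the rows are typed in by hand from the printed output above rather than loaded from the file.

```python
import numpy as np

# Column order as described above: bread, milk, cheese, apples, bananas
features = ["bread", "milk", "cheese", "apples", "bananas"]

def decode_row(row, names):
    """Hypothetical helper: names of the products whose column is 1 in this row."""
    return [name for name, flag in zip(names, row) if flag == 1]

# First three rows copied by hand from the output above
rows = np.array([
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
], dtype=float)

for row in rows:
    print(decode_row(row, features))
```

The first row decodes to milk only, which matches the reading given in the text.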

n_samples, n_features = X.shape
print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]

Output:

This dataset has 100 samples and 5 features

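2. Measuring rule quality with support and confidence
A simple, brute-force way to implement our rule is to find every pair of products in the dataset that were bought together. Once the rules are found, we still need to judge how good they are. The two most common measures are support and confidence.
Support: the number of times the rule holds in the dataset. It is sometimes divided by the total number of transactions to give the proportion of transactions in which the rule holds.
Confidence: how accurate the rule is, that is, among the transactions that satisfy the premise (the "if" part), the proportion that also satisfy the conclusion.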
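
As a rough illustration of the two measures, the sketch below computes them for one rule on a tiny hand-made array. The function name `support_and_confidence` and the array `toy` are assumptions made for this example and do not appear in the original code; support is taken as a raw count.

```python
import numpy as np

def support_and_confidence(X, premise, conclusion):
    """Support: rows where premise and conclusion are both 1 (a count).
    Confidence: support divided by the number of rows where the premise is 1."""
    premise_rows = X[:, premise] == 1
    both = premise_rows & (X[:, conclusion] == 1)
    support = int(both.sum())
    n_premise = int(premise_rows.sum())
    # Avoid dividing by zero when the premise never occurs
    confidence = support / n_premise if n_premise else 0.0
    return support, confidence

# Made-up data: column 0 is the premise item, column 1 the conclusion item
toy = np.array([
    [1, 1],
    [1, 0],
    [1, 1],
    [0, 1],
])

support, confidence = support_and_confidence(toy, premise=0, conclusion=1)
print(support, round(confidence, 3))
```

Here the premise holds in three rows and the conclusion also holds in two of them, so the support is 2 and the confidence is 2/3.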

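Below we take "if a customer buys apples, they will also buy bananas" as the rule and compute its support and confidence.

We use sample for one transaction and iterate over X, checking whether the apples value in each transaction is 1, i.e. whether sample[3] equals 1.

** Count the people who bought apples: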

num_apple_purchases = 0
for sample in X:
   if sample[3] == 1:  # This person bought Apples
       num_apple_purchases += 1
print("{0} people bought Apples".format(num_apple_purchases))

Output:

43 people bought Apples
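
The same kind of count can be obtained without an explicit loop by using a NumPy boolean mask. This is only a sketch: `X_demo` is a stand-in built from the five rows printed earlier (the full file is not reproduced here), so its counts are much smaller than the 43 above.

```python
import numpy as np

features = ["bread", "milk", "cheese", "apples", "bananas"]

# Stand-in for X: the five rows shown in the first output block
X_demo = np.array([
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

# Rows where the apples column (index 3) is 1, counted via the boolean mask
num_apple_purchases = int((X_demo[:, 3] == 1).sum())
print("{0} people bought Apples".format(num_apple_purchases))

# Summing each column gives the number of buyers of every product at once
buyers_per_product = X_demo.sum(axis=0).astype(int)
for name, count in zip(features, buyers_per_product):
    print(name, count)
```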

** Count the people who bought both apples and bananas (i.e. the cases where the rule holds):

rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print("{0} cases of the rule are valid".format(rule_valid))
print("{0} cases of the rule are invalid".format(rule_invalid))

Output:

27 cases of the rule are valid
16 cases of the rule are invalid

27 people bought both apples and bananas.

** Compute the support and confidence:

support = rule_valid 
confidence = rule_valid / num_apple_purchases
print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
print("As a percentage, that is {0:.1f}%.".format(100 * confidence))

Output:

The support is 27 and the confidence is 0.628.
As a percentage, that is 62.8%.

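Computing the statistics for every rule in the dataset
Rule:
If a customer buys A, they will also buy B.

** Create three dictionaries to hold the results: the number of times each rule holds, the number of times it fails, and the number of times each premise occurs.

defaultdict: if a key does not exist, a default value is returned. With a plain dict, looking up a missing key raises a KeyError.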
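
A minimal sketch of that difference, using made-up keys: incrementing a missing key fails on a plain dict but works on a `defaultdict(int)`, because `int()` supplies 0 as the starting value.

```python
from collections import defaultdict

plain = {}
try:
    plain["apples"] += 1        # key is missing, so a plain dict raises KeyError
except KeyError:
    raised = True
else:
    raised = False

counts = defaultdict(int)       # missing keys start at int() == 0
counts["apples"] += 1
counts["apples"] += 1

print(raised, counts["apples"], counts["bananas"])
```

Note that merely reading `counts["bananas"]` also inserts that key with the value 0.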

from collections import defaultdict
valid_rules  = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

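for sample in X:
    for A in range(n_features):
        if sample[A] == 0: continue
        num_occurences[A] += 1  # the premise holds (value is 1), so count one more occurrence of it
        for B in range(n_features):
            if A == B: continue  # skip the case A == B
            if sample[B] == 1:
                # the customer bought both A and B: the rule holds
                valid_rules[(A, B)] += 1
            else:
                # the customer bought A but not B: the rule fails
                invalid_rules[(A, B)] += 1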
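
For 0/1 data, the counts gathered by the nested loops can also be read from a co-occurrence matrix. The sketch below is an alternative formulation, not the method used in this article; `X_demo` again stands in for X using the five rows printed earlier, and `rule_confidence` is a hypothetical helper.

```python
import numpy as np

# Stand-in for X: the five rows shown in the first output block
X_demo = np.array([
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
])

# co[A, B] = number of rows containing both A and B, i.e. the support of A -> B
co = X_demo.T @ X_demo
# The diagonal co[A, A] is the number of rows containing A
occurrences = np.diag(co)

def rule_confidence(A, B):
    """Hypothetical helper: confidence of the rule 'if A then B'."""
    return co[A, B] / occurrences[A]

print(co[1, 0], rule_confidence(1, 0))   # milk -> bread
print(co[2, 4], rule_confidence(2, 4))   # cheese -> bananas
```

Unlike the dictionaries, the matrix also stores zero counts and the A == B diagonal, so those entries have to be skipped when listing rules.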

** Compute the support and confidence of each rule

support = valid_rules  # support
confidence = defaultdict(float)  # confidence
for A, B in valid_rules.keys():
    confidence[(A, B)] = valid_rules[(A, B)] / num_occurences[A]

Print each rule with its confidence and support:

for A, B in confidence:
    A_name = features[A]
    B_name = features[B]
    print("Rule: If a person buys {0} they will also buy {1}".format(A_name, B_name))
    print(" - Confidence: {0:.3f}".format(confidence[(A, B)]))
    print(" - Support: {0}".format(support[(A, B)]))
    print("")

Output:

Rule: If a person buys bread they will also buy milk
 - Confidence: 0.464
 - Support: 13

Rule: If a person buys milk they will also buy bread
 - Confidence: 0.250
 - Support: 13

Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.513
 - Support: 20

Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.351
 - Support: 20

Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.564
 - Support: 22

Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.512
 - Support: 22

Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.628
 - Support: 27

Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.474
 - Support: 27

Rule: If a person buys milk they will also buy apples
 - Confidence: 0.346
 - Support: 18

Rule: If a person buys apples they will also buy milk
 - Confidence: 0.419
 - Support: 18

Rule: If a person buys milk they will also buy bananas
 - Confidence: 0.519
 - Support: 27

Rule: If a person buys bananas they will also buy milk
 - Confidence: 0.474
 - Support: 27

Rule: If a person buys bread they will also buy cheese
 - Confidence: 0.179
 - Support: 5

Rule: If a person buys cheese they will also buy bread
 - Confidence: 0.128
 - Support: 5

Rule: If a person buys bread they will also buy bananas
 - Confidence: 0.571
 - Support: 16

Rule: If a person buys bananas they will also buy bread
 - Confidence: 0.281
 - Support: 16

Rule: If a person buys milk they will also buy cheese
 - Confidence: 0.212
 - Support: 11

Rule: If a person buys cheese they will also buy milk
 - Confidence: 0.282
 - Support: 11

Rule: If a person buys bread they will also buy apples
 - Confidence: 0.321
 - Support: 9

Rule: If a person buys apples they will also buy bread
 - Confidence: 0.209
 - Support: 9

We can also wrap this in a function (note that it prints the confidence value under the label "Conclusion"):

def print_rule(A, B, support, confidence, features):
    A_name = features[A]
    B_name = features[B]
    print("Rule: If a person buy {0} they will also buy {1}".format(A_name, B_name))
    print(" - Support: {0}".format(support[(A,B)]))
    print(" - Conclusion: {0:.3f}".format(confidence[(A,B)]))

A = 1
B = 3
print_rule(A, B, support, confidence, features)

Output:

Rule: If a person buy milk they will also buy apples
 - Support: 18
 - Conclusion: 0.346

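Sorting

Sorting by support

The operator.itemgetter function
The itemgetter function from the operator module fetches items from an object at given positions; its arguments are the indices (or keys) of the items to fetch.
Example: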

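>>> import operator
>>> a = [1, 2, 3]
>>> b = operator.itemgetter(1)      # define a function b that fetches the item at index 1
>>> b(a)
2
>>> b = operator.itemgetter(1, 0)   # define a function b that fetches the items at index 1 and index 0
>>> b(a)
(2, 1)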

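Note: operator.itemgetter does not return a value directly. It returns a function, and you get the values by calling that function on an object.

A dictionary's items() method returns a view of all its (key, value) pairs.

# Inspect the items of support
print(support.items())
# Output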
dict_items([((0, 1), 13), ((1, 0), 13), ((2, 4), 20), ((4, 2), 20), ((2, 3), 22), ((3, 2), 22), ((3, 4), 27), ((4, 3), 27), ((1, 3), 18), ((3, 1), 18), ((1, 4), 27), ((4, 1), 27), ((0, 2), 5), ((2, 0), 5), ((0, 4), 16), ((4, 0), 16), ((1, 2), 11), ((2, 1), 11), ((0, 3), 9), ((3, 0), 9)])

Each item is a (rule, support) pair, so passing itemgetter(1) as the sort key to sorted sorts the rules by support.
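
As a small sketch of how the key works, the example below sorts a handful of (rule, support) pairs copied from the support.items() output above. The `lambda` version is shown only for comparison; it is an equivalent way to write the same key.

```python
from operator import itemgetter

# A few (rule, support) pairs copied from the support.items() output above
pairs = [((0, 1), 13), ((2, 4), 20), ((3, 4), 27), ((0, 2), 5)]

# itemgetter(1) picks element 1 of each pair, i.e. the support count
by_support = sorted(pairs, key=itemgetter(1), reverse=True)
for rule, count in by_support:
    print(rule, count)

# The same ordering written with a lambda as the key
same = sorted(pairs, key=lambda kv: kv[1], reverse=True)
```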

from operator import itemgetter
sorted_support = sorted(support.items(), key = itemgetter(1), reverse = True)
# Print the five rules with the highest support
for index in range(5):
    print("Rule #{0}".format(index + 1))
    A, B, = sorted_support[index][0]
    print_rule(A, B, support, confidence, features)

Output:

Rule #1
Rule: If a person buy apples they will also buy bananas
 - Support: 27
 - Conclusion: 0.628
Rule #2
Rule: If a person buy bananas they will also buy apples
 - Support: 27
 - Conclusion: 0.474
Rule #3
Rule: If a person buy milk they will also buy bananas
 - Support: 27
 - Conclusion: 0.519
Rule #4
Rule: If a person buy bananas they will also buy milk
 - Support: 27
 - Conclusion: 0.474
Rule #5
Rule: If a person buy cheese they will also buy apples
 - Support: 22
 - Conclusion: 0.564
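
Four rules tie at a support of 27, and sorted keeps tied items in their original order. One possible way to break such ties, sketched here as an assumption rather than something the article does, is to sort on a (support, confidence) tuple. The two dictionaries below hold only the five rules printed above, with confidence values rounded as shown.

```python
# Values copied from the printed rules above (confidence rounded to 3 places)
support = {(3, 4): 27, (4, 3): 27, (1, 4): 27, (4, 1): 27, (2, 3): 22}
confidence = {(3, 4): 0.628, (4, 3): 0.474, (1, 4): 0.519, (4, 1): 0.474, (2, 3): 0.564}

# Sort by support first; rules with equal support are ordered by confidence
ranked = sorted(support, key=lambda rule: (support[rule], confidence[rule]), reverse=True)

for rule in ranked:
    print(rule, support[rule], confidence[rule])
```

With this key, the milk-to-bananas rule (1, 4) moves ahead of the bananas-to-apples rule (4, 3), because both have a support of 27 but the former has the higher confidence.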

Sorting by confidence

sorted_confidence = sorted(confidence.items(), key = itemgetter(1), reverse = True)
# Print the five rules with the highest confidence
for index in range(5):
    print("Rule #{0}".format(index + 1))
    A, B, = sorted_confidence[index][0]
    print_rule(A, B, support, confidence, features)

Output:

Rule #1
Rule: If a person buy apples they will also buy bananas
 - Support: 27
 - Conclusion: 0.628
Rule #2
Rule: If a person buy bread they will also buy bananas
 - Support: 16
 - Conclusion: 0.571
Rule #3
Rule: If a person buy cheese they will also buy apples
 - Support: 22
 - Conclusion: 0.564
Rule #4
Rule: If a person buy milk they will also buy bananas
 - Support: 27
 - Conclusion: 0.519
Rule #5
Rule: If a person buy cheese they will also buy bananas
 - Support: 20
 - Conclusion: 0.513

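From the sorted results, the rule "if a customer buys apples, they will also buy bananas" has the highest confidence and shares the highest support (27, tied with three other rules). The supermarket could therefore place bananas next to apples. It could run a promotion on apples, but it should not promote apples and bananas at the same time: nearly 63% of the customers who buy apples buy bananas even without a promotion, so promoting bananas as well would not raise sales by much.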

Phew, finally done, and happy about it! Onward!