Affinity Analysis in Python

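This post introduces affinity analysis as used in data mining. Using Python, it shows how support and confidence measure rules so that associated products can be identified and sales increased. The worked example analyzes customer purchases and finds a strong association between apples and bananas.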


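  • Affinity analysis:
    Affinity analysis determines how closely related individual samples are, based on the similarity between them.
    - Typical applications

    • Offering varied services or targeted advertising to website users;

    • Selling users related items as part of recommending films or products;

    • Finding people who are related by blood based on their genes…

      Scenario for this example:
      When a customer buys a product, the store can take the chance to learn what else they want to buy, and then sell the products that most customers are willing to buy together side by side to increase sales. Once the store has collected enough data, it can run an affinity analysis on it to decide which products are suited to being sold together.
      Only the case of two products bought together is considered here (rules involving more products are more complex).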

We define the following rule:
If a person buys product A, they are also likely to buy product B.
Now let's start the analysis!

1. Load the data

import numpy as np
data = "affinity_dataset.txt"
X = np.loadtxt(data)
print(X[:5])  # look at the first five rows

Output:

[[0. 1. 0. 0. 0.]
 [1. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1.]
 [1. 1. 0. 0. 0.]
 [0. 0. 1. 1. 1.]]

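The first row, [0, 1, 0, 0, 0], lists the items in the first transaction. Each column stands for one product; in this example the five products are bread, milk, cheese, apples, and bananas. A 1 means the customer bought at least one unit of that product, and a 0 means they bought none. So [0, 1, 0, 0, 0] means the customer bought only milk.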
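To make the encoding concrete, here is a small sketch that turns rows back into product names. The `item_names` list and the `decode_row` helper are illustrative names that do not appear in the original code; the column order follows the description above, and the three rows are copied from the printed output.

```python
import numpy as np

# Column order as described above (an illustrative list, named only for this example).
item_names = ["bread", "milk", "cheese", "apples", "bananas"]

def decode_row(row):
    """Return the names of the products whose column value is 1."""
    return [name for name, value in zip(item_names, row) if value == 1]

# The first three rows from the printed output.
first_rows = np.array([
    [0., 1., 0., 0., 0.],
    [1., 1., 0., 0., 0.],
    [0., 0., 1., 0., 1.],
])

for row in first_rows:
    print(decode_row(row))
```

This prints one list of product names per transaction, starting with `['milk']` for the first row.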

n_samples, n_features = X.shape
print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
# The names of the features, for your reference.
features = ["bread", "milk", "cheese", "apples", "bananas"]

Output:

This dataset has 100 samples and 5 features

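2. Judging rules by support and confidence
The straightforward, brute-force way to implement our rule is to find every pair of products bought together in the dataset. Once the rules are found, we still need to judge how good they are. The two common measures are support and confidence.
Support: the number of times the rule holds in the dataset.
Confidence: of the cases where the premise holds, the fraction in which the conclusion holds as well.
Support measures how often a given rule holds (here as a raw count; it can also be expressed as a fraction of all transactions), while confidence measures how accurate the rule is.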
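As a quick illustration of the two measures, the sketch below computes them on a tiny made-up dataset for the rule "if apples, then bananas". The data and the variable names (`toy`, `premise_count`, `toy_support`, `toy_confidence`) are invented for this example and are not part of the real dataset.

```python
# A tiny made-up dataset: each tuple is (apples, bananas) for one transaction.
toy = [
    (1, 1),
    (1, 0),
    (1, 1),
    (0, 1),
    (0, 0),
]

# Number of transactions where the premise (bought apples) holds.
premise_count = sum(1 for apples, bananas in toy if apples == 1)

# Support as a raw count: premise and conclusion both hold.
toy_support = sum(1 for apples, bananas in toy if apples == 1 and bananas == 1)

# Confidence: the share of premise cases in which the conclusion also holds.
toy_confidence = toy_support / premise_count

# Support expressed as a fraction of all transactions.
toy_support_fraction = toy_support / len(toy)

print(toy_support, premise_count)                       # 2 3
print(round(toy_confidence, 3), toy_support_fraction)   # 0.667 0.4
```

Two of the three apple buyers also bought bananas, so the support is 2 and the confidence is about 0.667.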

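  • Below we take "if a customer buys apples, they will also buy bananas" as the rule and compute its support and confidence.

    Let sample denote one transaction and iterate over X, checking whether the apples value in the transaction is 1, that is, whether sample[3] is 1.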

** Count the people who bought apples:

num_apple_purchases = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        num_apple_purchases += 1
print("{0} people bought Apples".format(num_apple_purchases))

Output:

43 people bought Apples

** Count the people who bought both apples and bananas (that is, the cases where the rule holds)

rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:  # This person bought Apples
        if sample[4] == 1:
            # This person bought both Apples and Bananas
            rule_valid += 1
        else:
            # This person bought Apples, but not Bananas
            rule_invalid += 1
print("{0} cases of the rule are valid".format(rule_valid))
print("{0} cases of the rule are invalid".format(rule_invalid))

Output:

27 cases of the rule are valid
16 cases of the rule are invalid

27 people bought both apples and bananas.

** Compute the support and confidence

support = rule_valid 
confidence = rule_valid / num_apple_purchases
print("The support is {0} and the confidence is {1:.3f}.".format(support, confidence))
print("As a percentage, that is {0:.1f}%.".format(100 * confidence))

Output:

The support is 27 and the confidence is 0.628.
As a percentage, that is 62.8%.
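The same counts can also be obtained without explicit loops by using NumPy boolean masks. The following is a minimal sketch on a made-up array, `X_toy`, which stands in for `X` and uses the same column order; its values are invented and will not reproduce the numbers above.

```python
import numpy as np

# Made-up stand-in for X; columns: bread, milk, cheese, apples, bananas.
X_toy = np.array([
    [0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
], dtype=float)

bought_apples = X_toy[:, 3] == 1     # boolean mask: premise holds
bought_bananas = X_toy[:, 4] == 1    # boolean mask: conclusion holds

n_apples = int(bought_apples.sum())                     # premise count
n_valid = int((bought_apples & bought_bananas).sum())   # rule holds
n_invalid = n_apples - n_valid                          # premise holds, conclusion does not
conf = n_valid / n_apples

print(n_apples, n_valid, n_invalid, conf)   # 4 3 1 0.75
```

On the real dataset, replacing `X_toy` with `X` should give the same counts as the loops above, since both count rows where the relevant columns equal 1.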
  • Collect the statistics for every rule in the dataset
    Rule:
    If a customer buys A, they will also buy B.

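** Create three dictionaries (cases where the rule holds, cases where it does not, and the number of times each premise occurs) to store the results.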

defaultdict: if a key does not exist, a default value is returned. With a plain dict, looking up a missing key raises a KeyError.
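The difference is easy to see in a short sketch. The names `plain` and `counts` are made up for this illustration.

```python
from collections import defaultdict

plain = {}
try:
    plain["apples"] += 1        # the key does not exist yet
except KeyError:
    plain["apples"] = 1         # so it has to be initialised by hand

counts = defaultdict(int)       # int() returns 0, used as the default for missing keys
counts["apples"] += 1           # no KeyError: counting starts from 0
counts[(3, 4)] += 2             # tuple keys such as (A, B) work the same way

print(plain["apples"], counts["apples"], counts[(3, 4)], counts["bread"])   # 1 1 2 0
```

Note that merely reading `counts["bread"]` inserts that key with the default value 0, which is worth remembering when iterating over a defaultdict afterwards.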

from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
for sample in X:
    for A in range(n_features):
        if sample[A] == 0: continue
        num_occurences[A] += 1  # premise holds (value is 1): count one more occurrence of A
        for B in range(n_features):
            if A == B: continue  # skip the case A == B
            if sample[B] == 1:
                valid_rules[(A, B)] += 1
            else:
                invalid_rules[(A, B)] += 1

** Compute the support and confidence of each rule

support = valid_rules  # support
confidence = defaultdict(float)  # confidence
for A, B in valid_rules.keys():
    confidence[(A, B)] = valid_rules[(A, B)] / num_occurences[A]
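As an alternative to the nested loops, the pair counts can be computed with a single matrix product, because for a 0/1 matrix the entry (A, B) of the transpose times the matrix counts the rows containing both A and B. This is a sketch on a made-up array `X_toy`, not the method used in this post; `co_counts`, `occurrences`, and `conf_matrix` are illustrative names. It assumes every product occurs at least once, otherwise the division would hit zero.

```python
import numpy as np

# Made-up stand-in for X; columns: bread, milk, cheese, apples, bananas.
X_toy = np.array([
    [0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 1],
], dtype=float)

# co_counts[A, B] = number of transactions containing both A and B (support as a count).
co_counts = X_toy.T @ X_toy

# The diagonal holds the number of transactions containing each product on its own.
occurrences = np.diag(co_counts)

# conf_matrix[A, B] = confidence of the rule "if A then B"; divide row A by occurrences[A].
conf_matrix = co_counts / occurrences[:, None]

print(co_counts[3, 4], conf_matrix[3, 4])   # apples -> bananas
print(co_counts[0, 1], conf_matrix[0, 1])   # bread -> milk
```

The diagonal of `conf_matrix` is always 1 (the rule "if A then A") and should be ignored, just as the loop version skips the case A == B.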

Print each rule with its support and confidence:

for A, B in confidence:
    A_name = features[A]
    B_name = features[B]
    print("Rule: If a person buys {0} they will also buy {1}".format(A_name, B_name))
    print(" - Confidence: {0:.3f}".format(confidence[(A, B)]))
    print(" - Support: {0}".format(support[(A, B)]))
    print("")

Output:

Rule: If a person buys bread they will also buy milk
 - Confidence: 0.464
 - Support: 13

Rule: If a person buys milk they will also buy bread
 - Confidence: 0.250
 - Support: 13

Rule: If a person buys cheese they will also buy bananas
 - Confidence: 0.513
 - Support: 20

Rule: If a person buys bananas they will also buy cheese
 - Confidence: 0.351
 - Support: 20

Rule: If a person buys cheese they will also buy apples
 - Confidence: 0.564
 - Support: 22

Rule: If a person buys apples they will also buy cheese
 - Confidence: 0.512
 - Support: 22

Rule: If a person buys apples they will also buy bananas
 - Confidence: 0.628
 - Support: 27

Rule: If a person buys bananas they will also buy apples
 - Confidence: 0.474
 - Support: 27

Rule: If a person buys milk they will also buy apples
 - Confidence: 0.346
 - Support: 18

Rule: If a person buys apples they will also buy milk
 - Confidence: 0.419
 - Support: 18

Rule: If a person buys milk they will also buy bananas
 - Confidence: 0.519
 - Support: 27

Rule: If a person buys bananas they will also buy milk
 - Confidence: 0.474
 - Support: 27

Rule: If a person buys bread they will also buy cheese
 - Confidence: 0.179
 - Support: 5

Rule: If a person buys cheese they will also buy bread
 - Confidence: 0.128
 - Support: 5

Rule: If a person buys bread they will also buy bananas
 - Confidence: 0.571
 - Support: 16

Rule: If a person buys bananas they will also buy bread
 - Confidence: 0.281
 - Support: 16

Rule: If a person buys milk they will also buy cheese
 - Confidence: 0.212
 - Support: 11

Rule: If a person buys cheese they will also buy milk
 - Confidence: 0.282
 - Support: 11

Rule: If a person buys bread they will also buy apples
 - Confidence: 0.321
 - Support: 9

Rule: If a person buys apples they will also buy bread
 - Confidence: 0.209
 - Support: 9

We can also wrap this in a function:

def print_rule(A, B, support, confidence, features):
    A_name = features[A]
    B_name = features[B]
    print("Rule: If a person buy {0} they will also buy {1}".format(A_name, B_name))
    print(" - Support: {0}".format(support[(A,B)]))
    print(" - Conclusion: {0:.3f}".format(confidence[(A,B)]))
A = 1
B = 3
print_rule(A, B, support, confidence, features)

Output:

Rule: If a person buy milk they will also buy apples
 - Support: 18
 - Conclusion: 0.346
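  • Sorting
  1. Sort by support

The operator.itemgetter function
The itemgetter function from the operator module fetches the items at given positions of an object; its arguments are the indices of the items to fetch.
Example: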

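>>> import operator
>>> a = [1, 2, 3]
>>> b = operator.itemgetter(1)      # define function b, which fetches the item at index 1
>>> b(a)
2
>>> b = operator.itemgetter(1, 0)   # define function b, which fetches the items at index 1 and index 0
>>> b(a)
(2, 1)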

Note: operator.itemgetter does not fetch a value itself. It returns a function, and the value is only obtained by calling that function on an object.
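Because itemgetter returns a function, it can be passed directly as a sort key. Here is a minimal sketch on a made-up list of (rule, count) pairs; `pairs`, `get_count`, and `by_count` are names invented for this example.

```python
from operator import itemgetter

# Made-up (rule, count) pairs, shaped like the entries of a dictionary's items().
pairs = [((0, 1), 13), ((3, 4), 27), ((0, 2), 5)]

get_count = itemgetter(1)       # a function; nothing has been fetched yet
print(get_count(pairs[0]))      # 13

# sorted calls get_count on every element and orders the elements by the result.
by_count = sorted(pairs, key=get_count, reverse=True)
print(by_count)                 # [((3, 4), 27), ((0, 1), 13), ((0, 2), 5)]
```

With `reverse=True` the pair with the largest count comes first.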

A dictionary's items() method returns a view of all the (key, value) pairs in the dictionary.

# inspect the elements of support
print(support.items())
# output
dict_items([((0, 1), 13), ((1, 0), 13), ((2, 4), 20), ((4, 2), 20), ((2, 3), 22), ((3, 2), 22), ((3, 4), 27), ((4, 3), 27), ((1, 3), 18), ((3, 1), 18), ((1, 4), 27), ((4, 1), 27), ((0, 2), 5), ((2, 0), 5), ((0, 4), 16), ((4, 0), 16), ((1, 2), 11), ((2, 1), 11), ((0, 3), 9), ((3, 0), 9)])

So the sort key passed to sorted here is itemgetter(1), which sorts the rules by support.

from operator import itemgetter
sorted_support = sorted(support.items(), key = itemgetter(1), reverse = True)
# print the five rules with the highest support
for index in range(5):
    print("Rule #{0}".format(index + 1))
    A, B, = sorted_support[index][0]
    print_rule(A, B, support, confidence, features)

Output:

Rule #1
Rule: If a person buy apples they will also buy bananas
 - Support: 27
 - Conclusion: 0.628
Rule #2
Rule: If a person buy bananas they will also buy apples
 - Support: 27
 - Conclusion: 0.474
Rule #3
Rule: If a person buy milk they will also buy bananas
 - Support: 27
 - Conclusion: 0.519
Rule #4
Rule: If a person buy bananas they will also buy milk
 - Support: 27
 - Conclusion: 0.474
Rule #5
Rule: If a person buy cheese they will also buy apples
 - Support: 22
 - Conclusion: 0.564
  2. Sort by confidence
sorted_confidence = sorted(confidence.items(), key = itemgetter(1), reverse = True)
# print the five rules with the highest confidence
for index in range(5):
    print("Rule #{0}".format(index + 1))
    A, B, = sorted_confidence[index][0]
    print_rule(A, B, support, confidence, features)

Output:

Rule #1
Rule: If a person buy apples they will also buy bananas
 - Support: 27
 - Conclusion: 0.628
Rule #2
Rule: If a person buy bread they will also buy bananas
 - Support: 16
 - Conclusion: 0.571
Rule #3
Rule: If a person buy cheese they will also buy apples
 - Support: 22
 - Conclusion: 0.564
Rule #4
Rule: If a person buy milk they will also buy bananas
 - Support: 27
 - Conclusion: 0.519
Rule #5
Rule: If a person buy cheese they will also buy bananas
 - Support: 20
 - Conclusion: 0.513

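Judging from the sorted results, the rule "if a customer buys apples, they will also buy bananas" has the highest confidence and shares the highest support (27). So the supermarket can place bananas and apples next to each other. Apples can go on promotion, but apples and bananas should not be promoted at the same time: nearly 63% of the customers who buy apples also buy bananas even without a promotion, so promoting bananas as well would not lift sales by much.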

Phew, finally finished. Happy! On to the next one!
