Data source: Baby Goods Info Data
Goal: predict a baby's age from the parents' purchase behaviour
Background: the dataset offers very few fields, all of them numerically coded categorical variables, and after joining the two tables there are barely a thousand valid samples. Below is an attempt to predict baby age from purchase behaviour.
Association rules (Apriori)
import pandas as pd
import numpy as np
filename = 'train.csv' # the joined table
data = pd.read_csv(filename)
test = data.copy()
test['age'] = test['age'] // 365 # convert age from days to years
def func(x):
    if x <= 0:
        return 0
    elif x <= 3:
        return 1
    elif x <= 6:
        return 2
    else:
        return 3
test['age'] = test['age'].apply(func)
# the age distribution is too scattered, so bin it into groups
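(As an aside, the same binning could also be written with pandas' built-in pd.cut; the line below is a sketch of that alternative to func, applied to the raw year values and stored in a separate variable so it does not clobber the column above.)
# alternative to func (sketch): bins are the intervals (-inf, 0], (0, 3], (3, 6], (6, inf)
age_binned = pd.cut(test['age'], bins=[-np.inf, 0, 3, 6, np.inf], labels=[0, 1, 2, 3])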
# convert the columns to strings before concatenating them
# cat_id has a pronounced long tail, so use the root category cat1 instead
test['gender'] = test['gender'].astype('str')
test['cat1'] = test['cat1'].astype('str')
test['age'] = test['age'].astype('str')
test['concat1'] = test['cat1'].str.cat(test['gender'],sep='-')
test['concat'] = test['concat1'].str.cat(test['age'],sep='->')
# count occurrences of category-gender-age combinations
total = test.shape[0]
support = test['concat'].value_counts()      # counts of each cat1-gender->age combination
confidence = test['concat1'].value_counts()  # counts of each cat1-gender combination
lift = test['age'].value_counts()            # counts of each age bucket
# count occurrences of category-age combinations
confidence1 = test['cat1'].value_counts()                             # counts of each cat1
support1 = test['cat1'].str.cat(test['age'],sep='-').value_counts()   # counts of each cat1-age combination
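For reference, the two loops below compute the standard association-rule metrics for a rule A -> B, where A is the cat1-gender (or cat1) itemset and B is the age bucket:

support(A -> B) = count(A and B) / N
confidence(A -> B) = count(A and B) / count(A)
lift(A -> B) = confidence(A -> B) / (count(B) / N)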
# compute category-gender-age association metrics
rows = list()
for i in support.index:
    new = dict()
    new['rule'] = i.split('->')[0]
    new['result'] = i.split('->')[1]
    new['support'] = support[i] / total
    new['confidence'] = support[i] / confidence[new['rule']]
    new['lift'] = new['confidence'] / (lift[new['result']] / total)
    rows.append(new)
result = pd.DataFrame(rows, columns=['rule', 'result', 'support', 'confidence', 'lift'])
# compute category-age association metrics
rows1 = list()
for i in support1.index:
    new = dict()
    new['rule'] = i.split('-')[0]
    new['result'] = i.split('-')[1]
    new['support'] = support1[i] / total
    new['confidence'] = support1[i] / confidence1[new['rule']]
    new['lift'] = new['confidence'] / (lift[new['result']] / total)
    rows1.append(new)
result1 = pd.DataFrame(rows1, columns=['rule', 'result', 'support', 'confidence', 'lift'])
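As a cross-check (not part of the original analysis), the same rules can be mined with mlxtend's Apriori implementation; the sketch below assumes mlxtend is installed and one-hot encodes the cat1/gender/age columns into per-row transactions.
# optional cross-check with mlxtend (sketch; requires `pip install mlxtend`)
from mlxtend.frequent_patterns import apriori, association_rules
basket = pd.get_dummies(test[['cat1', 'gender', 'age']]).astype(bool)
frequent = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent, metric='confidence', min_threshold=0.1)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())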
Category-gender-age association rules
result.sort_values('result').reset_index(drop=True)

PS: the gender field takes three values here: 0, 1 and 2
Category-age association rules
result1.sort_values('result').reset_index(drop=True)

Conclusion: although the dataset contains 30,000 transaction records, fewer than 1,000 valid records remain after joining with the age table. The small sample size, combined with the scattered age distribution, leaves the final rules with very low support and meaningless confidence, so no useful conclusion can be drawn.
KNN
# split the data into features and label
test = test.drop(columns=['user_id','auction_id','cat_id','day','birthday'])
age = test['age'].values
data_size = test.shape[0]
# one-hot encoding
test['cat1'] = test['cat1'].astype(str)
test['gender'] = test['gender'].astype(str)
new_data = pd.get_dummies(test).drop(columns=['age']).values
def KNN(x, y, k):
    # k is the number of nearest neighbours
    predict = list()
    for i in x:
        result = dict()
        # compute Euclidean distances from this point to every sample
        diffMat = np.tile(i, (data_size, 1)) - new_data
        sqDiffMat = diffMat ** 2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # indices of the samples sorted by increasing distance
        sortedDistIndicies = distances.argsort()
        # count votes among the k nearest neighbours
        for j in range(k):
            label = y[sortedDistIndicies[j]]
            result[label] = result.get(label, 0) + 1
        result = sorted(result.items(), key=lambda item: item[1], reverse=True)
        predict.append(result[0][0])
    return predict
pred_age = KNN(new_data,age,3)
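The loop above recomputes distances row by row; the same vote can also be written in a vectorized form with scipy. This is only a sketch (not part of the original code) and assumes the new_data and age arrays built above.
# vectorized variant (sketch): full distance matrix, then a majority vote per row
from scipy.spatial.distance import cdist
def KNN_vectorized(x, y, k):
    dists = cdist(x, new_data)                   # pairwise Euclidean distances
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest samples
    labels = y[nearest]                          # neighbour labels, one row per query
    return [pd.Series(row).value_counts().idxmax() for row in labels]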
# model evaluation
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# note: the predictions are passed as the first argument, so the rows of the
# confusion matrix and the support column below count predicted labels
print(accuracy_score(pred_age, age))
print(confusion_matrix(pred_age, age))
print(classification_report(pred_age, age))
On the data-processing side, no train/test split was made here. Since the purchase-quantity values are comparatively large and could dominate the distance computation, several scaling schemes were tried, but the best result actually came from not scaling at all.
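For reference, a proper split plus scaling could look like the sketch below (illustrative parameter choices; the numbers reported in this post were produced without either step).
# sketch of a train/test split and feature scaling (not part of the original run)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(new_data, age, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)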
The results are as follows:
# accuracy
0.5753138075313807
# confusion matrix
[[226  79   8   1]
 [200 321  94  22]
 [  1   0   3   1]
 [  0   0   0   0]]
# classification report
              precision    recall  f1-score   support

           0       0.53      0.72      0.61       314
           1       0.80      0.50      0.62       637
           2       0.03      0.60      0.05         5
           3       0.00      0.00      0.00         0

    accuracy                           0.58       956
   macro avg       0.34      0.46      0.32       956
weighted avg       0.71      0.58      0.61       956
Judging from these results, the model's predictive power is weak. Possible next steps include balancing the class distribution, handling outliers, and tuning the choice of k.
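On the last point, k could for instance be tuned by cross-validation; the sketch below (illustrative only) uses scikit-learn's classifier rather than the hand-written version above.
# sketch: pick k by 5-fold cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), new_data, age, cv=5)
    print(k, scores.mean())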
In addition, the scikit-learn KNN implementation is called directly so the two results can be compared.
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(new_data,age)
clf_pre = clf.predict(new_data)
print(accuracy_score(clf_pre,age))
print(confusion_matrix(clf_pre,age))
The output is:
0.4905857740585774
[[207  79   9   1]
 [155 224  58   8]
 [ 65  97  38  15]
 [  0   0   0   0]]
The hand-written implementation scores slightly better than the library version. With the other parameters at their defaults, the KNN package also uses Euclidean distance, so the difference in the final results should lie in how the output label is decided once the neighbours have been found.
The hand-written version decides by the frequency of the neighbour labels, while the KNN package decides according to its weights parameter.
The official documentation describes it as follows:
weights : str or callable, optional (default = ‘uniform’)
weight function used in prediction. Possible values:
‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
[callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
For more detail, refer to the official scikit-learn documentation:
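Following on from that, the two weighting modes can be compared directly; the sketch below reuses the data from above and, like the rest of this section, evaluates in-sample, so the numbers are only indicative.
# sketch: compare uniform vs distance weighting
for w in ('uniform', 'distance'):
    clf_w = KNeighborsClassifier(n_neighbors=3, weights=w)
    clf_w.fit(new_data, age)
    print(w, accuracy_score(age, clf_w.predict(new_data)))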