Data source: Baby Goods Info Data
Goal: predict a baby's age from the parents' purchase behaviour
Background: the dataset offers very few fields, all of them numerically coded categorical variables, and after joining the two tables there are barely a thousand valid samples. Below is an attempt to predict baby age from purchase behaviour.
Association rules (Apriori)
import pandas as pd
import numpy as np
filename = 'train.csv' # the joined table
data = pd.read_csv(filename)
test = data.copy()
test['age'] = test['age'] // 365 # convert age from days to years
def func(x):
    if x <= 0:
        return 0
    elif x <= 3:
        return 1
    elif x <= 6:
        return 2
    else:
        return 3
test['age'] = test['age'].apply(func)
# the age distribution is too scattered, so bin it into groups
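(As an aside, the same binning could also be written with pandas' built-in pd.cut; the line below is a sketch of that alternative to func, applied to the raw year values and stored in a separate variable so it does not clobber the column above.)
# alternative to func (sketch): bins are the intervals (-inf, 0], (0, 3], (3, 6], (6, inf)
age_binned = pd.cut(test['age'], bins=[-np.inf, 0, 3, 6, np.inf], labels=[0, 1, 2, 3])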
# convert the columns to strings before concatenating them
# cat_id has a pronounced long tail, so use the root category cat1 instead
test['gender'] = test['gender'].astype('str')
test['cat1'] = test['cat1'].astype('str')
test['age'] = test['age'].astype('str')
test['concat1'] = test['cat1'].str.cat(test['gender'],sep='-')
test['concat'] = test['concat1'].str.cat(test['age'],sep='->')
# count occurrences of category-gender-age combinations
total = test.shape[0]
support = test['concat'].value_counts()      # counts of each cat1-gender->age combination
confidence = test['concat1'].value_counts()  # counts of each cat1-gender combination
lift = test['age'].value_counts()            # counts of each age bucket
# count occurrences of category-age combinations
confidence1 = test['cat1'].value_counts()                             # counts of each cat1
support1 = test['cat1'].str.cat(test['age'],sep='-').value_counts()   # counts of each cat1-age combination
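For reference, the two loops below compute the standard association-rule metrics for a rule A -> B, where A is the cat1-gender (or cat1) itemset and B is the age bucket:

support(A -> B) = count(A and B) / N
confidence(A -> B) = count(A and B) / count(A)
lift(A -> B) = confidence(A -> B) / (count(B) / N)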
# compute category-gender-age association metrics
rows = list()
for i in support.index:
    new = dict()
    new['rule'] = i.split('->')[0]
    new['result'] = i.split('->')[1]
    new['support'] = support[i] / total
    new['confidence'] = support[i] / confidence[new['rule']]
    new['lift'] = new['confidence'] / (lift[new['result']] / total)
    rows.append(new)
result = pd.DataFrame(rows, columns=['rule', 'result', 'support', 'confidence', 'lift'])
# compute category-age association metrics
rows1 = list()
for i in support1.index:
    new = dict()
    new['rule'] = i.split('-')[0]
    new['result'] = i.split('-')[1]
    new['support'] = support1[i] / total
    new['confidence'] = support1[i] / confidence1[new['rule']]
    new['lift'] = new['confidence'] / (lift[new['result']] / total)
    rows1.append(new)
result1 = pd.DataFrame(rows1, columns=['rule', 'result', 'support', 'confidence', 'lift'])
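As a cross-check (not part of the original analysis), the same rules can be mined with mlxtend's Apriori implementation; the sketch below assumes mlxtend is installed and one-hot encodes the cat1/gender/age columns into per-row transactions.
# optional cross-check with mlxtend (sketch; requires `pip install mlxtend`)
from mlxtend.frequent_patterns import apriori, association_rules
basket = pd.get_dummies(test[['cat1', 'gender', 'age']]).astype(bool)
frequent = apriori(basket, min_support=0.01, use_colnames=True)
rules = association_rules(frequent, metric='confidence', min_threshold=0.1)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())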
Category-gender-age association rules
result.sort_values('result').reset_index(drop=True)

PS: the gender field takes three values here: 0, 1 and 2
Category-age association rules
result1.sort_values('result').reset_index(drop=True)

Conclusion: although the dataset contains 30,000 transaction records, fewer than 1,000 valid records remain after joining with the age table. The small sample size, combined with the scattered age distribution, leaves the final rules with very low support and meaningless confidence, so no useful conclusion can be drawn.
KNN
# split the data into features and label
test = test.drop(columns=['user_id','auction_id','cat_id','day','birthday'])
age = test['age'].values
data_size = test.shape[0]
# one-hot encoding
test['cat1'] = test['cat1'].astype(str)
test['gender'] = test['gender'].astype(str)
new_data = pd.get_dummies(test).drop(columns=['age']).values
def KNN(x, y, k):
    # k is the number of nearest neighbours
    predict = list()
    for i in x:
        result = dict()
        # compute Euclidean distances from this point to every sample
        diffMat = np.tile(i, (data_size, 1)) - new_data
        sqDiffMat = diffMat ** 2
        sqDistances = sqDiffMat.sum(axis=1)
        distances = sqDistances ** 0.5
        # indices of the samples sorted by increasing distance
        sortedDistIndicies = distances.argsort()
        # count votes among the k nearest neighbours
        for j in range(k):
            label = y[sortedDistIndicies[j]]
            result[label] = result.get(label, 0) + 1
        result = sorted(result.items(), key=lambda item: item[1], reverse=True)
        predict.append(result[0][0])
    return predict
pred_age = KNN(new_data,age,3)
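The loop above recomputes distances row by row; the same vote can also be written in a vectorized form with scipy. This is only a sketch (not part of the original code) and assumes the new_data and age arrays built above.
# vectorized variant (sketch): full distance matrix, then a majority vote per row
from scipy.spatial.distance import cdist
def KNN_vectorized(x, y, k):
    dists = cdist(x, new_data)                   # pairwise Euclidean distances
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest samples
    labels = y[nearest]                          # neighbour labels, one row per query
    return [pd.Series(row).value_counts().idxmax() for row in labels]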
# model evaluation
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
# note: the predictions are passed as the first argument, so the rows of the
# confusion matrix and the support column below count predicted labels
print(accuracy_score(pred_age, age))
print(confusion_matrix(pred_age, age))
print(classification_report(pred_age, age))
On the data-processing side, no train/test split was made here. Since the purchase-quantity values are comparatively large and could dominate the distance computation, several scaling schemes were tried, but the best result actually came from not scaling at all.
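For reference, a proper split plus scaling could look like the sketch below (illustrative parameter choices; the numbers reported in this post were produced without either step).
# sketch of a train/test split and feature scaling (not part of the original run)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(new_data, age, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)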
The results are as follows:
# accuracy
0.5753138075313807
# confusion matrix
[[226  79   8   1]
 [200 321  94  22]
 [  1   0   3   1]
 [  0   0   0   0]]
# classification report
              precision    recall  f1-score   support

           0       0.53      0.72      0.61       314
           1       0.80      0.50      0.62       637
           2       0.03      0.60      0.05         5
           3       0.00      0.00      0.00         0

    accuracy                           0.58       956
   macro avg       0.34      0.46      0.32       956
weighted avg       0.71      0.58      0.61       956
Judging from these results, the model's predictive power is weak. Possible next steps include balancing the class distribution, handling outliers, and tuning the choice of k.
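On the last point, k could for instance be tuned by cross-validation; the sketch below (illustrative only) uses scikit-learn's classifier rather than the hand-written version above.
# sketch: pick k by 5-fold cross-validation
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
for k in range(1, 16, 2):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), new_data, age, cv=5)
    print(k, scores.mean())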
In addition, the scikit-learn KNN implementation is called directly so the two results can be compared.
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(new_data,age)
clf_pre = clf.predict(new_data)
print(accuracy_score(clf_pre,age))
print(confusion_matrix(clf_pre,age))
The output is:
0.4905857740585774
[[207  79   9   1]
 [155 224  58   8]
 [ 65  97  38  15]
 [  0   0   0   0]]
The hand-written implementation scores slightly better than the library version. With the other parameters at their defaults, the KNN package also uses Euclidean distance, so the difference in the final results should lie in how the output label is decided once the neighbours have been found.
The hand-written version decides by the frequency of the neighbour labels, while the KNN package decides according to its weights parameter.
The official documentation describes it as follows:
weights : str or callable, optional (default = ‘uniform’)
weight function used in prediction. Possible values:
‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
[callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
For more detail, refer to the official scikit-learn documentation:
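Following on from that, the two weighting modes can be compared directly; the sketch below reuses the data from above and, like the rest of this section, evaluates in-sample, so the numbers are only indicative.
# sketch: compare uniform vs distance weighting
for w in ('uniform', 'distance'):
    clf_w = KNeighborsClassifier(n_neighbors=3, weights=w)
    clf_w.fit(new_data, age)
    print(w, accuracy_score(age, clf_w.predict(new_data)))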