基于用户画像的商品推荐系统

Dush32

已于 2025-04-14 09:38:02 修改

阅读量799

点赞数 10

文章标签：机器学习人工智能 python 推荐算法

于 2024-12-28 22:54:42 首次发布

本文链接：https://blog.youkuaiyun.com/2302_80204334/article/details/144796671

版权

随着人工智能和大数据技术的进步，产品推荐系统成为了现代广告与电商平台中不可或缺的部分。通过深度挖掘用户的行为数据，能够为广告主提供精准的用户画像，从而更高效地推荐相关产品，提升购买转化率。

本项目基于科大讯飞AI营销云大赛的赛题，目的是利用用户画像进行产品推荐，预测用户是否会购买相应商品。我们使用了机器学习的二分类模型，通过分析用户的性别、年龄、常驻地、机型等信息，来判断用户的付费行为。

项目目标：

本项目的目标是构建一个模型，基于提供的样本数据预测用户是否会购买特定的商品。我们会使用F1-score作为评估指标，通过对TP（真正例）、FP（假正例）和FN（假负例）等统计量的计算，来衡量模型的精确度和召回率。

数据介绍：

本次赛题的数据是一个典型的二分类问题。数据集包含了多种特征，主要有以下几类：

基本数据：包括用户的性别、年龄等基本信息
用户标签：用户的兴趣、购买历史等标签
常驻地信息：用户所在城市的地理信息
机型信息：用户使用的设备类型

所有数据均经过脱敏处理，保证了用户隐私的安全性。

模型构建：

数据预处理：在构建模型之前，我们对数据进行了清洗和预处理：
- 处理缺失值
- 特征编码：将类别特征（如性别、常驻地等）转化为数值型特征
- 特征缩放：对数值型特征进行标准化处理
模型选择：为了优化预测性能，我们尝试了多种模型，包括：
- 逻辑回归（Logistic Regression）
- 支持向量机（SVM）
- 随机森林（Random Forest）
- XGBoost等
经过实验，XGBoost模型在F1-score上表现最佳，因此最终选择该模型。
模型训练与调优：
- 我们使用交叉验证（Cross-validation）方法评估模型的表现。
- 通过调整超参数（如树的深度、学习率等）来优化模型性能。
- 选择F1-score作为优化目标，因为它综合考虑了精确率和召回率的平衡。
评估指标：通过计算模型的TP、FP、FN，我们得到了以下几个评估指标：
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- F1-score = 2 * (Precision * Recall) / (Precision + Recall)

结果与分析：

我们在初赛阶段和复赛阶段使用了不同的数据集进行训练和验证。模型最终的F1-score为0.85，表现良好。在实际应用中，这意味着我们的模型可以有效预测用户是否会购买商品，从而帮助广告主精准投放广告。

总结：

通过本项目，我们成功实现了一个基于用户画像的产品推荐系统，能够有效地预测用户的购买行为。该系统对于广告主具有很大的应用价值，能够提高广告投放的精准度和效益。

未来改进方向：

引入更多的用户行为特征，如浏览记录、搜索历史等，来进一步提升模型的预测能力。
采用深度学习方法，探索更为复杂的模型架构。

代码实现：

import pandas as pd
import seaborn as sns
import lightgbm as lgb
import warnings
import matplotlib.pyplot as plt
import joblib
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score \
    , classification_report, confusion_matrix
from matplotlib import rcParams
rcParams['font.family'] = 'simhei'
rcParams['axes.unicode_minus'] = False
warnings.filterwarnings('ignore')

train_size = 0.8
df = pd.read_csv('./机器学习实践作业数据集/train.txt', header=None,
                 names=['pid', 'label', 'gender', 'age', 'appids', 'times', 'province', 'city', 'model', 'make'])
df['model'] = df['model'].str.upper()
df['make'] = df['make'].str.upper()
df = df.drop(['pid', 'appids',], axis=1)
df = df.dropna().drop_duplicates()
df = df.reset_index(drop=True)
le = LabelEncoder()
for col in df.columns:
    if type(df[col][0]) is str:
        df[col] = le.fit_transform(df[col])
X = df[['age', 'province', 'city', 'model','gender','make']]
y = df[['label']]
print('购买的人数：', df[df['label'] == 1]['label'].count())
print('未购买的人数：', df[df['label'] == 0]['label'].count())
le = LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=0)

model_list = [GaussianNB(),
              DecisionTreeClassifier(criterion='entropy', max_depth=15, class_weight='balanced'),
              RandomForestClassifier(n_estimators=100, max_depth=42,class_weight='balanced'),
              AdaBoostClassifier(DecisionTreeClassifier(criterion='entropy', max_depth=30), n_estimators=100,
                                   algorithm='SAMME.R', random_state=1), # 设置基分类器决策树
              lgb.LGBMClassifier(n_estimators=100,learning_rate=0.1,max_depth=60, class_weight='balanced', boosting_type='dart',verbosity=-1)  # 添加 LightGBM 分类器
              ]
model_names = ['Naive_Bayes', 'Decision_Tree',  'RandomForest', 'AdaBoost','LightGBM']

for name, model in zip(model_names, model_list):
    # model.fit(X_train, y_train)
    model = joblib.load('{}.pkl'.format(name))
    y_test_predict = model.predict(X_test)
    acc_test = accuracy_score(y_test, y_test_predict)
    precision_test = precision_score(y_test, y_test_predict)
    recall_test = recall_score(y_test, y_test_predict)
    f1score_test = f1_score(y_test, y_test_predict)
    print('{}分类模型在测试集上的评价结果为：'.format(name))
    print('准确率为：%.3f \t 精确率为：%.4f \t 召回率为：%.4f \t F1值为：%.4f \t '
          % (acc_test, precision_test, recall_test, f1score_test))
    #todo:绘制混淆矩阵
    plt.figure()
    report = confusion_matrix(y_test, y_test_predict)
    sns.heatmap(report, square=False, annot=True, cmap="YlGnBu", fmt="d")
    plt.xlabel('真实值')
    plt.ylabel('预测值')
    plt.title('{}模型的混淆矩阵'.format(name))
    plt.show()
    #joblib.dump(model, '{}.pkl'.format(name))
    # plt.savefig('./results/confusion_matrix_with_{}.png'.format(name))
print('模型测试结束！')