1. fit_transform() 2. get_dummies 3. pd.columns 4. Python string formatting with format()

This post covers data preprocessing with Python's sklearn library, including the difference between fit_transform() and transform() and when to use each, and explains in detail how to do one-hot encoding with pandas.


1. Python: the difference between sklearn's preprocessing functions fit_transform() and transform()

 

While working through the code in "Python Machine Learning and Practice" (《Python机器学习及实践》), I found the difference between the fit_transform() and transform() functions used in data preprocessing quite fuzzy. After consulting a number of sources, here is a summary.

 

The code involving these two functions is as follows:

 

 
# Import StandardScaler from sklearn.preprocessing
from sklearn.preprocessing import StandardScaler

# Standardize the data so that every feature dimension has variance 1 and
# mean 0, so that the predictions are not dominated by features with
# overly large values
ss = StandardScaler()

# fit_transform(): fit the data first, then standardize it
X_train = ss.fit_transform(X_train)

# transform(): standardize the data only
X_test = ss.transform(X_test)


Let's first look at the two functions' APIs and what they do:

 

1. fit_transform()

fit_transform() first fits the data, then transforms it into its standardized form.

2. transform()

transform() standardizes the data by centering and scaling, using statistics that were computed earlier.
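One practical consequence (an aside of mine, assuming current scikit-learn behavior): transform() can only reuse statistics that already exist, so calling it on a scaler that has never been fitted raises NotFittedError.

from sklearn.preprocessing import StandardScaler
from sklearn.exceptions import NotFittedError

ss = StandardScaler()
try:
    ss.transform([[1.0, 2.0]])  # no fit yet, so no stored mean/variance
except NotFittedError as err:
    print("transform() before fit():", err)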

 

At this point we seem to see part of the difference: as the names suggest, the former includes an extra step of fitting the data. So why not simply call fit_transform() everywhere when standardizing?

The reason is as follows:

To standardize the data (so that each feature has variance 1 and mean 0), we need to compute the mean μ and variance σ^2 of the feature data, and then normalize with the formula:

    x' = (x - μ) / σ

Calling fit_transform() on the training set computes the mean μ and variance σ^2, i.e. it discovers the transformation rule. We apply this rule to the training set, and we can apply exactly the same rule to the test set (and even a cross-validation set). So for the test set we only need to standardize the data; there is no need to fit it again.
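A minimal sketch of this rule reuse (the toy arrays below are illustrative, not from the book): the statistics learned by fit_transform() on the training data are stored on the scaler object and reapplied verbatim by transform().

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 250.0]])

ss = StandardScaler()
X_train_std = ss.fit_transform(X_train)  # learns mean/variance, then transforms

print(ss.mean_)   # [  2. 200.]          -- per-feature means learned from X_train
print(ss.scale_)  # [ 0.8165 81.6497]    -- per-feature standard deviations (approx.)

# transform() reuses the training statistics; it does NOT refit on X_test
X_test_std = ss.transform(X_test)
print(X_test_std)  # [[0.     0.6124]]   -- scaled with the training mean/std (approx.)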

 

2. One-hot encoding with pandas get_dummies

Official docs: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html

 

pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

Convert categorical variable into dummy/indicator variables

Parameters:

data : array-like, Series, or DataFrame

prefix : string, list of strings, or dict of strings, default None

String to append to DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

prefix_sep : string, default ‘_’

If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

dummy_na : bool, default False

Add a column to indicate NaNs, if False NaNs are ignored.

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.

sparse : bool, default False

Whether the dummy columns should be sparse or not. Returns a SparseDataFrame if data is a Series or if all columns are included. Otherwise returns a DataFrame with some SparseBlocks.

drop_first : bool, default False

Whether to get k-1 dummies out of k categorical levels by removing the first level.

New in version 0.18.0.

Returns:

dummies : DataFrame or SparseDataFrame

See also

Series.str.get_dummies

Examples

>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
>>> import numpy as np
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
   a  b
0  1  0
1  0  1
2  0  0
>>> pd.get_dummies(s1, dummy_na=True)
   a  b  NaN
0  1  0    0
1  0  1    0
2  0  0    1
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
...                    'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1
>>> pd.get_dummies(pd.Series(list('abcaa')))
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0
Encoding of discrete features falls into two cases:

1. The values of the discrete feature carry no ordinal meaning, e.g. color: [red, blue]; in this case use one-hot encoding.

2. The values carry ordinal meaning, e.g. size: [X, XL, XXL]; in this case use a numeric mapping such as {X: 1, XL: 2, XXL: 3}.

pandas makes it very convenient to one-hot encode discrete features:
 
 
import pandas as pd

df = pd.DataFrame([
    ['green', 'M', 10.1, 'class1'],
    ['red', 'L', 13.5, 'class2'],
    ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'prize', 'class label']

# Ordinal feature: map the sizes to integers that preserve their order
size_mapping = {
    'XL': 3,
    'L': 2,
    'M': 1}
df['size'] = df['size'].map(size_mapping)

# Class labels carry no order; enumerate the unique labels to build a mapping
class_mapping = {label: idx for idx, label in enumerate(set(df['class label']))}
df['class label'] = df['class label'].map(class_mapping)
 
Note: for a discrete feature whose values have ordinal meaning, a direct mapping such as {'XL': 3, 'L': 2, 'M': 1} is all you need.
 
Using get_dummies will create a new column for every unique string in a given column:

pd.get_dummies(df)
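Continuing the toy frame above, only the color column still has object dtype at this point, so get_dummies expands just that column. A sketch of what to expect (the exact class-label values depend on set iteration order):

print(pd.get_dummies(df))
# The object-dtype 'color' column is replaced by color_blue / color_green /
# color_red indicator columns, while the already-numeric 'size', 'prize'
# and 'class label' columns pass through unchanged.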
 

3. pd.columns and pd.columns.tolist()

print(food_info.columns)
# Output: all of the column names are printed in full, rather than being replaced by an ellipsis

Index(['NDB_No', 'Shrt_Desc', 'Water_(g)', 'Energ_Kcal', 'Protein_(g)', 'Lipid_Tot_(g)', 'Ash_(g)', 'Carbohydrt_(g)', 'Fiber_TD_(g)', 'Sugar_Tot_(g)', 'Calcium_(mg)', 'Iron_(mg)', 'Magnesium_(mg)', 'Phosphorus_(mg)', 'Potassium_(mg)', 'Sodium_(mg)', 'Zinc_(mg)', 'Copper_(mg)', 'Manganese_(mg)', 'Selenium_(mcg)', 'Vit_C_(mg)', 'Thiamin_(mg)', 'Riboflavin_(mg)', 'Niacin_(mg)', 'Vit_B6_(mg)', 'Vit_B12_(mcg)', 'Vit_A_IU', 'Vit_A_RAE', 'Vit_E_(mg)', 'Vit_D_mcg', 'Vit_D_IU', 'Vit_K_(mcg)', 'FA_Sat_(g)', 'FA_Mono_(g)', 'FA_Poly_(g)', 'Cholestrl_(mg)'], dtype='object')
The tolist() method converts the Index into a plain Python list:

food_info.columns.tolist()
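A minimal self-contained illustration (food_info above comes from a nutrition CSV; the three-column toy frame here is just for demonstration):

import pandas as pd

df = pd.DataFrame({'NDB_No': [1001], 'Shrt_Desc': ['BUTTER'], 'Water_(g)': [15.87]})
print(df.columns)                 # Index(['NDB_No', 'Shrt_Desc', 'Water_(g)'], dtype='object')
print(df.columns.tolist())        # ['NDB_No', 'Shrt_Desc', 'Water_(g)']
print(type(df.columns.tolist()))  # <class 'list'>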

4. Python string formatting with the format() method

1. Basic usage

String formatting uses the format() method. The basic form is:

    <template string>.format(<comma-separated arguments>)

Calling format() returns a new string; the arguments are numbered starting from 0.

"{}: CPU usage of computer {} is {}%.".format("2016-12-31","PYTHON",10)
Out[10]: '2016-12-31: CPU usage of computer PYTHON is 10%.'
The format() method makes it easy to concatenate variables or content of different types. If you need to output a literal brace, write {{ for { and }} for }, for example:
"{}{}{}".format("圆周率是",3.1415926,"...")
Out[11]: '圆周率是3.1415926...'
 
"圆周率{{{1}{2}}}是{0}".format("无理数",3.1415926,"...")
Out[12]: '圆周率{3.1415926...}是无理数'
 
s="圆周率{{{1}{2}}}是{0}" #大括号本身是字符串的一部分
 
s
Out[14]: '圆周率{{{1}{2}}}是{0}'
 
s.format("无理数",3.1415926,"...") #当调用format()时解析大括号
Out[15]: '圆周率{3.1415926...}是无理数'
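Beyond plain positional slots, format() also accepts explicit argument indices, keyword arguments, and a format specification after a colon. A small sketch of these patterns (my addition, not from the original post):

"{0} scored {1:.1f}% on {date}".format("PYTHON", 97.5, date="2016-12-31")
# -> 'PYTHON scored 97.5% on 2016-12-31'

"{:*^20}".format("centered")  # fill '*', center-align, total width 20
# -> '******centered******'

"{:.3f}".format(3.1415926)  # keep three digits after the decimal point
# -> '3.142'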

 
