Goal
Perform feature selection using IV values and a random forest, respectively.
Data Preprocessing
Before selecting features, preprocess the data. The code is as follows:
import numpy as np
import pandas as pd

data = pd.read_csv("D://project//金融数据分析//data.csv", encoding='gbk')
# Separate the label from the features
y = data['status']
x = data.drop('status', axis=1)
# Drop identifier columns that carry no predictive information
x.drop(['custid', 'trade_no', 'bank_card_no', 'id_name'], axis=1, inplace=True)
# Split each datetime field into year/month/day components
for col in ['first_transaction_time', 'latest_query_time', 'loans_latest_time']:
    dt = pd.to_datetime(data[col])
    x[col + '_year'] = dt.dt.year
    x[col + '_month'] = dt.dt.month
    x[col + '_day'] = dt.dt.day
# Drop the raw datetime fields now that their components are extracted
# (errors='ignore' in case id_name was already dropped above)
x.drop(["first_transaction_time", "latest_query_time", "loans_latest_time", "id_name"],
       axis=1, inplace=True, errors='ignore')
# Drop columns that are constant, or unique (or nearly unique) per row
for cl in list(x.columns):  # iterate over a copy, since columns are dropped inside the loop
    count = x[cl].count()
    if len(x[cl].unique()) in [1, count, count - 1]:
        x.drop(cl, axis=1, inplace=True)
# Encode the city-preference field as integer codes
dic = {city: i for i, city in enumerate(set(x['reg_preference_for_trad']))}
x['reg_preference_for_trad'] = x['reg_preference_for_trad'].map(dic)
x['student_feature'] = x['student_feature'].fillna(0)  # student field: fill missing values with 0
x.fillna(x.median(), inplace=True)  # remaining missing values: fill with column medians
x.info()  # info() prints its own summary, so no print() wrapper is needed
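As an aside, the enumerate-over-a-set mapping used for `reg_preference_for_trad` above can be replaced by `pd.factorize`, which is deterministic (set iteration order is not guaranteed across runs). A minimal sketch on hypothetical category labels:

```python
import pandas as pd

# Hypothetical category labels standing in for reg_preference_for_trad
s = pd.Series(['city_a', 'city_b', 'city_a', 'city_c'])
codes, uniques = pd.factorize(s)  # integer codes in order of first appearance
print(codes.tolist())  # [0, 1, 0, 2]
```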
IV-Based Feature Selection
Concept
Intuitively, "using IV to measure a variable's predictive power" can be understood like this: suppose a binary classification problem whose target has two classes, Y1 and Y2. To decide whether an individual A belongs to Y1 or Y2, we need a certain amount of information; call its total I. That information is contained in the explanatory variables C1, C2, C3, ..., Cn. For any single variable Ci, the more of this information it carries, the more it contributes to deciding whether A belongs to Y1 or Y2, the greater its information value, and the larger its IV; such a variable is a stronger candidate for the model's feature list.
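The intuition above corresponds to the standard WOE/IV formula: for each bin of a variable, IV accumulates (good_rate - bad_rate) * ln(good_rate / bad_rate), where the rates are each bin's share of all goods (class 0) and all bads (class 1). A small sketch on made-up data (the bin labels and targets are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up binned variable and binary target
df = pd.DataFrame({
    'bin':    ['A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'target': [0,   1,   0,   0,   1,   1,   1,   0],
})
goods = df[df['target'] == 0].groupby('bin').size()  # class-0 count per bin
bads = df[df['target'] == 1].groupby('bin').size()   # class-1 count per bin
good_rate = goods / goods.sum()
bad_rate = bads / bads.sum()
woe = np.log(good_rate / bad_rate)          # weight of evidence per bin
iv = ((good_rate - bad_rate) * woe).sum()   # information value of the variable
print(round(iv, 4))  # 0.3466
```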
Implementation
# IV calculation
def CalcIV(Xvar, Yvar):
    N_0 = np.sum(Yvar == 0)
    N_1 = np.sum(Yvar == 1)
    values = np.unique(Xvar)
    N_0_group = np.zeros(values.shape)
    N_1_group = np.zeros(values.shape)
    for i, v in enumerate(values):
        N_0_group[i] = Yvar[(Xvar == v) & (Yvar == 0)].count()
        N_1_group[i] = Yvar[(Xvar == v) & (Yvar == 1)].count()
    # A small epsilon guards against division by zero / log(0) when a group is empty
    eps = 1e-10
    dist_0 = N_0_group / N_0 + eps
    dist_1 = N_1_group / N_1 + eps
    iv = np.sum((dist_0 - dist_1) * np.log(dist_0 / dist_1))
    if iv >= 1.0:  # cap extreme values
        iv = 1
    return iv
def caliv_batch(df, Yvar):
    ivlist = []
    for col in df.columns:
        iv = CalcIV(df[col], Yvar)
        ivlist.append(iv)
    names = list(df.columns)
    iv_df = pd.DataFrame({'Var': names, 'Iv': ivlist}, columns=['Var', 'Iv'])
    return iv_df, ivlist
im_iv, ivl = caliv_batch(x, y)
print('im_iv:', im_iv)
print('ivl:', ivl)
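With `iv_df` in hand, feature selection reduces to a threshold cut. Rule-of-thumb IV bands treat values below roughly 0.02 as unpredictive; the exact cutoff here is an assumption, not something fixed by the original post. A sketch on a hypothetical result table:

```python
import pandas as pd

# Hypothetical output in the same shape as caliv_batch's iv_df
iv_df = pd.DataFrame({'Var': ['a', 'b', 'c'], 'Iv': [0.50, 0.01, 0.12]})
keep = iv_df.loc[iv_df['Iv'] >= 0.02, 'Var'].tolist()
print(keep)  # ['a', 'c']
```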
Reference: https://blog.youkuaiyun.com/zhangyunpeng0922/article/details/84591046#2__87
Random Forest
Concept
See: https://blog.youkuaiyun.com/zjuPeco/article/details/77371645?locationNum=7&fps=1
Implementation
from sklearn.ensemble import RandomForestClassifier

feat_labels = x.columns
forest = RandomForestClassifier(n_estimators=10000, random_state=0, n_jobs=1)
forest.fit(x, y)
importance = forest.feature_importances_
imp_result = np.argsort(importance)[::-1]  # indices sorted by descending importance
for i in range(x.shape[1]):
    # Index the names with imp_result too, so each name stays paired with its score
    print("%2d. %-*s %f" % (i + 1, 30, feat_labels[imp_result[i]], importance[imp_result[i]]))
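Beyond printing the ranking, the importances can drive an automatic cut; scikit-learn's SelectFromModel wraps exactly this idea. A minimal sketch on synthetic data (the dataset sizes and the median threshold are assumptions, not choices from the original post):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the credit data
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# Keep features whose importance is at or above the median importance
selector = SelectFromModel(forest, threshold='median').fit(X, y)
X_sel = selector.transform(X)
print(X_sel.shape)  # (200, 5): half the features survive
```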
Summary
I have been working a lot of overtime lately, so I will fill in the rest later...