DaGuan Cup - Feature Engineering 4 (Feature Selection)

This post shows how to use a linear SVM and logistic regression for feature selection: key features are selected from tf-idf(word) and saved to disk, as a practical feature-engineering step.

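Feature Engineering 2 already touched on feature selection, but the coverage there was incomplete. The notes below cover selecting features from a trained model.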

linearSVM

"""Use LinearSVC to select features from tfidf(word) and save the result locally.
tfidf(article) can be processed in the same way."""
import time
import pickle
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
t_start = time.time()
# Load the features
with open('tfidf_word.pkl', 'rb') as f:
    x_train, y_train, x_test = pickle.load(f)
# Feature selection: fit the model first, then hand it over with prefit=True
lsvc = LinearSVC(C=0.5, dual=False).fit(x_train, y_train)
slt = SelectFromModel(lsvc, prefit=True)
x_train_s = slt.transform(x_train)
x_test_s = slt.transform(x_test)
# Save the selected features locally
num_features = x_train_s.shape[1]
with open('linearsvm-tfidf(word).pkl', 'wb') as data_f:
    pickle.dump((x_train_s, y_train, x_test_s), data_f)
t_end = time.time()
# Report how many features were kept and how long the run took
print("kept %d features in %.1f s" % (num_features, t_end - t_start))

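Some notes: sklearn now ships with built-in support for selecting features from a model, so all you need to do is import it.
Any model that exposes a coef_ or feature_importances_ attribute can be used with SelectFromModel. The selection threshold is entirely up to you. Different thresholds usually keep different sets of features, so the threshold should be chosen to suit the problem at hand.
When using this method with prefit=True, as in the code above, the underlying model (here the linear model) must already be trained before it is passed to SelectFromModel.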
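
To make the point about thresholds concrete, here is a minimal sketch. It does not use the competition data: a small synthetic dataset from make_classification stands in for the tf-idf matrix, and the three threshold values are picked purely for illustration. It only shows that the number of retained features changes with the threshold.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in for the tf-idf features (assumption: the real
# tfidf_word.pkl file is not available here).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

# The estimator has to be fitted before it is passed in with prefit=True.
lsvc = LinearSVC(C=0.5, dual=False).fit(X, y)

# "median", "mean" and scaled forms such as "1.5*mean" are computed from the
# absolute coefficient values; a feature is kept if its value is >= threshold.
kept = {}
for threshold in ("median", "mean", "1.5*mean"):
    selector = SelectFromModel(lsvc, prefit=True, threshold=threshold)
    kept[threshold] = selector.transform(X).shape[1]
    print(threshold, "->", kept[threshold], "features kept")
```

A float can be passed as the threshold as well. Which value works best is an empirical question, and would normally be settled by validating the downstream classifier on the reduced feature set.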

LR

import time
import pickle
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
t_start = time.time()
# Load the tfidf(word) features
with open('tfidf_word.pkl', 'rb') as fp:
    x_train, y_train, x_test = pickle.load(fp)
# Feature selection: fit the model first, then hand it over with prefit=True
LR = LogisticRegression(C=120, dual=False).fit(x_train, y_train)
slt = SelectFromModel(LR, prefit=True)
x_train_s = slt.transform(x_train)
x_test_s = slt.transform(x_test)
# Save the selected features locally
num_features = x_train_s.shape[1]
with open('lr-tfidf(word).pkl', 'wb') as data_f:
    pickle.dump((x_train_s, y_train, x_test_s), data_f)
t_end = time.time()
# Report how many features were kept and how long the run took
print("kept %d features in %.1f s" % (num_features, t_end - t_start))
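
Since the two models above generally do not keep the same columns, it can be useful to look at how much their selections overlap. The sketch below is an illustration under assumptions, not part of the original pipeline: it uses synthetic data again, and max_iter=1000 is added to LogisticRegression only to avoid convergence warnings on this toy problem. get_support() returns a boolean mask over the original feature columns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic stand-in for the tf-idf features (assumption).
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=5, n_redundant=0, random_state=0
)

# Fit both models, then wrap each one in a prefit selector.
svm_sel = SelectFromModel(LinearSVC(C=0.5, dual=False).fit(X, y), prefit=True)
lr_sel = SelectFromModel(
    LogisticRegression(C=120, max_iter=1000).fit(X, y), prefit=True
)

# Boolean masks with one entry per original feature column.
svm_mask = svm_sel.get_support()
lr_mask = lr_sel.get_support()

# Column indices that both selectors keep.
both = np.flatnonzero(svm_mask & lr_mask)
print("LinearSVC keeps:", int(svm_mask.sum()))
print("LogisticRegression keeps:", int(lr_mask.sum()))
print("kept by both:", both.size)
```

On real tf-idf features the masks could be used in the same way, for example to build a union or an intersection of the two selections before training the final model.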