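Feature Engineering 2 already touched on feature selection, but the coverage there was incomplete. This part covers selecting features from a trained model.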
linearSVM
""" Use LinearSVC to select features from tfidf(word) and save the result locally. tfidf(article) can be processed the same way. """
import time
import pickle
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
t_start = time.time()
# Load the features
with open('tfidf_word.pkl', 'rb') as f:
    x_train, y_train, x_test = pickle.load(f)
# Select features
lsvc = LinearSVC(C=0.5, dual=False).fit(x_train, y_train)
slt = SelectFromModel(lsvc, prefit=True)
x_train_s = slt.transform(x_train)
x_test_s = slt.transform(x_test)
# Save the selected features locally
num_features = x_train_s.shape[1]
with open('linearsvm-tfidf(word).pkl', 'wb') as f:
    pickle.dump((x_train_s, y_train, x_test_s), f)
t_end = time.time()
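Some notes: scikit-learn now ships model-based feature selection out of the box, so you only need to import it.
Note that SelectFromModel works with any model that exposes a coef_ or feature_importances_ attribute after fitting. The selection threshold is entirely up to you. Different thresholds usually select different sets of features, so choose one based on your actual data and task.
When using it with prefit=True, as in the code above, the underlying model (here, the linear model) must already be trained before it is passed to SelectFromModel.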
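To make the effect of the threshold concrete, here is a minimal sketch. It assumes a synthetic dataset from make_classification rather than the TF-IDF features used above, and the three threshold strings are only illustrative choices, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption): 30 features, 5 of them informative
X, y = make_classification(n_samples=200, n_features=30,
                           n_informative=5, random_state=0)

# The model must be trained first because prefit=True is used below
lsvc = LinearSVC(C=0.5, dual=False).fit(X, y)

# Count how many features survive under each threshold
counts = {}
for threshold in ("median", "mean", "1.25*mean"):
    slt = SelectFromModel(lsvc, prefit=True, threshold=threshold)
    counts[threshold] = slt.transform(X).shape[1]
    print(threshold, counts[threshold])
```

A feature is kept when the absolute value of its coefficient is at or above the threshold, so a stricter threshold such as "1.25*mean" never keeps more features than "mean".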
LR
import time
import pickle
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
t_start = time.time()
# Load the tfidf(word) features
with open('tfidf_word.pkl', 'rb') as fp:
    x_train, y_train, x_test = pickle.load(fp)
# Select features
LR = LogisticRegression(C=120, dual=False).fit(x_train, y_train)
slt = SelectFromModel(LR, prefit=True)
x_train_s = slt.transform(x_train)
x_test_s = slt.transform(x_test)
# Save the selected features locally
num_features = x_train_s.shape[1]
with open('lr-tfidf(word).pkl', 'wb') as data_f:
    pickle.dump((x_train_s, y_train, x_test_s), data_f)
t_end = time.time()
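Note!
prefit: bool, default False. Whether the model passed to SelectFromModel has already been trained. If True, call transform directly; in that case SelectFromModel cannot be used with cross_val_score, GridSearchCV, or similar utilities that clone the estimator. If False, call fit first and then transform.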
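The prefit=False case can be sketched as follows. This is only an illustration under assumptions: the data is synthetic, and the Pipeline step names and the C and max_iter values are arbitrary choices that do not come from the code above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption)
X, y = make_classification(n_samples=300, n_features=30,
                           n_informative=5, random_state=0)

# prefit defaults to False: the selector fits its own estimator,
# so it can be cloned and refit inside each cross-validation fold
pipe = Pipeline([
    ("select", SelectFromModel(LogisticRegression(C=1.0, max_iter=1000))),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=3)
print(scores)

# Used on its own: fit first, then transform (here in one call)
selector = SelectFromModel(LogisticRegression(C=1.0, max_iter=1000))
X_s = selector.fit_transform(X, y)
print(X_s.shape)
```

Leaving prefit at False is the safer choice whenever feature selection has to be part of cross-validation, because the selector is then refit on the training portion of each fold.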