Classification of competition problems:
There is nothing new under the sun: most competition problems have appeared before, just with a different scenario and dataset.
Modeling and problem-solving workflow
Understand the scenario and the objective
Understand the evaluation metric
1. Data processing
Data cleaning
Check whether the data is balanced; e.g. for ad clicks, not clicking is by far the more likely outcome.
Remove outliers, otherwise they will hurt model performance.
Discard implausible samples (this needs human judgment based on common sense).
Consider dropping fields where most values are missing.
Data sampling
Up-sampling / down-sampling to keep the classes roughly balanced (a ratio around 1:3 is workable; 1:10 is too imbalanced); see the sketch below.
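A minimal down-sampling sketch with sklearn.utils.resample, assuming a pandas DataFrame df with a binary label column (both names are made up for illustration):
import pandas as pd
from sklearn.utils import resample

majority = df[df['label'] == 0]
minority = df[df['label'] == 1]
# down-sample the majority class to roughly 3x the minority size (the ~1:3 rule of thumb above)
majority_down = resample(majority, replace=False,
                         n_samples=3 * len(minority), random_state=42)
df_balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)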
Data normalization
The training, validation, and test sets must all use the normalization fitted on the training set.
from sklearn.preprocessing import StandardScaler

# x_train: [None, 28, 28]; StandardScaler expects 2-D input, so flatten before scaling
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train.reshape(-1, 1)).reshape(-1, 28, 28)
x_valid_scaled = scaler.transform(x_valid.reshape(-1, 1)).reshape(-1, 28, 28)  # reuse the scaler fitted on the training set
Preprocessing of text data (a quick sketch follows):
Text length at word granularity
Text length at character granularity
Word counts over the corpus
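A possible pandas sketch of these text statistics (toy English corpus; for Chinese, word granularity would additionally need a tokenizer such as jieba):
import pandas as pd
from collections import Counter

texts = pd.Series(["the quick brown fox", "jumps over the lazy dog"])
char_len = texts.str.len()                   # text length at character granularity
word_len = texts.str.split().str.len()       # text length at word granularity
vocab = Counter(w for t in texts for w in t.split())   # word counts over the corpus
print(char_len.describe(), word_len.describe(), len(vocab))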
2. Feature engineering
Time features can be turned into interval-type features or frequency-type features.
For example, days until Double Eleven (Singles' Day), days until the next holiday; or how many purchases per week (see the sketch below).
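A possible pandas sketch of both kinds of time feature; the orders DataFrame and its columns are invented for illustration:
import pandas as pd

orders = pd.DataFrame({
    'user_id': [1, 1, 2],
    'order_time': pd.to_datetime(['2021-11-01', '2021-11-08', '2021-11-05']),
})
# interval-type feature: days until Double Eleven (Nov 11)
orders['days_to_1111'] = (pd.Timestamp('2021-11-11') - orders['order_time']).dt.days
# frequency-type feature: purchases per user per week
weekly_counts = (orders.set_index('order_time')
                       .groupby('user_id')
                       .resample('W')
                       .size())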
Text features: extract n-grams, build a bag of words with term counts, or use TF-IDF to weight terms.
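For example, with sklearn's vectorizers (toy documents, not from the original notes):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
bow = CountVectorizer(ngram_range=(1, 2))      # bag of words with unigram + bigram counts
counts = bow.fit_transform(docs)
tfidf = TfidfVectorizer(ngram_range=(1, 2))    # TF-IDF weights over the same n-grams
weights = tfidf.fit_transform(docs)
print(counts.shape, weights.shape)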
Statistical features describe where a value sits relative to the overall population.
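One way to express this with pandas (hypothetical income column): percentile rank and z-score both place a value relative to the whole population:
import pandas as pd

df = pd.DataFrame({'income': [3000, 5000, 8000, 12000, 40000]})
df['income_pct_rank'] = df['income'].rank(pct=True)                              # percentile rank in [0, 1]
df['income_zscore'] = (df['income'] - df['income'].mean()) / df['income'].std()  # distance from the mean in std units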
Feature extraction with sklearn (documentation):
6.3. Preprocessing data — scikit-learn 0.24.2 documentation
API Reference — scikit-learn 0.24.2 documentation
When the data volume is very large, do the scaling feature by feature (column by column).
The contribution of the features matters more than the choice of model.
Discretization
e.g. 1 if greater than some threshold, 0 if smaller
then one-hot encode the result
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit(data_used_to_fit_the_encoder)      # data used to fit the encoder
enc.transform(new_data).toarray()          # new data to encode

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]   # training data
enc.fit(X)
enc.transform([['Female', 1], ['Male', 4]]).toarray()   # test data
Missing values: if only a few are missing, fill with the mode; if a moderate amount is missing, treat "missing" as its own category and one-hot it together with the other categories; if a very large fraction is missing, drop the feature entirely.
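A small pandas sketch of the three strategies (column names are invented):
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['a', 'b', None, 'a', 'b'],
                   'age':  [21, np.nan, 35, 28, np.nan]})
df['age'] = df['age'].fillna(df['age'].mode()[0])          # few missing: fill with the mode
df['city'] = df['city'].fillna('missing')                  # moderate: treat "missing" as its own category
city_onehot = pd.get_dummies(df['city'], prefix='city')    # then one-hot together with the other categories
# very many missing: simply drop the column, e.g. df.drop(columns=['col'])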
If a feature has a long-tailed distribution with a huge spread, its contribution to the result is probably limited; if it is a continuous value, discretize it.
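For instance, quantile binning with pd.qcut is one way to discretize a long-tailed continuous feature (synthetic data for illustration):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=3, sigma=1.5, size=1000))     # long-tailed continuous feature
# equal-frequency buckets handle the long tail better than equal-width ones
amount_bin = pd.qcut(amount, q=10, labels=False, duplicates='drop')
amount_onehot = pd.get_dummies(amount_bin, prefix='amount_bin')     # then one-hot encode the bucket ids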
Feature selection: when feature transformations have generated a very large number of features, feature selection is needed (see the sketch after this list).
Recursive feature elimination (RFE): directly drop the features whose contribution to the final result has a small absolute value.
Feature selection using SelectFromModel: relies on the model's feature importances.
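A short sklearn sketch of both selectors on synthetic data (the estimators chosen here are only examples):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)
# recursive feature elimination: repeatedly drop the features with the smallest coefficients
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
# SelectFromModel: keep the features whose importance exceeds a threshold
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold='median')
X_sfm = sfm.fit_transform(X, y)
print(X_rfe.shape, X_sfm.shape)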
Feature fusion (with gplearn's SymbolicTransformer):
from gplearn.genetic import SymbolicTransformer

# a typical function set; the original notes did not define it
function_set = ['add', 'sub', 'mul', 'div', 'log', 'sqrt', 'abs', 'neg', 'max', 'min']
gp1 = SymbolicTransformer(generations=1, population_size=1000,
                          hall_of_fame=600, n_components=100,
                          function_set=function_set,
                          parsimony_coefficient=0.0005,
                          max_samples=0.9, verbose=1,
                          random_state=0, n_jobs=3)
label = data['TARGET']                     # replace with your label column
train = data.drop(columns=['TARGET'])      # features to fuse
# train = data[['col1', 'col2']]           # or select specific columns instead
gp1.fit(train, label)
new_df2 = gp1.transform(train)
# visualize one of the fused features
from IPython.display import Image
import pydotplus
graph = gp1._best_programs[0].export_graphviz()
graph = pydotplus.graphviz.graph_from_dot_data(graph)
Image(graph.create_png())
3. Model selection
4. Finding the best hyperparameters: cross-validation
Cross-validation: average the score over multiple train/test splits
Used to decide which model is better
from sklearn import svm
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X_train, y_train, cv=5)
scores
Grid search
Used to further pin down the model's hyperparameters
Parameter search (randomized search example):
import numpy as np
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distribution = {
    "hidden_layers": [1, 2, 3, 4],
    "layer_size": np.arange(1, 100),
    "learning_rate": reciprocal(1e-4, 1e-2),
}
# sklearn_model is presumably a Keras model wrapped as an sklearn estimator (e.g. KerasRegressor),
# which is why fit() below accepts epochs / validation_data / callbacks
random_search_cv = RandomizedSearchCV(sklearn_model, param_distribution, n_iter=10, n_jobs=5)
random_search_cv.fit(x_train, y_train, epochs=100,
                     validation_data=(x_valid, y_valid), callbacks=callbacks)
print(random_search_cv.best_params_)     # best parameters found
print(random_search_cv.best_score_)      # best cross-validation score
print(random_search_cv.best_estimator_)  # best fitted model
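For the grid search mentioned above, GridSearchCV works the same way; a minimal sketch reusing the X_train / y_train from the cross-validation example, with an SVM picked only for illustration:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(svm.SVC(), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)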
5. Model analysis and model ensembling
Use learning curves to diagnose overfitting or underfitting.
To fix overfitting, increasing the amount of training data is the most effective approach; adding regularization may also help.
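A possible sketch with sklearn's learning_curve, assuming the X_train / y_train from earlier and an arbitrary classifier:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestClassifier(n_estimators=100), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
plt.plot(train_sizes, train_scores.mean(axis=1), label='train')
plt.plot(train_sizes, valid_scores.mean(axis=1), label='validation')
plt.legend()
plt.show()
# a large gap between the two curves suggests overfitting; two low, converging curves suggest underfitting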
Stacking: feed the outputs of the first-level classifiers as features into the next-level classifier; a neural network is essentially doing the same thing layer by layer.
Stacking (blending) script:
"""Kaggle competition: Predicting a Biological Response.
Blending {RandomForests, ExtraTrees, GradientBoosting} + stretching to
[0,1]. The blending scheme is related to the idea Jose H. Solorzano
presented here:
http://www.kaggle.com/c/bioresponse/forums/t/1889/question-about-the-process-of-ensemble-learning/10950#post10950
'''You can try this: In one of the 5 folds, train the models, then use
the results of the models as 'variables' in logistic regression over
the validation data of that fold'''. Or at least this is the
implementation of my understanding of that idea :-)
The predictions are saved in test.csv. The code below created my best
submission to the competition:
- public score (25%): 0.43464
- private score (75%): 0.37751
- final rank on the private leaderboard: 17th over 711 teams :-)
Note: if you increase the number of estimators of the classifiers,
e.g. n_estimators=1000, you get a better score/rank on the private
test set.
Copyright 2012, Emanuele Olivetti.
BSD license, 3 clauses.
"""
from __future__ import division
import numpy as np
import load_data
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


def logloss(attempt, actual, epsilon=1.0e-15):
    """Logloss, i.e. the score of the bioresponse competition."""
    attempt = np.clip(attempt, epsilon, 1.0 - epsilon)
    return - np.mean(actual * np.log(attempt) +
                     (1.0 - actual) * np.log(1.0 - attempt))


if __name__ == '__main__':

    np.random.seed(0)  # seed to shuffle the train set

    n_folds = 10
    verbose = True
    shuffle = False

    X, y, X_submission = load_data.load()

    if shuffle:
        idx = np.random.permutation(y.size)
        X = X[idx]
        y = y[idx]

    skf = list(StratifiedKFold(n_splits=n_folds).split(X, y))

    clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]

    print("Creating train and test sets for blending.")

    dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
    dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))

    for j, clf in enumerate(clfs):
        print(j, clf)
        dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))
        for i, (train, test) in enumerate(skf):
            print("Fold", i)
            X_tra