Classification of competition problems:
There is nothing new under the sun: most competition problems have appeared before, just with a different scenario and dataset.
Modeling and problem-solving workflow
Understand the scenario and the objective
Understand the evaluation metric
1. Data processing
Data cleaning
Check whether the data is balanced; e.g. for ad clicks, not clicking is by far the more likely outcome.
Remove outliers, otherwise they will hurt model performance.
Discard implausible samples (this needs human judgment based on common sense).
Consider dropping fields where most values are missing.
Data sampling
Up-sampling / down-sampling to keep the classes roughly balanced (a ratio around 1:3 is workable; 1:10 is too imbalanced); see the sketch below.
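A minimal down-sampling sketch with sklearn.utils.resample, assuming a pandas DataFrame df with a binary label column (both names are made up for illustration):
import pandas as pd
from sklearn.utils import resample

majority = df[df['label'] == 0]
minority = df[df['label'] == 1]
# down-sample the majority class to roughly 3x the minority size (the ~1:3 rule of thumb above)
majority_down = resample(majority, replace=False,
                         n_samples=3 * len(minority), random_state=42)
df_balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)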
Data normalization
The training, validation, and test sets must all use the normalization fitted on the training set.
from sklearn.preprocessing import StandardScaler

# x_train: [None, 28, 28]; StandardScaler expects 2-D input, so flatten before scaling
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train.reshape(-1, 1)).reshape(-1, 28, 28)
x_valid_scaled = scaler.transform(x_valid.reshape(-1, 1)).reshape(-1, 28, 28)  # reuse the scaler fitted on the training set
Preprocessing of text data (a quick sketch follows):
Text length at word granularity
Text length at character granularity
Word counts over the corpus
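A possible pandas sketch of these text statistics (toy English corpus; for Chinese, word granularity would additionally need a tokenizer such as jieba):
import pandas as pd
from collections import Counter

texts = pd.Series(["the quick brown fox", "jumps over the lazy dog"])
char_len = texts.str.len()                   # text length at character granularity
word_len = texts.str.split().str.len()       # text length at word granularity
vocab = Counter(w for t in texts for w in t.split())   # word counts over the corpus
print(char_len.describe(), word_len.describe(), len(vocab))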
2. Feature engineering
Time features can be turned into interval-type features or frequency-type features.
For example, days until Double Eleven (Singles' Day), days until the next holiday; or how many purchases per week (see the sketch below).
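A possible pandas sketch of both kinds of time feature; the orders DataFrame and its columns are invented for illustration:
import pandas as pd

orders = pd.DataFrame({
    'user_id': [1, 1, 2],
    'order_time': pd.to_datetime(['2021-11-01', '2021-11-08', '2021-11-05']),
})
# interval-type feature: days until Double Eleven (Nov 11)
orders['days_to_1111'] = (pd.Timestamp('2021-11-11') - orders['order_time']).dt.days
# frequency-type feature: purchases per user per week
weekly_counts = (orders.set_index('order_time')
                       .groupby('user_id')
                       .resample('W')
                       .size())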
Text features: extract n-grams, build a bag of words with term counts, or use TF-IDF to weight terms.
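For example, with sklearn's vectorizers (toy documents, not from the original notes):
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
bow = CountVectorizer(ngram_range=(1, 2))      # bag of words with unigram + bigram counts
counts = bow.fit_transform(docs)
tfidf = TfidfVectorizer(ngram_range=(1, 2))    # TF-IDF weights over the same n-grams
weights = tfidf.fit_transform(docs)
print(counts.shape, weights.shape)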
Statistical features describe where a value sits relative to the overall population.
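One way to express this with pandas (hypothetical income column): percentile rank and z-score both place a value relative to the whole population:
import pandas as pd

df = pd.DataFrame({'income': [3000, 5000, 8000, 12000, 40000]})
df['income_pct_rank'] = df['income'].rank(pct=True)                              # percentile rank in [0, 1]
df['income_zscore'] = (df['income'] - df['income'].mean()) / df['income'].std()  # distance from the mean in std units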
Feature extraction with sklearn (documentation):
6.3. Preprocessing data — scikit-learn 0.24.2 documentation
API Reference — scikit-learn 0.24.2 documentation
When the data volume is very large, do the scaling feature by feature (column by column).
The contribution of the features matters more than the choice of model.
Discretization
e.g. 1 if greater than some threshold, 0 if smaller
then one-hot encode the result
from sklearn import preprocessing
enc = preprocessing.OneHotEncoder()
enc.fit(data_used_to_fit_the_encoder)      # data used to fit the encoder
enc.transform(new_data).toarray()          # new data to encode

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
X = [['Male', 1], ['Female', 3], ['Female', 2]]   # training data
enc.fit(X)
enc.transform([['Female', 1], ['Male', 4]]).toarray()   # test data
Missing values: if only a few are missing, fill with the mode; if a moderate amount is missing, treat "missing" as its own category and one-hot it together with the other categories; if a very large fraction is missing, drop the feature entirely.
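A small pandas sketch of the three strategies (column names are invented):
import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['a', 'b', None, 'a', 'b'],
                   'age':  [21, np.nan, 35, 28, np.nan]})
df['age'] = df['age'].fillna(df['age'].mode()[0])          # few missing: fill with the mode
df['city'] = df['city'].fillna('missing')                  # moderate: treat "missing" as its own category
city_onehot = pd.get_dummies(df['city'], prefix='city')    # then one-hot together with the other categories
# very many missing: simply drop the column, e.g. df.drop(columns=['col'])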
If a feature has a long-tailed distribution with a huge spread, its contribution to the result is probably limited; if it is a continuous value, discretize it.
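For instance, quantile binning with pd.qcut is one way to discretize a long-tailed continuous feature (synthetic data for illustration):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
amount = pd.Series(rng.lognormal(mean=3, sigma=1.5, size=1000))     # long-tailed continuous feature
# equal-frequency buckets handle the long tail better than equal-width ones
amount_bin = pd.qcut(amount, q=10, labels=False, duplicates='drop')
amount_onehot = pd.get_dummies(amount_bin, prefix='amount_bin')     # then one-hot encode the bucket ids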
Feature selection: when feature transformations have generated a very large number of features, feature selection is needed (see the sketch after this list).
Recursive feature elimination (RFE): directly drop the features whose contribution to the final result has a small absolute value.
Feature selection using SelectFromModel: relies on the model's feature importances.
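A short sklearn sketch of both selectors on synthetic data (the estimators chosen here are only examples):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)
# recursive feature elimination: repeatedly drop the features with the smallest coefficients
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
# SelectFromModel: keep the features whose importance exceeds a threshold
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), threshold='median')
X_sfm = sfm.fit_transform(X, y)
print(X_rfe.shape, X_sfm.shape)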
Feature fusion (with gplearn's SymbolicTransformer):
from gplearn.genetic import SymbolicTransformer

# a typical function set; the original notes did not define it
function_set = ['add', 'sub', 'mul', 'div', 'log', 'sqrt', 'abs', 'neg', 'max', 'min']
gp1 = SymbolicTransformer(generations=1, population_size=1000,
                          hall_of_fame=600, n_components=100,
                          function_set=function_set,
                          parsimony_coefficient=0.0005,
                          max_samples=0.9, verbose=1,
                          random_state=0, n_jobs=3)
label = data['TARGET']                     # replace with your label column
train = data.drop(columns=['TARGET'])      # features to fuse
# train = data[['col1', 'col2']]           # or select specific columns instead
gp1.fit(train, label)
new_df2 = gp1.transform(train)
# visualize one of the fused features
from IPython.display import Image
import pydotplus
graph = gp1._best_programs[0].export_graphviz()
graph = pydotplus.graphviz.graph_from_dot_data(graph)
Image(graph.create_png())
3. Model selection
4. Finding the best hyperparameters: cross-validation
Cross-validation: average the score over multiple train/test splits
Used to decide which model is better
from sklearn import svm
from sklearn.model_selection import cross_val_score

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, X_train, y_train, cv=5)
scores
Grid search
Used to further pin down the model's hyperparameters
Parameter search (randomized search example):
import numpy as np
from scipy.stats import reciprocal
from sklearn.model_selection import RandomizedSearchCV

param_distribution = {
    "hidden_layers": [1, 2, 3, 4],
    "layer_size": np.arange(1, 100),
    "learning_rate": reciprocal(1e-4, 1e-2),
}
# sklearn_model is presumably a Keras model wrapped as an sklearn estimator (e.g. KerasRegressor),
# which is why fit() below accepts epochs / validation_data / callbacks
random_search_cv = RandomizedSearchCV(sklearn_model, param_distribution, n_iter=10, n_jobs=5)
random_search_cv.fit(x_train, y_train, epochs=100,
                     validation_data=(x_valid, y_valid), callbacks=callbacks)
print(random_search_cv.best_params_)     # best parameters found
print(random_search_cv.best_score_)      # best cross-validation score
print(random_search_cv.best_estimator_)  # best fitted model
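For the grid search mentioned above, GridSearchCV works the same way; a minimal sketch reusing the X_train / y_train from the cross-validation example, with an SVM picked only for illustration:
from sklearn import svm
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(svm.SVC(), param_grid, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_, grid_search.best_score_)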
5. Model analysis and model ensembling
Use learning curves to diagnose overfitting or underfitting.
To fix overfitting, increasing the amount of training data is the most effective approach; adding regularization may also help.
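A possible sketch with sklearn's learning_curve, assuming the X_train / y_train from earlier and an arbitrary classifier:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestClassifier(n_estimators=100), X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
plt.plot(train_sizes, train_scores.mean(axis=1), label='train')
plt.plot(train_sizes, valid_scores.mean(axis=1), label='validation')
plt.legend()
plt.show()
# a large gap between the two curves suggests overfitting; two low, converging curves suggest underfitting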
Stacking: feed the outputs of the first-level classifiers as features into the next-level classifier; a neural network is essentially doing the same thing layer by layer.
Stacking (blending) script:
"""Kaggle competition: Predicting a Biological Response.
Blending {RandomForests, ExtraTrees, GradientBoosting} + stretching to
[0,1]. The blending scheme is related to the idea Jose H. Solorzano
presented here:
http://www.kaggle.com/c/bioresponse/forums/t/1889/question-about-the-process-of-ensemble-learning/10950#post10950
'''You can try this: In one of the 5 folds, train the models, then use
the results of the models as 'variables' in logistic regression over
the validation data of that fold'''. Or at least this is the
implementation of my understanding of that idea :-)
The predictions are saved in test.csv. The code below created my best
submission to the competition:
- public score (25%): 0.43464
- private score (75%): 0.37751
- final rank on the private leaderboard: 17th over 711 teams :-)
Note: if you increase the number of estimators of the classifiers,
e.g. n_estimators=1000, you get a better score/rank on the private
test set.
Copyright 2012, Emanuele Olivetti.
BSD license, 3 clauses.
"""
from __future__ import division
import numpy as np
import load_data
from sklearn.model_selection import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression


def logloss(attempt, actual, epsilon=1.0e-15):
    """Logloss, i.e. the score of the bioresponse competition."""
    attempt = np.clip(attempt, epsilon, 1.0 - epsilon)
    return - np.mean(actual * np.log(attempt) +
                     (1.0 - actual) * np.log(1.0 - attempt))


if __name__ == '__main__':

    np.random.seed(0)  # seed to shuffle the train set

    n_folds = 10
    verbose = True
    shuffle = False

    X, y, X_submission = load_data.load()

    if shuffle:
        idx = np.random.permutation(y.size)
        X = X[idx]
        y = y[idx]

    skf = list(StratifiedKFold(n_splits=n_folds).split(X, y))

    clfs = [RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            RandomForestClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='gini'),
            ExtraTreesClassifier(n_estimators=100, n_jobs=-1, criterion='entropy'),
            GradientBoostingClassifier(learning_rate=0.05, subsample=0.5, max_depth=6, n_estimators=50)]

    print("Creating train and test sets for blending.")

    dataset_blend_train = np.zeros((X.shape[0], len(clfs)))
    dataset_blend_test = np.zeros((X_submission.shape[0], len(clfs)))

    for j, clf in enumerate(clfs):
        print(j, clf)
        dataset_blend_test_j = np.zeros((X_submission.shape[0], len(skf)))
        for i, (train, test) in enumerate(skf):
            print("Fold", i)
            X_tra