How to Ensemble Neural Network Models

This article explores the use of ensemble methods in deep learning, including strategies that vary the training data, the models, and the way predictions are combined in order to improve model stability and performance. It covers random splits, cross-validation, bagging ensembles, snapshot ensembles, and horizontal voting ensembles, with the goal of helping readers use ensemble learning to improve the generalization of deep learning models.


By Jason Brownlee in Deep Learning Performance

1. How to Ensemble Neural Network Models

The field of ensemble learning is well studied and there are many variations on this simple theme.

It can be helpful to think of varying each of the three major elements of the ensemble method; for example:

  • Training Data: Vary the choice of data used to train each model in the ensemble.
  • Ensemble Models: Vary the choice of the models used in the ensemble.
  • Combinations: Vary the choice of the way that outcomes from ensemble members are combined (a minimal model-averaging sketch follows this list).
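
To illustrate the "Combinations" element, the sketch below averages the predicted class probabilities of several already-fit models. The `members` list, the input `X`, and the function name are illustrative assumptions, not code from the original tutorials.

```python
# A minimal sketch of combining ensemble members by model averaging.
# Assumes `members` is a list of already-fit Keras classifiers that each
# output class probabilities for the same input X (names are illustrative).
import numpy as np

def ensemble_predict(members, X):
    # per-model predicted probabilities: shape (n_members, n_samples, n_classes)
    yhats = np.array([model.predict(X) for model in members])
    # average across members, then take the most likely class for each sample
    return np.argmax(np.mean(yhats, axis=0), axis=1)
```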

Varying Training Data

Ensemble Tutorials

For examples of deep learning ensembles that vary training data see:

Varying Models

Ensemble Tutorials

For examples of deep learning ensembles that vary models see:

Varying Combinations

Ensemble Tutorials

For examples of deep learning ensembles that vary combinations see:

Summary of Ensemble Techniques

In summary, we can list some of the more common and interesting ensemble methods for neural networks organized by each element of the method that can be varied, as follows:

  • Training Data: random-split, k-fold cross-validation, and bagging (bootstrap aggregation) ensembles.
  • Ensemble Models: snapshot ensembles and horizontal voting ensembles collected from a single training run.
  • Combinations: model averaging of the members' predictions, for example averaging predicted class probabilities.

There is no single best ensemble method; perhaps experiment with a few approaches or let the constraints of your project guide you.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Books

Papers

Articles

Summary

In this post, you discovered ensemble methods for deep learning neural networks to reduce variance and improve prediction performance.

Specifically, you learned:

  • Neural network models are nonlinear and have a high variance, which can be frustrating when preparing a final model for making predictions.
  • Ensemble learning combines the predictions from multiple neural network models to reduce the variance of predictions and reduce generalization error.
  • Techniques for ensemble learning can be grouped by the element that is varied, such as training data, the model, and how predictions are combined.

2. How to Develop a Random-Split, Cross-Validation, and Bagging Ensemble for Deep Learning

Combining the predictions from multiple models can result in more stable predictions, and in some cases, predictions that have better performance than any of the contributing models.

Effective ensembles require members that disagree. Each member must have skill (e.g. perform better than random chance), but ideally, perform well in different ways. Technically, we can say that we prefer ensemble members to have low correlation in their predictions, or prediction errors.

One approach to encourage differences between ensemble members is to use the same learning algorithm on different training datasets. This can be achieved by repeatedly resampling a training dataset that is in turn used to train a new model. Multiple models are fit using slightly different perspectives on the training data and, in turn, make different errors; when combined, they often produce more stable and better predictions.

We can refer to these methods generally as data resampling ensembles.

A benefit of this approach is that resampling methods may be used that do not make use of all examples in the training dataset. Any examples that are not used to fit the model can be used as a test dataset to estimate the generalization error of the chosen model configuration.

There are three popular resampling methods that we could use to create a resampling ensemble; they are:

  • Random Splits. The dataset is repeatedly sampled with a random split of the data into train and test sets.
  • k-fold Cross-Validation. The dataset is split into k equally sized folds; k models are trained, and each fold in turn serves as the holdout set while the model is trained on all remaining folds.
  • Bootstrap Aggregation. Random samples are collected with replacement and examples not included in a given sample are used as the test set.

Perhaps the most widely used resampling ensemble method is bootstrap aggregation, more commonly referred to as bagging. The resampling with replacement allows more difference in the training dataset, biasing the model and, in turn, resulting in more difference between the predictions of the resulting models.

The use of a resampling ensemble makes some specific assumptions about your project:

  • That a robust estimate of model performance on unseen data is required; if not, then a single train/test split can be used.
  • That there is a potential for a lift in performance using an ensemble of models; if not, then a single model fit on all available data can be used.
  • That the computational cost of fitting more than one neural network model on a sample of the training dataset is not prohibitive; if not, all resources should be put into fitting a single model.

Neural network models are remarkably flexible; therefore, the lift in performance provided by a resampling ensemble is not always possible, given that individual models trained on all available data can perform so well.

As such, the sweet spot for using a resampling ensemble is the case where there is a requirement for a robust estimate of performance and multiple models can be fit to calculate the estimate, but there is also a requirement for one (or more) of the models created during the estimate of performance to be used as the final model (e.g. a new final model cannot be fit on all available training data).

Random Splits Ensemble

The instability of a single neural network model, combined with a small test dataset, means that we don't really know how well the model will perform on new data in general.

We can try a simple resampling method: repeatedly generate new random splits of the dataset into train and test sets and fit a new model on each. Calculating the average performance of the model across the splits gives a better estimate of the model's generalization error.

We can then combine multiple models trained on the random splits with the expectation that performance of the ensemble is likely to be more stable and better than the average single model.
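
A minimal sketch of this procedure is shown below. It assumes a hypothetical fit_model(trainX, trainy) helper that builds and trains one Keras model compiled with accuracy as its only metric; the 10% test split size is also an illustrative choice.

```python
# A minimal random-splits ensemble sketch; fit_model() and the 10% test
# split size are illustrative assumptions, not part of the original text.
import numpy as np
from sklearn.model_selection import train_test_split

def random_split_ensemble(X, y, n_splits=10):
    members, scores = [], []
    for _ in range(n_splits):
        # new random split of the dataset into train and test sets
        trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.10)
        model = fit_model(trainX, trainy)  # hypothetical training helper
        # assumes the model is compiled with accuracy as its only metric
        _, acc = model.evaluate(testX, testy, verbose=0)
        members.append(model)
        scores.append(acc)
    # the mean score estimates generalization error; the members form the ensemble
    return members, np.mean(scores)
```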

Cross-Validation Ensemble

A problem with repeated random splits as a resampling method for estimating the average performance of a model is that it is optimistic.

An approach that is designed to be less optimistic, and is widely used as a result, is the k-fold cross-validation method.

The method is less biased because each example in the dataset is only used one time in the test dataset to estimate model performance, unlike random train-test splits where a given example may be used to evaluate a model many times.

The procedure has a single parameter called k that refers to the number of groups that a given data sample is to be split into. The average of the scores of each model provides a less biased estimate of model performance. A typical value for k is 10.

Because neural network models are computationally very expensive to train, it is common to use the best performing model during cross-validation as the final model.

Alternately, the resulting models from the cross-validation process can be combined to provide a cross-validation ensemble that is likely to have better performance on average than a given single model.
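
A minimal sketch of a cross-validation ensemble is given below, reusing the hypothetical fit_model() helper assumed in the random-splits sketch and scikit-learn's KFold class.

```python
# A minimal k-fold cross-validation ensemble sketch; fit_model() is the
# same hypothetical training helper assumed above.
from sklearn.model_selection import KFold

def cross_validation_ensemble(X, y, k=10):
    members, scores = [], []
    for train_ix, test_ix in KFold(n_splits=k, shuffle=True).split(X):
        # each fold is held out exactly once as the test set
        model = fit_model(X[train_ix], y[train_ix])
        _, acc = model.evaluate(X[test_ix], y[test_ix], verbose=0)
        members.append(model)
        scores.append(acc)
    # the k scores give a less biased performance estimate; the k models
    # can be combined into a cross-validation ensemble
    return members, scores
```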

Bagging Ensemble

A limitation of random splits and k-fold cross-validation from the perspective of ensemble learning is that the models are very similar.

The bootstrap method is a statistical technique for estimating quantities about a population by averaging estimates from multiple small data samples.

Importantly, samples are constructed by drawing observations from a large data sample one at a time and returning them to the data sample after they have been chosen. This allows a given observation to be included in a given small sample more than once. This approach to sampling is called sampling with replacement.

The method can be used to estimate the performance of neural network models. Examples not selected in a given sample can be used as a test set to estimate the performance of the model.

The bootstrap is a robust method for estimating model performance. It does suffer a little from an optimistic bias, but is often almost as accurate as k-fold cross-validation in practice.

The benefit for ensemble learning is that each data sample is biased, allowing a given example to appear many times in the sample. This, in turn, means that the models trained on those samples will be biased, importantly in different ways. The result can be ensemble predictions that are more accurate.

Generally, use of the bootstrap method in ensemble learning is referred to as bootstrap aggregation or bagging.

We can use the resample() function from scikit-learn to select a subsample with replacement. The function takes an array to subsample and the size of the resample as arguments.
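
For example, one bagging iteration could be sketched as below; working with row indices and treating the never-selected (out-of-bag) rows as a test set is one possible approach under these assumptions, not the only way to use resample().

```python
# A minimal sketch of one bagging iteration: sample row indices with
# replacement, and use the never-selected (out-of-bag) rows as a test set.
import numpy as np
from sklearn.utils import resample

def bootstrap_split(X, y):
    ix = np.arange(len(X))
    # bootstrap sample of indices, same size as the original dataset
    train_ix = resample(ix, replace=True, n_samples=len(X))
    # out-of-bag indices: rows that were never drawn into the sample
    oob_ix = np.array([i for i in ix if i not in set(train_ix)])
    return X[train_ix], y[train_ix], X[oob_ix], y[oob_ix]
```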

3. How to Develop a Snapshot Ensemble Deep Learning Neural Network

A problem with ensemble learning with deep learning methods is the large computational cost of training multiple models.

This is because of the use of very deep models and very large datasets that can result in model training times that may extend to days, weeks, or even months.

"Despite its obvious advantages, the use of ensembling for deep networks is not nearly as widespread as it is for other algorithms. One likely reason for this lack of adaptation may be the cost of learning multiple neural networks. Training deep networks can last for weeks, even on high performance hardware with GPU acceleration."

Snapshot Ensembles: Train 1, get M for free, 2017.

One approach to ensemble learning for deep learning neural networks is to collect multiple models from a single training run. This addresses the computational cost of training multiple deep learning models as models can be selected and saved during training, then used to make an ensemble prediction.

A key benefit of ensemble learning is in improved performance compared to the predictions from single models. This can be achieved through the selection of members that have good skill, but in different ways, providing a diverse set of predictions to be combined. A limitation of collecting multiple models during a single training run is that the models may be good, but too similar.

This can be addressed by changing the learning algorithm for the deep neural network to force the exploration of different network weights during a single training run that will result, in turn, with models that have differing performance. One way that this can be achieved is by aggressively changing the learning rate used during training.

An approach to systematically and aggressively changing the learning rate during training to result in very different network weights is referred to as "Stochastic Gradient Descent with Warm Restarts" or SGDR for short, described by Ilya Loshchilov and Frank Hutter in their 2017 paper "SGDR: Stochastic Gradient Descent with Warm Restarts."

Their approach involves systematically changing the learning rate over training epochs, called cosine annealing. This approach requires the specification of two hyperparameters: the initial learning rate and the total number of training epochs.

The "cosine annealing" method has the effect of starting with a large learning rate that is relatively rapidly decreased to a minimum value before being dramatically increased again. The model weights are subjected to the dramatic changes during training, having the effect of using "good weights" as the starting point for the subsequent learning rate cycle, but allowing the learning algorithm to converge to a different solution.

The resetting of the learning rate acts like a simulated restart of the learning process and the re-use of good weights as the starting point of the restart is referred to as a "warm restart," in contrast to a "cold restart" where a new set of small random numbers may be used as a starting point.

The "good weights" at the bottom of each cycle can be saved to file, providing a snapshot of the model. These snapshots can be collected together at the end of the run and used in a model averaging ensemble. The saving and use of these models during an aggressive learning rate schedule is referred to as a "Snapshot Ensemble" and was described by Gao Huang, et al. in their 2017 paper titled "Snapshot Ensembles: Train 1, get M for free" and subsequently also used in an updated version of the Loshchilov and Hutter paper.

"… we let SGD converge M times to local minima along its optimization path. Each time the model converges, we save the weights and add the corresponding network to our ensemble. We then restart the optimization with a large learning rate to escape the current local minimum."

Snapshot Ensembles: Train 1, get M for free, 2017.

The ensemble of models is created during the course of training a single model, therefore, the authors claim that the ensemble forecast is provided at no additional cost.

"[the approach allows] learning an ensemble of multiple neural networks without incurring any additional training costs."

Snapshot Ensembles: Train 1, get M for free, 2017.

Although a cosine annealing schedule is used for the learning rate, other aggressive learning rate schedules could be used, such as the simpler cyclical learning rate schedule described by Leslie Smith in the 2017 paper titled "Cyclical Learning Rates for Training Neural Networks."

4. How to Develop a Horizontal Voting Deep Learning Ensemble

A challenge with ensemble learning for deep learning methods is that, given the use of very large datasets and models, a given training run may take days, weeks, or even months. Training multiple models may not be feasible.

An alternative source of models that may contribute to an ensemble is the state of a single model at different points during training.

Horizontal voting is an ensemble method proposed by Jingjing Xie, et al. in their 2013 paper "Horizontal and Vertical Ensemble with Deep Representation for Classification."

The method involves using multiple models from the end of a contiguous block of epochs before the end of training in an ensemble to make predictions.

The approach was developed specifically for those predictive modeling problems where the training dataset is relatively small compared to the number of predictions required by the model. This results in a model that has a high variance in performance during training. In this situation, using the final model or any given model toward the end of the training process is risky given the variance in performance.

"… the error rate of classification would first decline and then tend to be stable with the training epoch grows. But when size of labeled training set is too small, the error rate would oscillate […] So it is difficult to choose a "magic" epoch to obtain a reliable output."

Horizontal and Vertical Ensemble with Deep Representation for Classification, 2013.

Instead, the authors suggest using all of the models in an ensemble from a contiguous block of epochs during training, such as the models from the last 200 epochs. The result is predictions by the ensemble that are as good as or better than those of any single model in the ensemble.

"To reduce the instability, we put forward a method called Horizontal Voting. First, networks trained for a relatively stable range of epoch are selected. The predictions of the probability of each label are produced by standard classifiers with top level representation of the selected epoch, and then averaged."

Horizontal and Vertical Ensemble with Deep Representation for Classification, 2013.

As such, the horizontal voting ensemble is well suited both to cases where a given model requires vast computational resources to train and to cases where final model selection is challenging given the high variance of training caused by the use of a relatively small training dataset.
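
A minimal sketch of making a horizontal voting prediction is given below. It assumes one model per epoch was already saved during training (for example with a ModelCheckpoint callback) to files named 'model_<epoch>.h5'; the filename convention and function name are illustrative assumptions.

```python
# A minimal horizontal voting sketch: load the snapshots from the last
# contiguous block of epochs and average their predicted probabilities.
# The 'model_<epoch>.h5' filenames are an assumed convention.
import numpy as np
from tensorflow.keras.models import load_model

def horizontal_voting_predict(last_epoch, n_members, X):
    epochs = range(last_epoch - n_members + 1, last_epoch + 1)
    members = [load_model('model_%d.h5' % e) for e in epochs]
    # average class probabilities across the block, then pick the argmax
    yhats = np.array([m.predict(X) for m in members])
    return np.argmax(np.mean(yhats, axis=0), axis=1)
```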

Knowledge Distillation Overview in Machine Learning

Knowledge distillation is an approach in which a smaller model, known as the student network, learns from a larger and more complex model or an ensemble of models, referred to as the teacher network. The goal is not only to transfer knowledge but also to achieve good performance at a lower computational cost. In deep learning applications such as computer vision, the technique is widely used to compress large neural networks while maintaining high accuracy [^1].

The student is trained on both the hard labels from the original dataset and soft targets produced by applying temperature scaling to the teacher's logits during forward passes. This allows the student to mimic the teacher's behavior without requiring access to the teacher when deployed independently at inference time.

Lifelong learning systems can also benefit from distillation-style training, since they aim for continuous adaptation over extended periods rather than the single-task optimization typical of traditional supervised settings [^2]. Recent work has explored self-generated, context-based methods that use autoregressive language models to generate demonstrations autonomously, opening possibilities beyond classification tasks [^3]. Explanation properties such as generalizability help ensure that distilled knowledge remains applicable beyond the specific instances seen during training, which contributes to robust behavior on unseen data after deployment [^4].

```python
import torch.nn.functional as F

def distill_loss(student_output, teacher_output, target, T=5):
    """Loss combining cross entropy on hard labels with a temperature-scaled
    KL-divergence term between student and teacher predictions."""
    # standard cross entropy between student logits and the true labels
    ce_loss = F.cross_entropy(student_output, target)
    # soften both sets of logits with temperature T before computing KL divergence
    kd_loss = F.kl_div(
        F.log_softmax(student_output / T, dim=1),
        F.softmax(teacher_output / T, dim=1),
        reduction='batchmean'
    ) * (T ** 2)
    return ce_loss + kd_loss
```