Bagging and Other Ensemble Methods
Bagging (short for bootstrap aggregating) is a technique for reducing generalization error by combining several models.
The idea is to train several different models separately, then have all of the models vote on the output for test examples.
This is an example of a general strategy in machine learning called model averaging. Techniques employing this strategy are known as ensemble methods.
The reason that model averaging works is that different models will usually not make all the same errors on the test set.
Consider for example a set of $k$ regression models. Suppose that each model makes an error $\epsilon_i$ on each example, with the errors drawn from a zero-mean multivariate normal distribution with variances $\mathbb{E}[\epsilon_i^2] = v$ and covariances $\mathbb{E}[\epsilon_i \epsilon_j] = c$. Then the error made by the average prediction of all the ensemble models is $\frac{1}{k}\sum_i \epsilon_i$. The expected squared error of the ensemble predictor is
$$
\mathbb{E}\!\left[\left(\frac{1}{k}\sum_i \epsilon_i\right)^{\!2}\right]
= \frac{1}{k^2}\,\mathbb{E}\!\left[\sum_i \left(\epsilon_i^2 + \sum_{j \neq i} \epsilon_i \epsilon_j\right)\right]
= \frac{1}{k}v + \frac{k-1}{k}c.
$$
- In the case where the errors are perfectly correlated and $c = v$, the mean squared error reduces to $v$, so the model averaging does not help at all.
- In the case where the errors are perfectly uncorrelated and $c = 0$, the expected squared error of the ensemble is only $\frac{1}{k}v$. This means that the expected squared error of the ensemble is inversely proportional to the ensemble size.
- In other words, on average, the ensemble will perform at least as well as any of its members, and if the members make independent errors, the ensemble will perform significantly better than its members.
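The variance formula above is easy to check numerically. The following is a minimal NumPy sketch, not part of the original text; the ensemble size $k$, variance $v$, and covariance $c$ are arbitrary illustrative values. It draws correlated zero-mean errors, averages them across the ensemble, and compares the empirical mean squared error of the averaged prediction with $\frac{1}{k}v + \frac{k-1}{k}c$.

```python
import numpy as np

rng = np.random.default_rng(0)

k = 10      # ensemble size (illustrative choice)
v = 1.0     # variance of each model's error
c = 0.3     # covariance between the errors of any two models

# Covariance matrix with v on the diagonal and c everywhere else.
cov = np.full((k, k), c)
np.fill_diagonal(cov, v)

# Correlated zero-mean errors for many test examples, averaged across the ensemble.
errors = rng.multivariate_normal(mean=np.zeros(k), cov=cov, size=1_000_000)
ensemble_error = errors.mean(axis=1)

empirical = np.mean(ensemble_error ** 2)
predicted = v / k + (k - 1) / k * c

print(f"empirical mean squared ensemble error: {empirical:.4f}")
print(f"predicted (1/k)v + ((k-1)/k)c:         {predicted:.4f}")
```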
Different ensemble methods construct the ensemble of models in different ways. For example, each member of the ensemble could be formed by training a completely different kind of model using a different algorithm or objective function.
Bagging is a method that allows the same kind of model, training algorithm and objective function to be reused several times.
Specifically, bagging involves constructing k different datasets. Each dataset has the same number of examples as the original dataset, but each dataset is constructed by sampling with replacement from the original dataset. This means that, with high probability, each dataset is missing some of the examples from the original dataset and also contains several duplicate examples. Model i is then trained on dataset i. The differences between which examples are included in each dataset result in differences between the trained models.
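As a concrete illustration, here is a minimal sketch of this procedure. It assumes scikit-learn decision trees as the base model; the choice of base learner, the value $k = 25$, and the synthetic data are illustrative assumptions rather than anything prescribed by the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, k, rng):
    """Train k base models, each on a bootstrap resample of (X, y)."""
    n = len(X)
    models = []
    for _ in range(k):
        # Sample n indices with replacement: some examples repeat, others are left out.
        idx = rng.integers(0, n, size=n)
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """The ensemble prediction is the average of the members' predictions."""
    return np.mean([m.predict(X) for m in models], axis=0)

# Toy usage on synthetic regression data.
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

ensemble = bagging_fit(X, y, k=25, rng=rng)
print(bagging_predict(ensemble, X[:5]))
```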
- Neural networks reach a wide enough variety of solution points that they can often benefit from model averaging even if all of the models are trained on the same dataset.
- Differences in random initialization, random selection of minibatches, differences in hyperparameters, or different outcomes of non-deterministic implementations of neural networks are often enough to cause different members of the ensemble to make partially independent errors.
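A minimal sketch of this effect, assuming scikit-learn's MLPRegressor as the neural network and arbitrary synthetic data: every member is trained on the same dataset with the same architecture and objective, and only the random seed (hence the initialization and the minibatch order) differs. Comparing each member's held-out error with that of the averaged prediction typically shows the ensemble doing at least as well as a typical member.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Same data, same architecture, same objective: only the seed differs, which
# changes the random initialization and the order of the minibatches.
models = [
    MLPRegressor(hidden_layer_sizes=(32,), max_iter=3000, random_state=seed).fit(X_train, y_train)
    for seed in range(5)
]

member_preds = np.array([m.predict(X_test) for m in models])   # shape (5, n_test)
ensemble_pred = member_preds.mean(axis=0)

member_mse = ((member_preds - y_test) ** 2).mean(axis=1)
ensemble_mse = ((ensemble_pred - y_test) ** 2).mean()
print("member test MSEs:  ", np.round(member_mse, 4))
print("ensemble test MSE: ", round(ensemble_mse, 4))
```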
Model averaging is an extremely powerful and reliable method for reducing generalization error.