The main idea is to extract a more abstract representation from each local receptive field.
“As in CNN, filters from higher layers map to larger regions in original input. It generates a higher level concept by combining the lower level concepts from the layer below. Therefore, we argue that it would be beneficial to do a better abstraction on each local patch, before combining them into higher level concepts."
The method is to replace the GLM (generalized linear model) with an MLP (multilayer perceptron). Conventional CNNs use linear filters followed by a nonlinear activation function, but this structure imposes a prior on the input data. In this article, the authors instead use a multilayer perceptron to extract features from each local receptive field.
“Maxout Network imposes the prior that instances of a latent concept lie within a convex set in the input space.”
“Given no priors about the distributions of the latent concepts, it is desirable to use a universal function approximator for feature extraction of the local patches, as it is capable of approximating more abstract representations of the latent concepts.”
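To make this concrete, below is a minimal sketch of an mlpconv-style block in PyTorch. The paper notes that sliding an MLP over all local patches is equivalent to a conventional convolution followed by 1×1 convolutions, which is how it is written here; the class name `MLPConv` and the channel sizes are illustrative choices, not from the paper.

```python
import torch.nn as nn

class MLPConv(nn.Module):
    """One mlpconv block: a conventional convolution followed by two
    1x1 convolutions, which act as a small MLP applied independently
    at every spatial position of the feature map."""
    def __init__(self, in_channels, mid_channels, out_channels,
                 kernel_size, stride=1, padding=0):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size, stride, padding),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=1),  # MLP layer 1
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1),  # MLP layer 2
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```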
Another trick the article uses to prevent overfitting is global average pooling in place of the fully connected layers.
“One advantage of global average pooling over the fully connected layers is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.”
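A sketch of the global-average-pooling head in the same spirit (again PyTorch, with illustrative names): the last mlpconv stage outputs one feature map per category, each map is averaged to a single confidence value, and the resulting vector is fed to softmax, with no fully connected parameters involved.

```python
import torch.nn as nn

class GAPHead(nn.Module):
    """Classification head using global average pooling instead of
    fully connected layers: one feature map per class, averaged
    spatially into per-class confidence scores."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        # Final 1x1 convolution maps the features to `num_classes` maps.
        self.to_class_maps = nn.Conv2d(in_channels, num_classes, kernel_size=1)
        self.gap = nn.AdaptiveAvgPool2d(1)  # global average pooling, no parameters

    def forward(self, x):
        maps = self.to_class_maps(x)        # (N, num_classes, H, W)
        logits = self.gap(maps).flatten(1)  # (N, num_classes)
        return logits                       # apply softmax for probabilities
```

Because each class is tied to exactly one feature map before pooling, the maps themselves can be read as the confidence maps mentioned in the quote above.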