Shake-Shake regularization

Xavier Gastaldi
DOI: https://doi.org/10.48550/arXiv.1705.07485
2017-05-23
Abstract: The method introduced in this paper aims at helping deep learning practitioners faced with an overfit problem. The idea is to replace, in a multi-branch network, the standard summation of parallel branches with a stochastic affine combination. Applied to 3-branch residual networks, shake-shake regularization improves on the best single shot published results on CIFAR-10 and CIFAR-100 by reaching test errors of 2.86% and 15.85%. Experiments on architectures without skip connections or Batch Normalization show encouraging results and open the door to a large set of applications. Code is available at https://github.com/xgastaldi/shake-shake
Machine Learning, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the overfitting of deep learning models on small datasets. Specifically, the author proposes a method named Shake-Shake regularization, which aims to improve generalization by replacing the standard summation of parallel branches in a multi-branch network with a stochastic affine combination.

### Problem Background

1. **Successes and Challenges of Deep Residual Networks (ResNets)**
   - ResNets have achieved remarkable success in image recognition tasks, for example in the ImageNet competition.
   - Although ResNets are powerful, they are still prone to overfitting on small datasets.
2. **Limitations of Existing Regularization Methods**
   - Existing regularization methods such as weight decay, early stopping, and Dropout alleviate overfitting to a certain extent.
   - Batch Normalization and Stochastic Gradient Descent (SGD) also have a regularizing effect, but it depends on the mini-batch size and on gradient noise.
3. **Exploring New Regularization Methods**
   - Researchers have begun to explore regularization methods specific to multi-branch networks, such as randomly dropping some of the information paths (drop-path).
   - Shake-Shake regularization builds on this idea and aims to improve generalization by introducing randomness into how branches are combined.

### Core Idea of Shake-Shake Regularization

The main idea of Shake-Shake regularization is to combine the parallel branches of a multi-branch network with random weights during training, instead of simply adding them together. The formulas are as follows:

- Let \(x_i\) be the tensor input to the \(i\)-th residual block.
- \(W_i^{(1)}\) and \(W_i^{(2)}\) are the sets of weights of the two residual branches, respectively.
- \(F\) represents the residual function, such as two 3x3 convolutional layers.
- The standard forward-propagation formula is:
  \[ x_{i+1} = x_i + F(x_i, W_i^{(1)}) + F(x_i, W_i^{(2)}) \]
- With Shake-Shake regularization, the forward-propagation formula becomes:
  \[ x_{i+1} = x_i + \alpha_i F(x_i, W_i^{(1)}) + (1 - \alpha_i) F(x_i, W_i^{(2)}) \]
  where \(\alpha_i\) is a random variable sampled from the uniform distribution \(U(0, 1)\).

During testing, all \(\alpha_i\) are set to their expected value of 0.5.

### Experimental Results

Experiments show that Shake-Shake regularization improves model performance on the CIFAR-10 and CIFAR-100 datasets, reaching the best published single-shot results at the time. Specifically:

- On CIFAR-10, a test error of 2.86% is achieved.
- On CIFAR-100, a test error of 15.85% is achieved.

In addition, the method shows encouraging results on architectures without skip connections or Batch Normalization, which suggests that it could be applied more widely.

### Summary

This paper tackles the overfitting of deep learning models on small datasets by introducing Shake-Shake regularization, and it reports strong results on the CIFAR-10 and CIFAR-100 benchmarks. The method is not limited to ResNets and may also extend to other types of neural network architectures.
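
As an illustration of the forward-pass rule given under "Core Idea of Shake-Shake Regularization", the following is a minimal pure-Python sketch, not the author's implementation. The function name `shake_shake_forward` and the two toy branches are hypothetical stand-ins for \(F(x_i, W_i^{(1)})\) and \(F(x_i, W_i^{(2)})\); a real branch would be a small convolutional stack. The sketch covers only the forward combination described here and does not model gradient handling.

```python
import random


def shake_shake_forward(x, branch1, branch2, training, rng=random):
    """Combine two residual branches for one block (forward pass only).

    x        : list of floats standing in for the block input x_i
    branch1  : callable standing in for F(x_i, W_i^(1))
    branch2  : callable standing in for F(x_i, W_i^(2))
    training : if True, sample alpha ~ U(0, 1); otherwise use alpha = 0.5
    rng      : object with a .random() method (assumption: the random module
               or a random.Random instance)
    """
    # One alpha per call; at test time alpha is fixed to its expected value.
    alpha = rng.random() if training else 0.5
    f1 = branch1(x)
    f2 = branch2(x)
    # x_{i+1} = x_i + alpha * F1 + (1 - alpha) * F2, applied element-wise.
    return [xi + alpha * a + (1.0 - alpha) * b for xi, a, b in zip(x, f1, f2)]


# Hypothetical toy branches used only to exercise the combination rule.
def toy_branch_a(x):
    return [2.0 * v for v in x]


def toy_branch_b(x):
    return [-1.0 * v for v in x]


x = [1.0, 2.0, 3.0]
test_out = shake_shake_forward(x, toy_branch_a, toy_branch_b, training=False)
train_out = shake_shake_forward(
    x, toy_branch_a, toy_branch_b, training=True, rng=random.Random(0)
)
print(test_out)   # deterministic: uses alpha = 0.5
print(train_out)  # depends on the sampled alpha
```

Because \(\alpha_i\) and \(1 - \alpha_i\) sum to one, each training-time output is a convex combination of the two branch outputs added to the skip connection, and the test-time output with \(\alpha_i = 0.5\) is the average of the two branches added to \(x_i\).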