Abstract: The method introduced in this paper aims at helping deep learning practitioners faced with an overfit problem. The idea is to replace, in a multi-branch network, the standard summation of parallel branches with a stochastic affine combination. Applied to 3-branch residual networks, shake-shake regularization improves on the best single shot published results on CIFAR-10 and CIFAR-100 by reaching test errors of 2.86% and 15.85%. Experiments on architectures without skip connections or Batch Normalization show encouraging results and open the door to a large set of applications. Code is available at https://github.com/xgastaldi/shake-shake

What problem does this paper attempt to address?

This paper addresses the overfitting of deep learning models on small-scale datasets. The author proposes Shake-Shake regularization, which aims to improve generalization by replacing the standard summation of parallel branches in a multi-branch network with a stochastic affine combination.

### Problem Background

1. **Successes and Challenges of Deep Residual Networks (ResNets)**
   - ResNets have achieved remarkable success in image recognition tasks, including the ImageNet competition.
   - Although ResNets are powerful, they are still prone to overfitting on small-scale datasets.
2. **Limitations of Existing Regularization Methods**
   - Existing regularization methods such as weight decay, early stopping, and Dropout alleviate overfitting to a certain extent.
   - Batch Normalization and Stochastic Gradient Descent (SGD) also have a regularizing effect, but its strength depends on the mini-batch size and on gradient noise.
3. **Exploring New Regularization Methods**
   - Researchers have begun to explore regularization methods specific to multi-branch networks, such as randomly dropping some information paths (drop-path).
   - Shake-Shake regularization builds on this idea and aims to improve generalization by introducing randomness into how the branches are combined.

### Core Idea of Shake-Shake Regularization

During training, Shake-Shake regularization combines the parallel branches of a multi-branch network using random coefficients instead of simply adding them together. The formulas are as follows:

- Let \(x_i\) be the tensor input to the \(i\)-th residual block.
- \(W_i^{(1)}\) and \(W_i^{(2)}\) are the weights of the two residual branches of that block.
- \(F\) is the residual function, for example two 3x3 convolutional layers.
- The standard forward-propagation formula is:
  \[ x_{i+1} = x_i + F(x_i, W_i^{(1)}) + F(x_i, W_i^{(2)}) \]
- With Shake-Shake regularization, the forward-propagation formula becomes:
  \[ x_{i+1} = x_i + \alpha_i F(x_i, W_i^{(1)}) + (1 - \alpha_i) F(x_i, W_i^{(2)}) \]
  where \(\alpha_i\) is a random variable sampled from the uniform distribution \(U(0, 1)\). At test time, every \(\alpha_i\) is set to its expected value of 0.5.

### Experimental Results

According to the reported experiments, Shake-Shake regularization improves on the best single-shot results published at the time for CIFAR-10 and CIFAR-100:

- On CIFAR-10, a test error of 2.86% is achieved.
- On CIFAR-100, a test error of 15.85% is achieved.

The method also shows encouraging results on architectures without skip connections or Batch Normalization, which suggests it could be applied more widely.

### Summary

The paper targets overfitting of deep learning models on small-scale datasets by introducing Shake-Shake regularization, and it reports state-of-the-art single-shot test errors on CIFAR-10 and CIFAR-100. The method is demonstrated on ResNets, and the results on other architectures suggest it may extend to other types of neural networks.
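The forward rule given under the core idea can be illustrated with a minimal sketch. This is not the author's implementation (see the linked repository for that). It assumes plain Python lists in place of tensors, and `residual_branch` is a hypothetical stand-in for \(F\) (an elementwise scale followed by ReLU rather than two 3x3 convolutions). The names `shake_shake_block`, `w1`, and `w2` are also illustrative. The sketch covers only the forward combination as summarized here.

```python
import random


def residual_branch(x, w):
    """Hypothetical stand-in for F(x, W): elementwise scale, then ReLU."""
    return [max(w * v, 0.0) for v in x]


def shake_shake_block(x, w1, w2, training=True, rng=random):
    """Return x + alpha * F(x, w1) + (1 - alpha) * F(x, w2)."""
    # alpha ~ U(0, 1) during training; its expected value 0.5 at test time.
    alpha = rng.uniform(0.0, 1.0) if training else 0.5
    f1 = residual_branch(x, w1)
    f2 = residual_branch(x, w2)
    return [v + alpha * a + (1.0 - alpha) * b for v, a, b in zip(x, f1, f2)]


rng = random.Random(0)  # seeded generator so the training draw is repeatable
x = [1.0, -2.0, 3.0]

train_out = shake_shake_block(x, w1=0.5, w2=-0.25, training=True, rng=rng)
test_out = shake_shake_block(x, w1=0.5, w2=-0.25, training=False)
```

With \(\alpha_i = 0.5\) the test-time output is the input plus the plain average of the two branch outputs. During training, each element of the output lies between the two single-branch results, because the coefficients are non-negative and sum to 1.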
