Weights Augmentation: it has never ever ever ever let her model down

Junbin Zhuang, Guiguang Din, Yunyi Yan
2024-05-30
Abstract: Weights play an essential role in deep learning network models. Unlike network structure design, this article proposes the concept of weight augmentation, focusing on weight exploration. The core of the Weight Augmentation Strategy (WAS) is to train networks with randomly transformed weight coefficients, named Shadow Weights (SW), which are used to calculate the loss function and thereby affect parameter updates. Stochastic gradient descent, however, is applied to the Plain Weight (PW), which is the original weight of the network before the random transformation. During training, numerous SW collectively form a high-dimensional space, while PW is learned directly from the distribution of SW instead of from the data. The weight of the accuracy-oriented mode (AOM) relies on PW, which guarantees that the network is highly robust and accurate. The weight of the desire-oriented mode (DOM) uses SW, which is determined by the network model's unique functions based on WAT's performance desires, such as lower computational complexity or lower sensitivity to particular data. The two modes can be switched at any time if needed. WAT extends the augmentation technique from data augmentation to weights; it is easy to understand and implement, yet it can markedly improve almost all networks. Our experimental results show that convolutional neural networks, such as VGG-16, ResNet-18, ResNet-34, GoogleNet, MobileNetV2, and EfficientNet-Lite, can benefit considerably at little or no cost. On the CIFAR100 and CIFAR10 datasets, model accuracy increases by 7.32% and 9.28%, respectively, with the highest values being 13.42% and 18.93%, respectively. In addition, DOM can reduce floating point operations (FLOPs) by up to 36.33%. The code is available at https://github.com/zlearh/Weight-Augmentation-Technology.
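The training scheme described in the abstract (loss computed on SW, gradient step applied to PW) can be illustrated with a small NumPy sketch. This is only a toy under stated assumptions, not the authors' implementation: the model is a linear regressor with a mean squared error loss, and the random transformation is replaced by a hypothetical elementwise random scaling. The names `random_transform`, `was_step`, and `strength` are invented for illustration.

```python
import numpy as np

def random_transform(w_pw, rng, strength=0.1):
    """Hypothetical stand-in for the random weight transformation.

    Returns the shadow weight (SW) and the scale used to build it,
    because the chain rule needs the scale to map gradients back to PW.
    """
    scale = 1.0 + strength * rng.standard_normal(w_pw.shape)
    return w_pw * scale, scale

def mse_loss_and_grad(w, x, y):
    """Mean squared error of the linear model x @ w and its gradient w.r.t. w."""
    residual = x @ w - y
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * x.T @ residual / len(y)
    return loss, grad

def was_step(w_pw, x, y, rng, lr=0.05):
    """One update: the loss is evaluated on SW, the gradient step lands on PW."""
    w_sw, scale = random_transform(w_pw, rng)
    loss, grad_sw = mse_loss_and_grad(w_sw, x, y)
    grad_pw = grad_sw * scale  # chain rule: d(w_sw)/d(w_pw) = scale
    return w_pw - lr * grad_pw, loss

# Toy data (assumption): noiseless linear targets.
rng = np.random.default_rng(0)
x = rng.standard_normal((256, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = x @ w_true

w_pw = np.zeros(4)
for _ in range(500):
    w_pw, _ = was_step(w_pw, x, y, rng)

# AOM: inference with the plain weight.
aom_loss, _ = mse_loss_and_grad(w_pw, x, y)
# DOM: inference with one freshly sampled shadow weight.
w_sw, _ = random_transform(w_pw, rng)
dom_loss, _ = mse_loss_and_grad(w_sw, x, y)
```

In this toy setting PW never sees the loss directly; it only receives gradients routed through sampled SW, which is the separation the abstract describes. How DOM selects a particular SW for a given goal (for example fewer FLOPs) is not specified in the abstract, so the sketch simply samples one.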
Machine Learning
What problem does this paper attempt to address?
The core problem this paper attempts to solve is how to improve the robustness and accuracy of deep-learning models while reducing computational complexity. Specifically, the authors propose a new training strategy, the Weight Augmentation Strategy (WAS), to improve model performance across different tasks.

### Main problems and solutions in the paper

1. **Problem description**:
   - The performance of deep-learning models depends on the optimization of weights. Traditional methods learn weights directly from data, which makes them prone to overfitting and sensitive to specific data.
   - Although existing data augmentation techniques improve the generalization ability of models, they mainly focus on transforming the input data and ignore the exploration of the weight space.
2. **Solutions**:
   - **Weight Augmentation Strategy (WAS)**: Introduce randomly transformed weights (Shadow Weight, SW) to influence the loss function, thereby indirectly updating the original weights (Plain Weight, PW). SW is used to calculate the loss function, and PW is updated through Stochastic Gradient Descent (SGD).
   - **Dual-mode inference**: Propose two working modes: Accuracy-Oriented Mode (AOM) and Desire-Oriented Mode (DOM). AOM uses PW for inference to ensure high model accuracy; DOM uses SW and can adjust model characteristics according to task requirements, such as reducing computational complexity or reducing sensitivity to specific data.

### Formula explanation

- **Weight distribution mapping**:
  \[ \Psi : \mathbb{R}^m \rightarrow \mathbb{R}^n \]
  Here, \(\Psi\) represents the mapping from the feature space to the weight space, \(\mathbb{R}^m\) is the feature vector space, and \(\mathbb{R}^n\) is the weight vector space.
- **Parameter update formula**:
  \[ \theta_j = \theta_j - \alpha \cdot \frac{\partial J(\theta)}{\partial \theta_j} \]
  \[ \theta_{pw,j} = \theta_{pw,j} - \alpha \cdot \frac{\partial J(\theta_{sw,j})}{\partial \theta_{pw,j}} \]
  Here, \(\theta_{sw,j}\) and \(\theta_{pw,j}\) represent the \(j\)-th SW and PW parameters respectively, \(\alpha\) is the learning rate, and \(J(\theta)\) is the loss function.
- **Activation function application**:
  \[ h(x) = \text{ReLU}\left( W'x \right) = \text{ReLU}\left( \sum_{i = 1}^{k} T_i W_i x_i + b \right) \]
  Introducing randomness:
  \[ h(x) = \text{ReLU}\left( \sum_{i = 1}^{k} \gamma(T_i) W_i x_i + b \right) \]
  Here, \(\gamma\) represents the random transformation function.

### Experimental results

The experimental results show that on the CIFAR-100 and CIFAR-10 datasets, models trained with WAS gain 7.32% and 9.28% in accuracy respectively across multiple network architectures (such as VGG-16, ResNet-18, ResNet-34, GoogleNet, MobileNetV2, and EfficientNet-Lite), with the highest values being 13.42% and 18.93% respectively. In addition, the DOM mode can reduce the number of floating-point operations (FLOPs) by as much as 36.33%.

### Summary