Robust Training of Neural Networks at Arbitrary Precision and Sparsity

Chengxi Ye, Grace Chu, Yanfeng Liu, Yichi Zhang, Lukasz Lew, Andrew Howard
2024-09-14
Abstract: The discontinuous operations inherent in quantization and sparsification introduce obstacles to backpropagation. This is particularly challenging when training deep neural networks in ultra-low precision and sparse regimes. We propose a novel, robust, and universal solution: a denoising affine transform that stabilizes training under these challenging conditions. By formulating quantization and sparsification as perturbations during training, we derive a perturbation-resilient approach based on ridge regression. Our solution employs a piecewise constant backbone model to ensure a performance lower bound and features an inherent noise reduction mechanism to mitigate perturbation-induced corruption. This formulation allows existing models to be trained at arbitrarily low precision and sparsity levels with off-the-shelf recipes. Furthermore, our method provides a novel perspective on training temporal binary neural networks, contributing to ongoing efforts to narrow the gap between artificial and biological neural networks.
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Numerical Analysis
### What problems does this paper attempt to solve?

This paper addresses the difficulty of training neural networks at extremely low precision and high sparsity. Quantization and sparsification introduce discontinuities that obstruct backpropagation, and the problem is most severe when training deep networks in ultra-low-precision and highly sparse regimes.

#### Main problems:

1. **Challenges of discontinuous operations**:
   - Quantization and sparsification rely on discontinuous operations such as rounding and hard-thresholding, which are incompatible with the differentiability that backpropagation requires.
   - This incompatibility makes it hard for training to converge and can cause it to diverge.
2. **Limitations of existing methods**:
   - Existing methods such as the Straight-Through Estimator (STE) define surrogate gradients for discontinuous operations, but training can still become unstable at low precision.
   - Most existing methods require substantial changes to the model architecture or training strategy, which limits their flexibility and generality.
3. **The need for efficient neural network training**:
   - As generative AI models grow in scale and complexity, computational efficiency has become a focus of research. Quantization and sparsification are two classic routes to efficiency, but the problems above limit how widely they can be applied.

### Solutions proposed in the paper:

To solve these problems, the paper proposes a new, robust, and universal solution: stabilizing training with a denoising affine transform. The steps are as follows:

1. **Affine transformation for quantization**:
   - Use min-max scaling, an affine transformation \(f\), to map the floating-point vector \(x\) to the target range (for example, \([0, 2^{\text{bits}}-1]\)) while preserving signal fidelity.
2. **Perturbation injection**:
   - Model quantization as adding a bounded perturbation \(\delta\), that is, \(q = f(x)+\delta\), where \(\delta=\text{round}(f(x)) - f(x)\).
   - This formulation avoids heuristic operations such as clipping and so keeps the signal intact.
3. **Denoising affine transformation for reconstruction**:
   - Introduce a second affine transformation \(g(q) = a\cdot q + b\) that reconstructs the original signal and suppresses quantization noise.
   - Fit \(a\) and \(b\) by ridge regression:
     \[ \min_{a,b}\ \frac{1}{2N}\|a\cdot q + b - x\|^2+\frac{\lambda}{2}a^2 \]
     where \(N\) is the length of \(x\) and \(\lambda\) is the regularization factor.
4. **Signal decomposition and denoising**:
   - Decompose the reconstructed signal into a smooth part \(\bar{x}\) (the mean of \(x\)) and a non-smooth part \(a(q-\bar{q})\). Adjusting \(\lambda\) during training controls the balance between signal and noise: a larger \(\lambda\) shrinks \(a\) and pulls the reconstruction toward the mean.
5. **Extension to sparsification**:
   - Handle sparsification in the same way by modeling it as a perturbation \(\delta = H(x)-x\), where \(H\) is the sparsification operator (for example, hard-thresholding). In extremely sparse cases, the mean value preserves the signal.

### Summary:

By introducing a denoising affine transform, the paper provides a robust and universal method for stabilizing neural network training at extremely low precision and high sparsity. According to the paper, the method simplifies training, improves model performance, and applies to multiple model architectures and tasks.
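
The quantization, perturbation, and ridge-regression steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`quantize_minmax`, `fit_denoising_affine`, `hard_threshold`) and the threshold parameter `tau` are hypothetical, chosen only for illustration. The closed form used for the fit follows from setting the gradients of the ridge objective to zero: \(a = \operatorname{cov}(q, x) / (\operatorname{var}(q) + \lambda)\) and \(b = \bar{x} - a\bar{q}\), with both moments normalized by \(N\).

```python
from statistics import fmean


def quantize_minmax(x, bits):
    """Min-max scale x to [0, 2**bits - 1], then round to integer levels."""
    lo, hi = min(x), max(x)
    levels = 2 ** bits - 1
    if hi == lo:  # constant input: every entry maps to level 0
        return [0 for _ in x]
    scale = levels / (hi - lo)
    # f(x) = (x - lo) * scale, and q = round(f(x)) = f(x) + delta.
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return [round((v - lo) * scale) for v in x]


def fit_denoising_affine(q, x, lam):
    """Closed-form minimizer of (1/2N)||a*q + b - x||^2 + (lam/2) * a^2."""
    n = len(x)
    q_bar, x_bar = fmean(q), fmean(x)
    cov = sum((qi - q_bar) * (xi - x_bar) for qi, xi in zip(q, x)) / n
    var = sum((qi - q_bar) ** 2 for qi in q) / n
    denom = var + lam
    # If q is constant and lam == 0, the slope is undetermined; fall back to 0.
    a = cov / denom if denom > 0 else 0.0
    b = x_bar - a * q_bar
    return a, b


def hard_threshold(x, tau):
    """Assumed sparsifier H: zero out entries whose magnitude is below tau."""
    return [v if abs(v) >= tau else 0.0 for v in x]


if __name__ == "__main__":
    x = [0.0, 1.0, 2.0, 3.0]
    q = quantize_minmax(x, bits=1)               # [0, 0, 1, 1]
    a, b = fit_denoising_affine(q, x, lam=0.0)   # a = 2.0, b = 0.5
    print([a * qi + b for qi in q])              # [0.5, 0.5, 2.5, 2.5]
```

In this example, with \(\lambda = 0\) the reconstruction maps each of the two quantization levels to the mean of the inputs assigned to it. As \(\lambda\) grows, \(a\) shrinks toward zero and the output collapses to the constant \(\bar{x}\). The same fit can be applied to a sparsified vector by passing `hard_threshold(x, tau)` in place of `q`.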