4-bit Shampoo for Memory-Efficient Network Training

Sike Wang, Pan Zhou, Jia Li, Hua Huang
2024-10-27
Abstract: Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4-th root. Besides, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification and natural language modeling demonstrates that our 4-bit Shampoo achieves comparable performance to its 32-bit counterpart while being more memory-efficient.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The main problem this paper addresses is how to reduce the memory consumption of second-order optimizers while maintaining performance. Specifically, the authors propose 4-bit Shampoo, which quantizes the eigenvector matrix of the preconditioner so that it matches the performance of 32-bit Shampoo when training deep neural networks while significantly reducing memory usage.

### Core issues of the paper

1. **Memory limitations**: As model scale increases, the memory overhead of second-order optimizers such as Shampoo becomes very large, which is a major obstacle to their wide adoption.
2. **Quantization challenges**: Although the states of first-order optimizers can be compressed effectively through quantization, second-order optimizers rely on matrix operations such as inverse roots, so directly quantizing their matrix states is more challenging.

### Solutions

The authors propose quantizing the eigenvector matrix of the preconditioner instead of the preconditioner itself. The specific contributions are as follows:

- **Quantizing the eigenvector matrix**: Quantizing the eigenvector matrix of the preconditioner better preserves the accuracy of small singular values, avoiding the large errors that can arise when the preconditioner itself is quantized.
- **Orthogonal correction**: To further improve the quality of the quantized eigenvector matrix, the authors apply Björck orthogonalization to rectify the orthogonality of the quantized matrix.
- **Quantization mapping selection**: Experiments show that linear square quantization slightly outperforms dynamic tree quantization at 4-bit precision.
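
The quantize-then-rectify idea in the bullets above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the signed-square codebook is one plausible reading of "linear square quantization", the per-row absmax scaling is a stand-in for whatever block-wise normalization the paper uses, the test matrix is an arbitrary product of Givens rotations, and all function names are hypothetical. The rectification step is the standard first-order Björck iteration.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonality_error(v):
    """Frobenius norm of V^T V - I (zero for an exactly orthogonal V)."""
    g = matmul(transpose(v), v)
    n = len(g)
    return math.sqrt(sum((g[i][j] - (i == j)) ** 2 for i in range(n) for j in range(n)))

def linear_square_codebook(bits=4):
    """Signed-square codebook on [-1, 1]; an ASSUMED form of the paper's mapping."""
    levels = 2 ** bits
    zero_index = levels // 2 - 1
    codebook = []
    for j in range(levels):
        x = -1.0 + 2.0 * j / (levels - 1)  # evenly spaced grid, then squared
        codebook.append(0.0 if j == zero_index else math.copysign(x * x, x))
    return codebook

def quantize_dequantize(v, codebook):
    """Snap each entry to the nearest codebook value after per-row absmax scaling.
    Per-row scaling is an assumption made for brevity."""
    out = []
    for row in v:
        scale = max(abs(x) for x in row) or 1.0
        out.append([scale * min(codebook, key=lambda q: abs(q - x / scale)) for x in row])
    return out

def bjorck_step(v):
    """One first-order Björck iteration: V <- 1.5 V - 0.5 V (V^T V)."""
    vg = matmul(v, matmul(transpose(v), v))
    return [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(v, vg)]

def givens_orthogonal(n):
    """Deterministic orthogonal test matrix built from Givens rotations."""
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            theta = 0.3 + 0.7 * i + 0.2 * j  # arbitrary fixed angles
            c, s = math.cos(theta), math.sin(theta)
            for row in v:
                row[i], row[j] = c * row[i] - s * row[j], s * row[i] + c * row[j]
    return v

v = givens_orthogonal(8)                              # stands in for an eigenvector matrix
q = quantize_dequantize(v, linear_square_codebook(4)) # 4-bit round trip
r = bjorck_step(q)                                    # orthogonality rectification
print(orthogonality_error(v), orthogonality_error(q), orthogonality_error(r))
```

The three printed values are the orthogonality error of the exact matrix, of its 4-bit round trip, and of the rectified matrix. Further Björck steps should shrink the error again as long as the quantized matrix stays close to orthogonal; the number of iterations the paper uses is not stated in this summary.
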
### Experimental results

Through experiments on multiple image classification and natural language modeling tasks, the authors demonstrate that 4-bit Shampoo maintains performance comparable to 32-bit Shampoo while reducing the memory footprint. For example, on the CIFAR-100 and ImageNet-1k datasets, the test accuracy of 4-bit Shampoo is close to that of 32-bit Shampoo, while 4-bit Shampoo saves 4.5% to 41% in memory usage.

### Summary

This paper addresses the high memory consumption of second-order optimizers through a quantization method tailored to their matrix states, providing a more memory-efficient optimization tool for large-scale model training.
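
To put the memory figures above in perspective, the storage cost of a single square optimizer state matrix can be worked out directly. The sketch below is illustrative arithmetic only: the matrix size of 1024, the block size of 64, and the one 32-bit scale per block are assumptions chosen for the example, and `state_bytes` is a hypothetical helper rather than anything from the paper.

```python
import math

def state_bytes(n, bits, block_size=None):
    """Bytes needed to store an n x n state matrix at the given bitwidth.
    If block_size is set, one 32-bit scale per block of entries is added."""
    payload = n * n * bits // 8
    scales = 0 if block_size is None else math.ceil(n * n / block_size) * 4
    return payload + scales

full = state_bytes(1024, 32)       # 32-bit state, no scales
quant = state_bytes(1024, 4, 64)   # 4-bit state plus per-block scales
print(full, quant, full / quant)
```

Under these assumptions a single matrix shrinks by roughly a factor of seven. The end-to-end saving reported above is smaller, presumably because total training memory also includes weights, gradients, activations, and any states that are kept at higher precision.
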