Abstract: The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper addresses the high memory consumption of neural network training and inference. Model performance improves as the parameter count grows, but the memory available on the device limits how large a model can be. Existing compression techniques such as quantization ease this limit but often degrade performance. The authors therefore propose NeuZip, a weight compression scheme based on the entropy of floating-point numbers, which reduces memory use without sacrificing performance.
### Main Contributions
1. **Memory-Efficient Training and Inference**:
- NeuZip significantly reduces memory consumption during training and inference by compressing floating-point weights in neural networks.
- During training, NeuZip reduces the memory footprint of the Llama-3 8B model from 31GB to less than 16GB while leaving the training dynamics fully unchanged.
- During inference, NeuZip can reduce memory consumption by more than half while maintaining nearly lossless performance.
2. **Lossless and Lossy Compression**:
- **Lossless Compression**: Achieved by compressing exponent bits, suitable for the training process.
- **Lossy Compression**: During inference, further reduces memory consumption by truncating mantissa bits while maintaining high performance.
3. **Compatibility**:
- NeuZip is compatible with existing memory optimization techniques (such as activation checkpointing), further enhancing memory efficiency.
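The lossless mode described in item 2 compresses exponent bits, which only pays off if those bits carry less information than the 8 bits they occupy. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a bfloat16 layout (1 sign, 8 exponent, 7 mantissa bits), uses hypothetical Gaussian weights with standard deviation 0.02, and uses `zlib` purely as a stand-in for whatever entropy coder NeuZip actually employs.

```python
import math
import random
import struct
import zlib
from collections import Counter


def float_to_bf16_bits(x):
    """Return a 16-bit bfloat16 pattern for x by truncating its float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def split_bf16(bits16):
    """Split a bfloat16 pattern into (sign, exponent, mantissa) fields."""
    sign = bits16 >> 15
    exponent = (bits16 >> 7) & 0xFF
    mantissa = bits16 & 0x7F
    return sign, exponent, mantissa


def entropy_bits(symbols):
    """Empirical Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical weights: an assumption for illustration, not data from the paper.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(50_000)]

fields = [split_bf16(float_to_bf16_bits(w)) for w in weights]
exponents = bytes(e for _, e, _ in fields)

# Entropy of the exponent field versus the 8 bits it occupies in memory.
exp_entropy = entropy_bits(exponents)

# Stand-in lossless coder: compress only the exponent bytes, then restore them.
packed = zlib.compress(exponents, 9)
restored = zlib.decompress(packed)

print(f"exponent entropy: {exp_entropy:.2f} bits (stored in 8 bits)")
print(f"exponent bytes: {len(exponents)} -> {len(packed)} after compression")
print("round trip exact:", restored == exponents)
```

Because the exponents are restored bit for bit, the weights seen by the optimizer are identical to the uncompressed ones, which is consistent with the claim that training dynamics stay unchanged.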
### Experimental Results
1. **Pre-training Experiments**:
- Pre-training experiments were conducted on Transformer models of different sizes, including GPT-Neo 2.7B, Llama-3 8B, and Llama-2 13B.
- Results show that NeuZip reduces memory consumption while maintaining the same performance. On the Llama-2 13B model, for example, memory consumption drops from 26.26GB to 18.58GB.
2. **Fine-tuning Experiments**:
- Fine-tuning experiments were conducted on the T5 model for the SQL generation task, including T5 1B, T5 3B, and T5 11B.
- Results indicate that NeuZip reduces memory consumption while maintaining the same BLEU score. On the T5 11B model, for example, memory consumption drops from 25.95GB to 20.68GB.
3. **Lossy Compression Experiments**:
- The performance of lossy NeuZip was evaluated on language modeling tasks, including decoder models (Llama-3 8B, Llama-2 13B, Yi-1.5 34B) and encoder-decoder models (T5 1B, T5 3B, T5 11B).
- Results show that lossy NeuZip significantly reduces memory consumption while maintaining high performance, with 3-bit mantissa retention achieving nearly lossless performance in all experiments.
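To make the 3-bit mantissa setting concrete, the sketch below truncates bfloat16 values to their top mantissa bits and measures the resulting relative error. It is an illustration under stated assumptions rather than the paper's procedure: it assumes a bfloat16 layout with 7 mantissa bits, drops the low bits by plain truncation (the actual method may round or rescale differently), and uses hypothetical Gaussian weights. The helper names are invented for this example.

```python
import random
import struct


def to_bf16_bits(x):
    """Return a 16-bit bfloat16 pattern for x by truncating its float32 encoding."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def from_bf16_bits(bits16):
    """Decode a 16-bit bfloat16 pattern back into a Python float."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return x


def truncate_mantissa(bits16, keep_bits):
    """Zero out all but the top `keep_bits` of the 7 mantissa bits."""
    drop = 7 - keep_bits
    mask = 0xFFFF & ~((1 << drop) - 1)
    return bits16 & mask


# Hypothetical weights, first snapped to exact bfloat16 values.
rng = random.Random(0)
weights = [from_bf16_bits(to_bf16_bits(rng.gauss(0.0, 0.02))) for _ in range(10_000)]

# Keep 3 mantissa bits; sign and exponent are left untouched.
lossy = [from_bf16_bits(truncate_mantissa(to_bf16_bits(w), 3)) for w in weights]

# For normal numbers, truncating to k mantissa bits gives relative error below 2**-k.
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(weights, lossy) if a != 0.0)
print(f"max relative error with 3 mantissa bits: {max_rel_err:.4f}")
```

With truncation, each weight keeps 1 sign bit, 8 exponent bits, and 3 mantissa bits before any exponent compression is applied, and its magnitude changes by less than one part in eight.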
### Conclusion
NeuZip effectively addresses the issue of memory consumption during neural network training and inference through lossless and lossy compression techniques while maintaining high model performance. This provides a new solution for the training and deployment of large-scale neural networks.