Abstract: The performance of neural networks improves when more parameters are used. However, the model sizes are constrained by the available on-device memory during training and inference. Although applying techniques like quantization can alleviate the constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we are able to achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method can reduce memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
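The abstract's appeal to "the entropy of floating-point numbers" can be made concrete with a small measurement. The following is a minimal sketch, not the authors' implementation: it estimates the empirical Shannon entropy of the 8-bit exponent field of float32 values. The function name `exponent_entropy_bits` and the synthetic Gaussian weights are assumptions made for illustration; real trained weights would need to be loaded separately.

```python
import numpy as np


def exponent_entropy_bits(weights: np.ndarray) -> float:
    """Empirical Shannon entropy (in bits) of the 8-bit float32 exponent field."""
    # Reinterpret each float32 as its raw 32-bit pattern.
    bits = np.ascontiguousarray(weights, dtype=np.float32).view(np.uint32)
    # Shift out the 23 mantissa bits, then mask off the sign bit.
    exponents = ((bits >> 23) & 0xFF).astype(np.int64).ravel()
    counts = np.bincount(exponents, minlength=256)
    probs = counts[counts > 0] / exponents.size
    return float(-(probs * np.log2(probs)).sum())


# Hypothetical stand-in for trained weights: small zero-centred Gaussian values.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=100_000).astype(np.float32)
print(f"exponent entropy: {exponent_entropy_bits(weights):.2f} bits of 8 stored")
```

When the measured entropy is well below the 8 bits actually stored per exponent, an entropy coder can in principle shrink that field without losing information, which is the intuition behind the lossless variant.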

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the issue of high memory consumption during the training and inference of neural networks. Specifically, as the number of neural network parameters increases, although model performance improves, the available memory on devices becomes a limiting factor. Existing compression techniques such as quantization can alleviate this issue but often lead to performance degradation. Therefore, the authors propose NeuZip, a weight compression scheme based on floating-point entropy, which achieves efficient memory management without sacrificing performance.

### Main Contributions

1. **Memory-Efficient Training and Inference**:
   - NeuZip significantly reduces memory consumption during training and inference by compressing floating-point weights in neural networks.
   - During training, NeuZip can reduce the memory consumption of the Llama-3 8B model from 31GB to below 16GB while maintaining training dynamics.
   - During inference, NeuZip can reduce memory consumption by more than half while maintaining nearly lossless performance.
2. **Lossless and Lossy Compression**:
   - **Lossless Compression**: Achieved by compressing exponent bits, suitable for the training process.
   - **Lossy Compression**: During inference, further reduces memory consumption by truncating mantissa bits while maintaining high performance.
3. **Compatibility**:
   - NeuZip is compatible with existing memory optimization techniques (such as activation checkpointing), further enhancing memory efficiency.

### Experimental Results

1. **Pre-training Experiments**:
   - Pre-training experiments were conducted on Transformer models of different sizes, including GPT-Neo 2.7B, Llama-3 8B, and Llama-2 13B.
   - Results show that NeuZip significantly reduces memory consumption while maintaining the same performance, particularly reducing memory consumption from 26.26GB to 18.58GB on the Llama-2 13B model.
2. **Fine-tuning Experiments**:
   - Fine-tuning experiments were conducted on the T5 model for the SQL generation task, including T5 1B, T5 3B, and T5 11B.
   - Results indicate that NeuZip significantly reduces memory consumption while maintaining the same BLEU score, particularly reducing memory consumption from 25.95GB to 20.68GB on the T5 11B model.
3. **Lossy Compression Experiments**:
   - The performance of lossy NeuZip was evaluated on language modeling tasks, including decoder models (Llama-3 8B, Llama-2 13B, Yi-1.5 34B) and encoder-decoder models (T5 1B, T5 3B, T5 11B).
   - Results show that lossy NeuZip significantly reduces memory consumption while maintaining high performance, with 3-bit mantissa retention achieving nearly lossless performance in all experiments.

### Conclusion

NeuZip effectively addresses the issue of memory consumption during neural network training and inference through lossless and lossy compression techniques while maintaining high model performance. This provides a new solution for the training and deployment of large-scale neural networks.
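The lossy variant described above, which keeps only a few mantissa bits, can be illustrated with plain bit masking. This is a minimal sketch under stated assumptions rather than the paper's method: it works on float32 (23 mantissa bits), truncates toward zero instead of rounding, and does not store the result in a packed form, so it shows the precision loss but not the memory saving. The name `truncate_mantissa` is hypothetical.

```python
import numpy as np


def truncate_mantissa(weights: np.ndarray, keep_bits: int) -> np.ndarray:
    """Zero all but the top `keep_bits` bits of the 23-bit float32 mantissa."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be in [0, 23]")
    bits = np.ascontiguousarray(weights, dtype=np.float32).view(np.uint32)
    drop = 23 - keep_bits
    # Mask keeps the sign, the exponent, and the leading mantissa bits.
    mask = np.uint32((0xFFFFFFFF >> drop) << drop)
    return (bits & mask).view(np.float32)


# Hypothetical stand-in for trained weights.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=10_000).astype(np.float32)
approx = truncate_mantissa(weights, keep_bits=3)
rel_err = np.abs(approx - weights) / np.abs(weights)
print(f"max relative error with 3 mantissa bits: {rel_err.max():.4f}")
```

For normal (non-subnormal) values, truncating to `keep_bits` mantissa bits bounds the relative error by `2 ** -keep_bits`, so three retained bits keep every weight within 12.5% of its original value. Whether that is acceptable for a given model is an empirical question, which the experiments summarized above address.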

NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
