Abstract: Large Language Models (LLMs) such as ChatGPT and LLaMA are advancing rapidly in generative Artificial Intelligence (AI), but their immense size poses significant challenges: huge training and inference costs, substantial energy demands, and limitations for on-site deployment. Traditional compression methods such as pruning, distillation, and low-rank approximation reduce the effective number of neurons in the network, while quantization reduces the numerical precision of individual weights, shrinking the model while keeping the number of neurons fixed. Although these compression methods have been relatively successful in practice, there is no compelling reason to believe that truncating the number of neurons is an optimal strategy. In this context, this paper introduces CompactifAI, an LLM compression approach based on quantum-inspired Tensor Networks that instead focuses on the model's correlation space, allowing for a more controlled, refined and interpretable model compression. Our method is versatile and can be implemented with, or on top of, other compression techniques. As a benchmark, we demonstrate that combining CompactifAI with quantization reduces the memory size of LLaMA 7B by 93% and the number of parameters by 70%, and speeds up training by 50% and inference by 25%, with only a small accuracy drop of 2% to 3%, going well beyond what other compression techniques achieve today. Our method also enables refined layer sensitivity profiling, showing that deeper layers tend to be more suitable for tensor network compression, which is compatible with recent observations on the ineffectiveness of those layers for LLM performance. Our results imply that standard LLMs are, in fact, heavily overparametrized, and do not need to be large at all.
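
To make the distinction drawn in the abstract concrete, the sketch below contrasts two ways of shrinking a single weight matrix: factorizing it (fewer parameters) and quantizing it (fewer bits per parameter). It is not the CompactifAI algorithm, which the abstract does not specify. It uses a plain truncated SVD, the simplest two-factor case of a tensor-network factorization, and symmetric int8 quantization. The matrix sizes, the rank, the synthetic weights, and the helper names `truncated_svd_factors` and `quantize_int8` are all assumptions made for illustration.

```python
import numpy as np


def truncated_svd_factors(weight, rank):
    """Factor an (m x n) matrix into (m x rank) and (rank x n) via truncated SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    left = u[:, :rank] * s[:rank]  # absorb the kept singular values into the left factor
    right = vt[:rank, :]
    return left, right


def quantize_int8(weight):
    """Symmetric per-tensor int8 quantization; returns integer codes and the scale."""
    scale = np.abs(weight).max() / 127.0
    codes = np.clip(np.round(weight / scale), -127, 127).astype(np.int8)
    return codes, scale


# Hypothetical weight matrix: low-rank signal plus small noise (illustration only).
rng = np.random.default_rng(0)
m, n, rank = 64, 48, 8
weight = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
weight += 0.01 * rng.standard_normal((m, n))

# Factorization: fewer parameters at the same numerical precision.
left, right = truncated_svd_factors(weight, rank)
rel_err = np.linalg.norm(weight - left @ right) / np.linalg.norm(weight)
params_before = weight.size
params_after = left.size + right.size

# Quantization: same parameter count, fewer bits per parameter.
codes, scale = quantize_int8(weight)
dequant = codes.astype(np.float64) * scale
max_quant_err = np.abs(weight - dequant).max()

print("parameters:", params_before, "->", params_after)
print("relative factorization error:", rel_err)
print("max quantization error:", max_quant_err)
```

The two steps are independent, so they can be stacked by quantizing the factors `left` and `right`. This is the sense in which a factorization-based method can be applied "with, or on top of" quantization.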

Neural Network Language Model Compression with Product Quantization and Soft Binarization

Highly Efficient Neural Network Language Model Compression Using Soft Binarization Training

Pse: Mixed Quantization Framework of Neural Networks for Efficient Deployment

Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers

Deep Neural Network Compression With Single and Multiple Level Quantization

Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition

On the Compressibility of Quantized Large Language Models

Weight Normalization based Quantization for Deep Neural Network Compression

Model compression as constrained optimization, with application to neural nets. Part II: quantization

QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm

Compressing Neural Language Models by Sparse Word Representations

LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit

LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks

CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks

Residual Quantization for Low Bit-Width Neural Networks

Deep learning model compression using network sensitivity and gradients

VecQ: Minimal Loss DNN Model Compression With Vectorized Weight Quantization

PB-LLM: Partially Binarized Large Language Models

Adaptive Layerwise Quantization for Deep Neural Network Compression

Comprehensive Study on Performance Evaluation and Optimization of Model Compression: Bridging Traditional Deep Learning and Large Language Models

Neural Network Compression using Binarization and Few Full-Precision Weights