Abstract: High-dimensional token embeddings underpin Large Language Models (LLMs), as they can capture subtle semantic information and significantly enhance the modelling of complex language patterns. However, this high dimensionality also introduces considerable model parameters and prohibitively high model storage and memory requirements, which is particularly unaffordable for low-end devices. Targeting no extra training data and insufficient computation cases, we propose a training-free model compression approach based on the Tensor-Train Decomposition (TTD), whereby each pre-trained token embedding is converted into a lower-dimensional Matrix Product State (MPS). We then comprehensively investigate the low-rank structures extracted by this approach, in terms of the compression ratio, the language task performance, and latency on a typical low-end device (i.e. Raspberry Pi). Taking GPT family models (i.e. GPT-2 and CerebrasGPT) as case studies, our approach theoretically results in $46.89\%$ fewer parameters of the entire model, with a compression ratio of $39.38\times$ to $65.64\times$ for the embedding layers. With different hyperparameter choices, the model compressed with our approach can achieve a comparable language task performance to the original model with around $2.0\times$ embedding layer compression. This empirically proves the existence of low-rank structure in GPT family models, and demonstrates that about half of the parameters in the embedding layers are redundant.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the main obstacle to running large language models (LLMs) on low-end devices: the large parameter count, storage, and memory demands caused by high-dimensional token embedding layers. Specifically:

1. **High-Dimensional Token Embeddings**: High-dimensional embeddings capture subtle semantic information and significantly improve the modeling of complex language patterns, but they also introduce a large number of model parameters. The resulting storage and memory requirements are particularly challenging for low-end devices.
2. **Model Compression Without Additional Training Data or Computational Resources**: Existing model compression methods typically require additional training data or computational resources, which are often unavailable on low-end devices. The paper therefore proposes a compression method that requires no additional training.

### Solution

To address these issues, the paper proposes a training-free model compression method based on the Tensor-Train Decomposition (TTD). The steps are as follows:

1. **Tensorization**: Reshape each pre-trained token embedding vector into a higher-order tensor so that it can be decomposed.
2. **Tensor Decomposition**: Apply the Tensor-Train Decomposition (TTD) to each tensorized embedding, representing it as a low-rank Matrix Product State (MPS).
3. **Evaluation of Compression Effectiveness**: Study the extracted low-rank structure in terms of compression ratio, language task performance, and latency on a typical low-end device (Raspberry Pi).

### Main Contributions

1. **First Use of Low-Rank Factorization to Compress LLMs for Low-End Devices**: The Tensor-Train Decomposition is adapted to accommodate the non-parallel operations of the embedding layer, which other block-wise methods cannot achieve.
2. **Testing the Method on Language Modeling and Sentiment Classification Tasks**: The compressed models can even outperform the uncompressed models on these tasks. In particular, larger models generally achieve better accuracy and F1 scores after compression and are more robust on language modeling tasks.
3. **Providing Technical Ablation Studies**: The latency of sub-billion-parameter GPT models on a Raspberry Pi is measured under different compression settings, and a detailed system analysis is conducted.

### Experimental Results

- **Compression Ratio and Language Task Performance**: For the GPT family models, the embedding layers can theoretically be compressed by 39.38× to 65.64×, which corresponds to 46.89% fewer parameters in the entire model. Language task performance comparable to the original model is achieved at around 2.0× embedding layer compression.
- **Robustness of Large-Scale Models**: Larger models (e.g., CerebrasGPT-590M and CerebrasGPT-1.3B) exhibit better robustness as the compression ratio increases, especially when the compression ratio is less than 1.0x.
- **Latency**: Although TensorGPT significantly reduces model parameters and can improve language task performance, it also introduces compression and inference latency in practical applications. These latencies remain within an acceptable range; for a single text input, latency does not exceed 0.3 seconds.

### Related Work

- **Matrix or Tensor Factorization for Language Model Compression**: Existing methods are mainly divided into matrix-based and tensor-based methods. Matrix-based methods include Singular Value Decomposition (SVD), weighted methods, knowledge distillation, and pruning. Tensor-based methods include Kronecker decomposition and tensor-train structures, but these methods usually require additional training.
- **Tensor Networks and Tensor Network Structure Search**: Some research explores tensor network structure search to optimize model compression, but these methods also typically require additional training.

In summary, this paper effectively addresses the high storage and memory demands faced by large language models on low-end devices by proposing a model compression method based on Tensor-Train Decomposition that does not require additional training.
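
The tensorization and decomposition steps described in the summary above can be illustrated with a minimal NumPy sketch of the standard TT-SVD procedure applied to a single embedding vector. This is not the authors' implementation. The function names `tt_svd` and `tt_reconstruct`, the tensor shape `(4, 4, 4, 4, 3)`, the rank cap of 2, and the random stand-in vector are all assumptions chosen for illustration; the summary does not state the shapes or ranks used in the paper.

```python
import numpy as np


def tt_svd(vector, dims, max_rank):
    """Split a 1-D array into tensor-train cores using sequential truncated SVDs.

    dims and max_rank are illustrative hyperparameters, not values from the paper.
    """
    vector = np.asarray(vector, dtype=np.float64)
    assert int(np.prod(dims)) == vector.size, "dims must factor the vector length"
    cores = []
    rank = 1
    remainder = vector
    for dim in dims[:-1]:
        # Unfold: rows index (previous rank, current mode), columns index the rest.
        remainder = remainder.reshape(rank * dim, -1)
        u, s, vt = np.linalg.svd(remainder, full_matrices=False)
        new_rank = min(max_rank, s.size)
        # The truncated left factor becomes the core of shape (rank, dim, new_rank).
        cores.append(u[:, :new_rank].reshape(rank, dim, new_rank))
        # Carry the truncated singular values and right factor to the next step.
        remainder = s[:new_rank, None] * vt[:new_rank]
        rank = new_rank
    cores.append(remainder.reshape(rank, dims[-1], 1))
    return cores


def tt_reconstruct(cores):
    """Contract the cores back into a flat vector."""
    result = cores[0]
    for core in cores[1:]:
        result = np.tensordot(result, core, axes=([-1], [0]))
    return result.reshape(-1)


# Hypothetical example: a random stand-in for one 768-dimensional embedding.
rng = np.random.default_rng(0)
embedding = rng.standard_normal(768)
dims = (4, 4, 4, 4, 3)  # assumed tensorization, 4*4*4*4*3 = 768

cores = tt_svd(embedding, dims, max_rank=2)
n_params = sum(core.size for core in cores)
approx = tt_reconstruct(cores)

print("parameters per token:", embedding.size, "->", n_params)
print("per-token compression ratio:", embedding.size / n_params)
print("relative error:", np.linalg.norm(embedding - approx) / np.linalg.norm(embedding))
```

Because every token embedding is decomposed independently, no training data or gradient computation is involved, which matches the training-free setting described above. A random vector has no low-rank structure, so the reconstruction error in this sketch is large; the paper's claim is that pre-trained embeddings tolerate such truncation much better. The rank cap trades compression ratio against reconstruction error.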

TensorGPT: Efficient Compression of Large Language Models based on Tensor-Train Decomposition

TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition

FoldGPT: Simple and Effective Large Language Model Compression Scheme

MoDeGPT: Modular Decomposition for Large Language Model Compression

SliceGPT: Compress Large Language Models by Deleting Rows and Columns

Blockwise Compression of Transformer-based Models without Retraining

Accelerating Large Language Model Training with Hybrid GPU-based Compression

Exploring Extreme Parameter Compression for Pre-trained Language Models

Kronecker Decomposition for GPT Compression

TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones

LLM Vocabulary Compression for Low-Compute Environments

ShortGPT: Layers in Large Language Models are More Redundant Than You Expect

Efficient GPT Model Pre-training using Tensor Train Matrix Representation

CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks

Communication Compression for Tensor Parallel LLM Inference

Semantic Compression With Large Language Models

Aggressive Post-Training Compression on Extremely Large Language Models

TQCompressor: improving tensor decomposition methods in neural networks via permutations

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Tensor train decompositions on recurrent networks

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot