Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model

Habib Hajimolahoseini, Mohammad Hassanpour, Foozhan Ataiefard, Boxing Chen, Yang Liu

2024-06-28

Abstract: This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally decomposed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with the PLRD method on only 1B tokens maintain performance comparable to traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundational model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.

Computation and Language, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

The paper addresses the high computational and memory cost that large language models (LLMs) incur during training and deployment. Specifically:

1. **High computational resource and memory consumption**: Existing LLMs require large amounts of compute and memory because of their enormous parameter counts. For example, GPT-3 has 175 billion parameters and requires 320 GB of storage space, which puts it out of reach of most consumer-grade devices.
2. **Limited model variants**: To accommodate different computational budgets and application scenarios, LLMs are typically released in several sizes (such as Llama2's 7 billion, 13 billion, and 70 billion parameter versions). Each variant is trained from scratch, so only a few variants exist and each one is very expensive to train.
3. **Lack of intermediate-sized models**: If a user's computational resources fall between two variants, the user must choose the smaller one, which may not be the best use of the available resources.

To address these issues, the paper proposes Progressive Low Rank Decomposition (PLRD). The method generates multiple models of different sizes from a single pre-trained base model without retraining from scratch. PLRD compresses the model by progressively reducing the rank of its tensors, which significantly reduces computational overhead and energy consumption while maintaining model performance. Experimental results show that models trained with PLRD on only 1 billion tokens achieve performance comparable to traditionally trained models.
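The rank-reduction idea in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single dense weight matrix, uses truncated SVD as the decomposition, and omits the continued training that the paper applies to the decomposed models. The function names `low_rank_factors` and `progressive_decomposition`, the matrix shape, and the rank schedule are all hypothetical choices made for illustration.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Return (a, b) such that a @ b is the best rank-`rank` approximation of `weight`."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]  # shape (m, rank), singular values folded into the left factor
    b = vt[:rank, :]            # shape (rank, n)
    return a, b


def progressive_decomposition(weight, ranks):
    """Factor `weight` along a decreasing rank schedule.

    Each step starts from the previous step's reconstruction, so smaller
    family members are derived from larger ones rather than from scratch.
    """
    members = []
    current = weight
    for rank in ranks:
        a, b = low_rank_factors(current, rank)
        # Assumption: a training-based pipeline would fine-tune (a, b) here.
        # This sketch skips that step and only measures approximation error.
        rel_error = np.linalg.norm(weight - a @ b) / np.linalg.norm(weight)
        members.append({"rank": rank, "params": a.size + b.size, "rel_error": rel_error})
        current = a @ b
    return members


rng = np.random.default_rng(0)
weight = rng.standard_normal((256, 512))  # stand-in for one layer's weights
family = progressive_decomposition(weight, ranks=[128, 64, 32])

for member in family:
    print(f"rank={member['rank']:4d}  params={member['params']:6d}  "
          f"rel_error={member['rel_error']:.3f}")
```

A rank-r factorization of an m×n matrix stores r(m + n) values instead of mn, so here rank 128 keeps 98,304 values against 131,072 for the dense matrix. A random matrix is close to a worst case for low-rank approximation, so the errors printed by this sketch say nothing about how well trained LLM weights compress.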
