Abstract: We present EE-LLM, a framework for large-scale training and inference of early-exit large language models (LLMs). While recent works have shown preliminary evidence for the efficacy of early exiting in accelerating LLM inference, EE-LLM makes a foundational step towards scaling up early-exit LLMs by supporting their training and inference with massive 3D parallelism. Built upon Megatron-LM, EE-LLM implements a variety of algorithmic innovations and performance optimizations tailored to early exiting, including a lightweight method that facilitates backpropagation for the early-exit training objective with pipeline parallelism, techniques of leveraging idle resources in the original pipeline schedule for computation related to early-exit layers, and two approaches of early-exit inference that are compatible with KV caching for autoregressive generation. Our analytical and empirical study shows that EE-LLM achieves great training efficiency with negligible computational overhead compared to standard LLM training, as well as outstanding inference speedup without compromising output quality. To facilitate further research and adoption, we release EE-LLM at https://github.com/pan-x-c/EE-LLM.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper addresses the training and inference of large-scale early-exit large language models (LLMs). Specifically, it proposes EE-LLM, a framework that supports training and inference of large-scale early-exit LLMs with 3D parallelism.

#### Background and Motivation

1. **Cost and Carbon Emissions**: In recent years, LLMs have gained widespread attention for their strong capabilities across a wide range of problems. However, training and deploying these models comes with significant costs and carbon emissions.
2. **Inference Cost**: In the long run, inference costs dominate, because each model is used to solve many problems over a long period.
3. **Need for Accelerated Inference**: This has motivated researchers and engineers to develop various methods for accelerating LLM inference.

#### Concept of Early Exit

- **Early Exit**: Accelerates inference by allowing a deep neural network to make a prediction and exit early on certain inputs. This is achieved by adding extra early-exit layers to the standard neural network architecture.
- **Adaptive Computation**: Early-exit models retain the full capacity of large models while adaptively using fewer computational resources on simpler problems.

#### Research Objectives

- **Infrastructure Development**: The main goal of the paper is to build infrastructure that supports the training and inference of large-scale early-exit LLMs.
- **Scale Expansion**: Existing early-exit models are still relatively small, while standard LLMs are already very large. The paper aims to understand how effective early exiting is for large-scale LLMs and to make it a practical option in complex scenarios.

#### Challenges

1. **Memory Limitation**: How to train an early-exit LLM that cannot fit into the memory of a single device (e.g., a GPU).
2. **Pipeline Parallelism**: Existing large-scale LLM training frameworks do not support the training of early-exit LLMs, especially with pipeline parallelism.
3. **Computational Efficiency**: Training of early-exit generative LLMs must be designed carefully to avoid significant computational overhead.
4. **Autoregressive Generation**: Early-exit inference conflicts with KV caching: a token that exits early leaves no KV cache entries for the layers it skipped, which affects the generation of future tokens.

#### Main Contributions

- **EE-LLM Framework**: Proposes a system that supports large-scale early-exit LLM training and inference with 3D parallelism.
- **Algorithmic Innovations**: Includes a lightweight backpropagation method, techniques for utilizing idle resources, and early-exit inference methods compatible with KV caching.
- **Optimized Implementation**: Achieves high training and inference efficiency through various performance optimizations.
- **Experimental Validation**: Analysis and experiments show that EE-LLM achieves significant inference acceleration with almost no additional computational overhead and without affecting output quality.

In summary, by proposing the EE-LLM framework, this paper addresses multiple challenges in the training and inference of large-scale early-exit LLMs, providing strong support for future research and applications.
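The early-exit idea described under "Concept of Early Exit" can be illustrated with a minimal sketch. It is not EE-LLM's implementation: the function `early_exit_predict`, the confidence-threshold exit rule, and the toy layers and heads are all assumptions made for illustration. The sketch runs the layers in order and returns as soon as an exit head's top softmax probability reaches the threshold.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]   # hidden state -> hidden state
Head = Callable[[Vector], Vector]    # hidden state -> logits over the vocabulary


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    hidden: Vector,
    layers: Sequence[Layer],
    exit_heads: Dict[int, Head],   # hypothetical: layer index -> early-exit head
    final_head: Head,
    threshold: float = 0.9,        # hypothetical confidence threshold
) -> Tuple[int, int]:
    """Return (predicted token id, number of layers actually executed)."""
    for i, layer in enumerate(layers):
        hidden = layer(hidden)
        head = exit_heads.get(i)
        if head is not None:
            probs = softmax(head(hidden))
            best = max(range(len(probs)), key=probs.__getitem__)
            # Exit early when the intermediate prediction is confident enough.
            if probs[best] >= threshold:
                return best, i + 1
    # No early exit fired: fall back to the final output head.
    probs = softmax(final_head(hidden))
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, len(layers)


# Toy model (assumption): each layer doubles the hidden state,
# and every head simply uses the hidden state as logits.
double: Layer = lambda h: [2.0 * x for x in h]
identity: Head = lambda h: list(h)
toy_layers = [double, double, double]
toy_exits = {0: identity, 1: identity}

token, used = early_exit_predict([1.0, 0.0], toy_layers, toy_exits, identity, 0.9)
print(f"token={token}, layers used={used} of {len(toy_layers)}")
```

A lower threshold trades prediction confidence for fewer executed layers. The sketch handles a single prediction only and ignores KV caching, which the summary above lists as a separate challenge for autoregressive generation.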

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

EE-Tuning: An Economical yet Scalable Solution for Tuning Early-Exit Large Language Models

An Efficient Inference Framework for Early-exit Large Language Models

E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning

E^2-LLM: Efficient and Extreme Length Extension of Large Language Models

InternEvo: Efficient Long-sequence Large Language Model Training via Hybrid Parallelism and Redundant Sharding

LoRAExit: Empowering Dynamic Modulation of LLMs in Resource-limited Settings Using Low-rank Adapters

AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment

Early Exit is a Natural Capability in Transformer-based Models: an Empirical Study on Early Exit Without Joint Optimization

UELLM: A Unified and Efficient Approach for LLM Inference Serving

MindLLM: Lightweight Large Language Model Pre-Training, Evaluation and Domain Application

Understanding LLMs: A Comprehensive Overview from Training to Inference

Accelerating Large Language Model Inference with Self-Supervised Early Exits

3D-LLM: Injecting the 3D World into Large Language Models

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

eFedLLM: Efficient LLM Inference Based on Federated Learning

Distributed Training of Large Language Models

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Seq1F1B: Efficient Sequence-Level Pipeline Parallelism for Large Language Model Training

MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications