Accelerating Large Language Model Inference with Self-Supervised Early Exits

Florian Valade

2024-07-30

Abstract: This paper presents a novel technique for accelerating inference in large, pre-trained language models (LLMs) by introducing early exits during inference. The computational demands of these models, used across a wide range of applications, can be substantial. By capitalizing on the inherent variability in token complexity, our approach enables selective acceleration of the inference process. Specifically, we propose the integration of early exit "heads" atop existing transformer layers, which facilitate conditional terminations based on a confidence metric. These heads are trained in a self-supervised manner using the model's own predictions as training data, thereby eliminating the need for additional annotated data. The confidence metric, established using a calibration set, ensures a desired level of accuracy while enabling early termination when confidence exceeds a predetermined threshold. Notably, our method preserves the original accuracy and reduces computational time on certain tasks, leveraging the existing knowledge of pre-trained LLMs without requiring extensive retraining. This lightweight, modular modification has the potential to greatly enhance the practical usability of LLMs, particularly in applications like real-time language processing in resource-constrained environments.
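The self-supervised training idea in the abstract can be illustrated with a minimal sketch: an exit head is fitted to reproduce the full model's own predictions, so no annotated labels are needed. Everything concrete here is an assumption made for illustration, not the paper's implementation: the head is a plain linear softmax classifier, the "hidden states" are toy feature vectors, and the names `train_exit_head` and `head_predict` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_exit_head(features, teacher_logits, num_classes, epochs=200, lr=0.5):
    """Fit a linear exit head (assumed architecture) with cross-entropy.

    The training targets are the argmax of the full model's own logits,
    so the labels come from the model itself rather than from annotations.
    """
    targets = [max(range(num_classes), key=lambda k: t[k]) for t in teacher_logits]
    dim = len(features[0])
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(features, targets):
            logits = [sum(w * xi for w, xi in zip(W[k], x)) + b[k]
                      for k in range(num_classes)]
            p = softmax(logits)
            for k in range(num_classes):
                # Gradient of cross-entropy w.r.t. logit k: p_k - 1[k == y]
                g = p[k] - (1.0 if k == y else 0.0)
                for j in range(dim):
                    W[k][j] -= lr * g * x[j]
                b[k] -= lr * g
    return W, b

def head_predict(W, b, x):
    """Return the class index the exit head scores highest for input x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + bk for row, bk in zip(W, b)]
    return max(range(len(logits)), key=lambda k: logits[k])

# Toy intermediate-layer features and the full model's logits on the same inputs.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
teacher_logits = [[2.0, 0.0], [1.5, 0.2], [0.0, 3.0], [0.1, 2.0]]
W, b = train_exit_head(features, teacher_logits, num_classes=2)
```

In a real LLM the features would be hidden states from an intermediate transformer layer and the classes would be vocabulary tokens; the sketch only shows where the training signal comes from.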

Computation and Language, Machine Learning

What problem does this paper attempt to address?

The paper attempts to address the problem of accelerating the inference process in large pre-trained language models (LLMs). Specifically, the paper proposes a novel technique to reduce computational costs by introducing an "early exit" mechanism during the inference process. Although large language models perform excellently in natural language processing (NLP) tasks, their enormous computational demands often require powerful server infrastructure support, which limits local applications and brings about privacy issues and high costs. To address these problems, the paper presents the following contributions:

1. **Lightweight Modular Enhancement**: Integrating early exit mechanisms into existing pre-trained language models, with these exit points strategically placed within the model and trained in a self-supervised manner, using the model's own outputs as training targets.
2. **Dynamic Threshold Setting**: Proposing a new method to determine when to apply early exit during inference. This includes generating a calibration set and establishing confidence thresholds in a self-supervised manner. These thresholds allow the model to decide whether to continue processing based on the confidence level of each step's prediction.

This approach not only improves the efficiency of LLMs but also maintains the accuracy and completeness of the model outputs, making it particularly suitable for environments requiring real-time processing capabilities and limited resources. In summary, the paper aims to enhance the inference speed of large language models through a flexible and adaptive method without sacrificing their accuracy.
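The threshold-setting and early-exit decision described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's procedure: confidence is taken to be the maximum softmax probability, the threshold rule (lowest confidence at which accepted calibration samples still agree with the full model at a target rate) is one plausible choice, and `calibrate_threshold` and `early_exit_predict` are hypothetical names.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate_threshold(confidences, agrees, target_accuracy):
    """Pick the smallest threshold that keeps agreement at the target level.

    confidences: exit-head confidence for each calibration sample.
    agrees: whether the exit head matched the full model on that sample
            (self-supervised: no human labels involved).
    Returns None if no threshold reaches the target.
    """
    pairs = sorted(zip(confidences, agrees), reverse=True)  # most confident first
    best = None
    correct = 0
    for i, (conf, ok) in enumerate(pairs, start=1):
        correct += ok
        # Only evaluate a threshold once all samples tied at it are counted.
        if i < len(pairs) and pairs[i][0] == conf:
            continue
        if correct / i >= target_accuracy:
            best = conf
    return best

def early_exit_predict(exits, thresholds, final):
    """Run exits shallow to deep and stop at the first confident one.

    exits: zero-argument callables returning logits for each exit head, so
           deeper layers are never evaluated once an exit fires.
    final: zero-argument callable returning the full model's logits.
    Returns (predicted_index, depth_used).
    """
    for depth, (exit_fn, thr) in enumerate(zip(exits, thresholds)):
        probs = softmax(exit_fn())
        conf = max(probs)
        if thr is not None and conf >= thr:
            return probs.index(conf), depth
    probs = softmax(final())
    return probs.index(max(probs)), len(exits)

# Calibration data for one exit head (toy values).
threshold = calibrate_threshold(
    confidences=[0.95, 0.9, 0.8, 0.6, 0.5],
    agrees=[True, True, True, False, True],
    target_accuracy=0.9,
)
```

Passing the exits as callables is a design choice for the sketch: it makes the saving explicit, since any layer after the exit that fires is simply never computed.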


A Global Past-Future Early Exit Method for Accelerating Inference of Pre-trained Language Models

An Efficient Inference Framework for Early-exit Large Language Models

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism

Dynamic Vocabulary Pruning in Early-Exit LLMs

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Accelerating Inference for Pretrained Language Models by Unified Multi-Perspective Early Exiting

Self-Selected Attention Span for Accelerating Large Language Model Inference

Early Exit is a Natural Capability in Transformer-based Models: an Empirical Study on Early Exit Without Joint Optimization

Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

On Speeding Up Language Model Evaluation

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy

Accelerating Pretrained Language Model Inference Using Weighted Ensemble Self-distillation

Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

LoRAExit: Empowering Dynamic Modulation of LLMs in Resource-limited Settings Using Low-rank Adapters

ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Efficient and Economic Large Language Model Inference with Attention Offloading

Inference acceleration for large language models using "stairs" assisted greedy generation