Accelerating Large Language Model Inference with Self-Supervised Early Exits

Florian Valade
2024-07-30
Abstract: This paper presents a novel technique for accelerating inference in large, pre-trained language models (LLMs) by introducing early exits during inference. The computational demands of these models, used across a wide range of applications, can be substantial. By capitalizing on the inherent variability in token complexity, our approach enables selective acceleration of the inference process. Specifically, we propose the integration of early exit "heads" atop existing transformer layers, which facilitate conditional terminations based on a confidence metric. These heads are trained in a self-supervised manner using the model's own predictions as training data, thereby eliminating the need for additional annotated data. The confidence metric, established using a calibration set, ensures a desired level of accuracy while enabling early termination when confidence exceeds a predetermined threshold. Notably, our method preserves the original accuracy and reduces computational time on certain tasks, leveraging the existing knowledge of pre-trained LLMs without requiring extensive retraining. This lightweight, modular modification has the potential to greatly enhance the practical usability of LLMs, particularly in applications like real-time language processing in resource-constrained environments.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the cost of inference in large pre-trained language models (LLMs). Specifically, it proposes reducing computation by introducing an "early exit" mechanism during inference. Although LLMs perform very well on natural language processing (NLP) tasks, their computational demands often require powerful server infrastructure, which limits local deployment and raises privacy concerns and costs. To address these problems, the paper presents the following contributions:

1. **Lightweight Modular Enhancement**: Early exit heads are integrated into an existing pre-trained language model. The exit points are placed at selected layers of the model and trained in a self-supervised manner, using the model's own outputs as training targets.
2. **Dynamic Threshold Setting**: A new method determines when to exit early during inference. A calibration set is generated and confidence thresholds are established from it in a self-supervised manner. At each exit point, the model compares the confidence of its current prediction against the threshold to decide whether to stop or continue processing.

This approach improves the efficiency of LLMs while aiming to maintain the accuracy of their outputs, which makes it particularly suitable for real-time processing in resource-constrained environments. In summary, the paper aims to speed up LLM inference through a flexible, adaptive method without sacrificing accuracy.
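
The calibration and exit decision described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names (`calibrate_threshold`, `early_exit_predict`), the use of the maximum softmax probability as the confidence metric, and the rule "pick the lowest threshold whose accepted calibration samples agree with the full model at the target rate" are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate_threshold(samples, target_accuracy):
    """Pick a confidence threshold for one exit head (hypothetical rule).

    samples: list of (confidence, agrees) pairs from a calibration set,
    where `agrees` says whether the head's token matched the full model's
    token (the self-supervised label).
    Returns the lowest threshold such that samples with confidence >= it
    agree at a rate of at least target_accuracy, or None if none qualifies.
    """
    ordered = sorted(samples, key=lambda s: s[0], reverse=True)
    best = None
    correct = 0
    for i, (conf, agrees) in enumerate(ordered, start=1):
        correct += int(agrees)
        # Only evaluate at boundaries between distinct confidence values.
        is_boundary = i == len(ordered) or ordered[i][0] != conf
        if is_boundary and correct / i >= target_accuracy:
            best = conf
    return best


def early_exit_predict(head_logits_per_layer, thresholds, final_logits):
    """Return (token_id, exit_index); exit_index is None if no head fired.

    Logits are precomputed here for simplicity. In a real model the later
    layers would only be run if the earlier heads did not exit, which is
    where the savings come from.
    """
    for idx, (logits, thr) in enumerate(zip(head_logits_per_layer, thresholds)):
        probs = softmax(logits)
        conf = max(probs)  # assumed confidence metric: max softmax probability
        if thr is not None and conf >= thr:
            return probs.index(conf), idx
    probs = softmax(final_logits)
    return probs.index(max(probs)), None
```

In this sketch each exit head gets its own threshold from its own calibration pairs, and a head with no qualifying threshold (`None`) is simply never used for exiting.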