Abstract: The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models of different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced.


### What problem does this paper attempt to solve?

This paper aims to reduce the excessive inference latency of large language models (LLMs) in autoregressive generation tasks. Although Transformer-based LLMs have made remarkable progress in natural language processing, their large size and high runtime cost lead to long inference latencies, which limit their deployment in applications that require real-time responses and make them expensive to run.

#### Main problems:

1. **Inference latency**: In autoregressive generation tasks (such as machine translation and summarization), large language models must generate tokens one by one and cannot exploit token-level parallelism, which increases inference latency.
2. **Low hardware utilization**: Each time a token is generated, the weight matrices and cached key-value pairs must be loaded, so inference is bound by memory bandwidth and hardware utilization stays low.
3. **Limited real-time applications**: Long inference latencies make these models difficult to apply to tasks that require real-time responses, such as online services and real-time dialogue systems.

To solve these problems, the authors propose the Big Little Decoder (BiLD) framework, in which two models of different sizes work together to reduce inference latency with minimal degradation in generation quality. Specifically:

- **Small model**: Generates text autoregressively at a low inference cost.
- **Large model**: Is invoked only occasionally to correct the small model's inaccurate predictions in a non-autoregressive manner.

In addition, the BiLD framework introduces two policies to coordinate the two models:

1. **Fallback Policy**: When the small model's prediction confidence falls below a certain threshold, control is handed over to the large model.
2. **Rollback Policy**: When the large model detects that some of the small model's predictions are inaccurate, it rolls back and replaces those predictions.

In this way, BiLD can significantly reduce inference latency while keeping generation quality high, and it is suitable for various text generation tasks, such as machine translation and summarization.

#### Summary:

The main goal of this paper is to significantly reduce the inference latency of large language models in autoregressive generation tasks with a new framework (BiLD) that requires no changes to the existing training process or model architecture, thereby making real-time applications more feasible and efficient.
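
The interplay of the fallback and rollback policies described above can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the name `bild_generate`, the two threshold parameters, the toy models, and the specific rollback test (reject a drafted token when the large model assigns it a probability below `rollback_threshold`) are all hypothetical choices made for this example. A real system would also score all drafted positions in a single forward pass of the large model; the loop below checks them one by one purely for readability.

```python
from typing import Callable, List, Sequence

# A "model" is any callable that maps a token prefix to a probability
# distribution over the vocabulary. This interface is an assumption of the sketch.
Model = Callable[[Sequence[int]], List[float]]


def argmax(probs: Sequence[float]) -> int:
    """Index of the largest probability (first one on ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def bild_generate(small: Model, large: Model, prompt: Sequence[int],
                  max_new_tokens: int, fallback_threshold: float = 0.6,
                  rollback_threshold: float = 0.2) -> List[int]:
    tokens = list(prompt)
    verified = len(tokens)  # tokens[:verified] were already checked by the large model
    end = len(prompt) + max_new_tokens
    while len(tokens) < end:
        probs = small(tokens)
        if max(probs) >= fallback_threshold:
            # Small model is confident: keep drafting autoregressively.
            tokens.append(argmax(probs))
            continue
        # Fallback: the small model is unsure, so the large model takes over.
        # Rollback: discard the draft from the first token the large model rejects.
        # (Hypothetical criterion; checked position by position for clarity.)
        for pos in range(verified, len(tokens)):
            if large(tokens[:pos])[tokens[pos]] < rollback_threshold:
                del tokens[pos:]
                break
        # The large model supplies the next token itself.
        tokens.append(argmax(large(tokens)))
        verified = len(tokens)
    return tokens


# Toy stand-ins over a 3-token vocabulary (purely illustrative).
VOCAB = 3


def peaked(token: int) -> List[float]:
    """Distribution with 0.9 on `token` and 0.05 on each other token."""
    return [0.9 if i == token else 0.05 for i in range(VOCAB)]


def toy_large(prefix: Sequence[int]) -> List[float]:
    # The "correct" continuation is always (last token + 1) mod 3.
    return peaked((prefix[-1] + 1) % VOCAB)


def toy_small(prefix: Sequence[int]) -> List[float]:
    if len(prefix) % 4 == 0:
        return [0.34, 0.33, 0.33]  # unsure: triggers the fallback policy
    if len(prefix) == 2:
        return peaked((prefix[-1] + 2) % VOCAB)  # confident but wrong
    return peaked((prefix[-1] + 1) % VOCAB)
```

Calling `bild_generate(toy_small, toy_large, [0], 6)` lets the small model draft most tokens, while the confidently wrong draft at position 2 is caught and replaced the next time the small model hands control to the large model. Setting `fallback_threshold=0.0` disables the fallback entirely, so the small model's error survives in the output, which makes the effect of the two policies easy to compare.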

Speculative Decoding with Big Little Decoder

MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

The Synergy of Speculative Decoding and Batching in Serving Large Language Models

SSSD: Simply-Scalable Speculative Decoding

Online Speculative Decoding

SPEED: Speculative Pipelined Execution for Efficient Decoding

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Graph-Structured Speculative Decoding

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge

Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

On Speculative Decoding for Multimodal Large Language Models

Mixture of Attentions For Speculative Decoding

Decoding Speculative Decoding

Accelerating LLM Inference with Staged Speculative Decoding

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Tandem Transformers for Inference Efficient LLMs

Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Cascade Speculative Drafting for Even Faster LLM Inference