Cascade Speculative Drafting for Even Faster LLM Inference

Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin Chen-Chuan Chang, Jie Huang
2024-02-27
Abstract: Introduced to enhance the efficiency of large language model (LLM) inference, speculative decoding operates by having a smaller model generate a draft. A larger target model then reviews this draft to align with its output, and any acceptance by the target model reduces the number of target model runs, ultimately improving efficiency. However, the drafting process in speculative decoding includes slow autoregressive generation and allocates equal time to generating tokens, irrespective of their importance. These inefficiencies collectively contribute to the suboptimal performance of speculative decoding. To further improve LLM inference, we introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal Cascade optimizes time allocation in drafting for improved efficiency. Combining both cascades, CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments, while maintaining the same output distribution as the target model. Our code is publicly available at https://github.com/lfsszd/CS-Drafting.
Machine Learning, Computation and Language
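The draft-then-review step described in the abstract can be sketched on a toy vocabulary. This is a minimal illustration of the standard speculative sampling acceptance rule for a single token, not the authors' CS Drafting code; the function names (`speculative_step`, `empirical`) and the toy distributions are assumptions made for illustration. It shows why an accepted or corrected draft token still follows the target model's distribution.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """Sample one token by draft-then-verify; returns (token, accepted)."""
    tokens = list(range(len(p_target)))
    # The draft model proposes a token from its own distribution q.
    x = rng.choices(tokens, weights=q_draft)[0]
    # The target model accepts the proposal with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(tokens, weights=residual)[0], False


def empirical(p_target, q_draft, n=100_000, seed=0):
    """Estimate the output distribution and the acceptance rate by simulation."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    accepted = 0
    for _ in range(n):
        tok, ok = speculative_step(p_target, q_draft, rng)
        counts[tok] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # toy target distribution (assumed)
    q = [0.2, 0.5, 0.3]  # toy draft distribution (assumed)
    freqs, rate = empirical(p, q)
    print("output frequencies:", freqs)
    print("acceptance rate:", rate)
```

For these toy distributions the acceptance rate is the sum of min(p, q) over tokens, which is 0.7, and the simulated output frequencies should approach the target distribution p even though the draft distribution q is quite different.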
What problem does this paper attempt to address?
This paper addresses the high latency of large language model (LLM) inference. Because autoregressive generation produces output one token at a time, inference becomes increasingly time-consuming as models grow, especially in long-text generation. Speculative decoding mitigates this by having a small model generate a draft that the large target model then reviews, which reduces the number of target-model runs. However, drafting itself still relies on slow autoregressive generation, and it spends equal time on every draft token regardless of its importance. Together, these factors make speculative decoding suboptimal.

To improve inference efficiency further, the paper introduces Cascade Speculative Drafting (CS Drafting), which refines speculative decoding with two mechanisms. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal Cascade optimizes how drafting time is allocated. Combined, they give CS Drafting up to an 81% additional speedup over speculative decoding in the authors' experiments, while maintaining the same output distribution as the target model.

The main contributions of the paper are:
1. Introducing the CS Drafting algorithm, which improves the inference speed of language models without sacrificing generation quality.
2. Providing a theoretical analysis that supports the effectiveness of CS Drafting.
3. Showing experimentally that CS Drafting achieves further speedup over speculative decoding across different tasks and settings.
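
The point about equal time being spent on every draft token can be made concrete with a small calculation. The sketch below uses the standard simplified analysis of speculative decoding, not the paper's own derivation: it assumes each draft token is accepted independently with the same rate `alpha`, and that a draft token is only useful if every token before it was accepted. The function names are hypothetical.

```python
def position_use_probability(alpha, k):
    """Probability that draft position i (1-indexed) is reached and accepted.

    Assumes an i.i.d. per-token acceptance rate alpha, so position i is
    useful only if all i tokens up to and including it are accepted.
    """
    return [alpha ** i for i in range(1, k + 1)]


def expected_tokens_per_target_run(alpha, k):
    """Expected tokens produced per target-model run with a draft of length k.

    The target run always contributes one token (a correction or a bonus
    token), plus each draft position weighted by its use probability.
    Equals the closed form (1 - alpha**(k + 1)) / (1 - alpha) for alpha < 1.
    """
    return 1.0 + sum(position_use_probability(alpha, k))


if __name__ == "__main__":
    alpha, k = 0.8, 5  # assumed acceptance rate and draft length
    for i, prob in enumerate(position_use_probability(alpha, k), start=1):
        print(f"draft position {i}: used with probability {prob:.3f}")
    print("expected tokens per target run:",
          round(expected_tokens_per_target_run(alpha, k), 5))
```

Under these assumptions the usefulness of a draft position decays geometrically, so later positions are less likely to matter even though a single draft model spends the same time on each of them. This is consistent with the motivation given above for reallocating drafting time.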