Abstract: As Large Language Models (LLMs) have made significant advances across tasks such as question answering, translation, text summarization, and dialogue systems, the accuracy of the information they produce has become crucial, especially for serious financial products serving billions of users, such as Alipay. However, for a real-world product serving millions of users, the inference speed of LLMs is a far more critical factor than it is for a purely experimental model. Hence, this paper presents a generic framework for accelerating the inference process, yielding a substantial increase in speed and a reduction in cost for our LLM-based scenarios while keeping generation accuracy lossless. In the traditional inference process, the LLM generates tokens one at a time, so the time consumed is proportional to the number of generated tokens. To improve on this, our framework, named lookahead, introduces a multi-branch strategy. Instead of generating a single token at a time, we propose a Trie-based retrieval and verification mechanism that can accept several tokens in one forward step. Our strategy offers two distinct advantages: (1) it guarantees absolute correctness of the output, avoiding any approximation algorithms, and (2) the worst-case performance of our approach is equivalent to that of the conventional process. We conduct extensive experiments to demonstrate the significant improvements achieved by applying our inference acceleration framework. Our framework has been widely deployed in Alipay since April 2023 and obtains a remarkable 2.66x to 6.26x speedup. Our code is available at https://github.com/alipay/PainlessInferenceAcceleration.
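The Trie-based retrieval and verification idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `Trie`, `accepted_prefix`, `lookahead_generate`, and `toy_next_token` are hypothetical, the "model" is a toy deterministic function standing in for greedy decoding, and each draft branch is verified with separate calls, whereas a real system would check all branches within one forward pass. The prompt is assumed to be non-empty.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}


class Trie:
    """Stores token sequences so continuations of a prefix can be retrieved."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())

    def branches(self, prefix: Sequence[int], max_len: int) -> List[List[int]]:
        """Return all stored continuations of `prefix`, each cut at max_len tokens."""
        node = self.root
        for tok in prefix:
            if tok not in node.children:
                return []
            node = node.children[tok]
        out: List[List[int]] = []

        def walk(n: TrieNode, path: List[int]) -> None:
            if not n.children or len(path) == max_len:
                if path:
                    out.append(path)
                return
            for tok, child in n.children.items():
                walk(child, path + [tok])

        walk(node, [])
        return out


def accepted_prefix(context: List[int], draft: List[int],
                    next_token: Callable[[List[int]], int]) -> List[int]:
    """Keep draft tokens only while they match what the model itself would emit."""
    accepted: List[int] = []
    for tok in draft:
        if next_token(context + accepted) != tok:
            break
        accepted.append(tok)
    return accepted


def lookahead_generate(prompt: List[int], next_token: Callable[[List[int]], int],
                       trie: Trie, max_new_tokens: int,
                       prefix_len: int = 1, branch_len: int = 4) -> Tuple[List[int], int]:
    """Generate tokens, accepting verified draft branches; returns (tokens, steps)."""
    output = list(prompt)
    steps = 0
    while len(output) - len(prompt) < max_new_tokens:
        steps += 1
        drafts = trie.branches(output[-prefix_len:], branch_len)
        best = max((accepted_prefix(output, d, next_token) for d in drafts),
                   key=len, default=[])
        output.extend(best)
        # Every step also yields one token from the model itself, so with no
        # usable draft the loop degenerates to ordinary one-token decoding.
        output.append(next_token(output))
    return output[:len(prompt) + max_new_tokens], steps


def toy_next_token(context: List[int]) -> int:
    """Hypothetical stand-in for a greedy LLM step: counts upward modulo 5."""
    return (context[-1] + 1) % 5
```

Because a draft token is accepted only when it equals the token the model would have produced anyway, the output of this sketch is identical to plain step-by-step decoding; the Trie only changes how many steps are needed. With an empty Trie, the loop takes exactly one step per token, which mirrors the worst-case claim in the abstract.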

Inference acceleration for large language models using "stairs" assisted greedy generation

Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Inference with Reference: Lossless Acceleration of Large Language Models

LLMCad: Fast and Scalable On-device Large Language Model Inference

Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models

Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy

BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models

Inference Acceleration for Large Language Models on CPUs

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management

Small Language Models Improve Giants by Rewriting Their Outputs

Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Efficient and Economic Large Language Model Inference with Attention Offloading