Abstract: Large Language Models (LLMs) are becoming increasingly important in various domains. However, the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation requires a synchronized update among the partial softmax results, leading to ~20% overhead for the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The matrices involved in GEMM during LLM inference are flat, leading to under-utilized computation and >50% performance loss after zero-padding in previous designs. (3) Performance loss due to static dataflow. Kernel performance in LLMs depends on varied input data features, hardware configurations, etc. A single, static dataflow may lead to a 50.25% performance loss for GEMMs of different shapes in LLM inference. We present FlashDecoding++, a fast LLM inference engine supporting mainstream LLMs and hardware back-ends. To tackle the above challenges, FlashDecoding++ proposes: (1) Asynchronized softmax with unified max value. FlashDecoding++ introduces a unified max value technique for different partial softmax computations to avoid synchronization. (2) Flat GEMM optimization with double buffering. FlashDecoding++ observes that flat GEMMs with different shapes face different bottlenecks, and introduces techniques such as double buffering to address them. (3) Heuristic dataflow with hardware resource adaptation. FlashDecoding++ heuristically optimizes dataflow, using different hardware resources depending on input dynamics. Owing to the versatility of these optimizations, FlashDecoding++ achieves up to 4.86x and 2.18x speedup on NVIDIA and AMD GPUs, respectively, compared to Hugging Face implementations. FlashDecoding++ also achieves an average speedup of 1.37x compared to state-of-the-art LLM inference engines on mainstream LLMs.
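
As an illustration of point (1), the following is a minimal Python sketch, not the FlashDecoding++ kernel itself. It contrasts a partial softmax whose chunks must rescale a shared running sum whenever a larger maximum appears with a variant in which every chunk subtracts the same pre-chosen value, so partial sums can be computed independently and simply added. The function names, the chunk size, and the constant `phi` are assumptions made for this example; the sketch also assumes the inputs stay close enough to `phi` that the exponentials neither overflow nor underflow.

```python
import math

def softmax_reference(x):
    """Standard numerically stable softmax over the full vector."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_partial_synchronized(x, chunk):
    """Partial softmax with a running max.

    Each chunk may raise the running max, and when it does the sum
    accumulated so far has to be rescaled. That rescaling is the
    synchronized update between partial results.
    """
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(x), chunk):
        part = x[start:start + chunk]
        new_max = max(running_max, max(part))
        # Rescale the previous sum to the new max, then add this chunk.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(v - new_max) for v in part)
        running_max = new_max
    return [math.exp(v - running_max) / running_sum for v in x]

def softmax_partial_unified(x, chunk, phi):
    """Partial softmax with one shared subtrahend `phi` (hypothetical value).

    Every chunk uses the same `phi`, so its partial sum does not depend
    on any other chunk and no rescaling step is needed.
    """
    partial_sums = [
        sum(math.exp(v - phi) for v in x[start:start + chunk])
        for start in range(0, len(x), chunk)
    ]
    total = sum(partial_sums)  # plain addition, order does not matter
    return [math.exp(v - phi) / total for v in x]

scores = [0.5, -1.2, 3.0, 2.2, -0.7, 1.1, 0.0, 2.9]
reference = softmax_reference(scores)
synchronized = softmax_partial_synchronized(scores, chunk=3)
unified = softmax_partial_unified(scores, chunk=3, phi=2.0)
```

Both partial variants are mathematically equal to the reference, because the subtracted constant cancels between numerator and denominator. The difference is only in how the partial sums are combined: the unified variant trades the rescaling step for the requirement that `phi` be a reasonable choice for the input range.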

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

Chimera: A Lossless Decoding Method for Accelerating Large Language Models Inference by Fusing all Tokens

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Inference with Reference: Lossless Acceleration of Large Language Models

Accelerating the Training of Large Language Models Using Efficient Activation Rematerialization and Optimal Hybrid Parallelism

SPEED: Speculative Pipelined Execution for Efficient Decoding

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding

Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

Tandem Transformers for Inference Efficient LLMs

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

On Speculative Decoding for Multimodal Large Language Models

Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction

FlashDecoding++: Faster Large Language Model Inference on GPUs