Abstract: Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly, particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding acceleration, Speculative Decoding (SD) has emerged as a promising solution. However, applying SD to generative recommendation presents unique challenges due to the requirement of generating top-K items (i.e., K distinct token sequences) as a recommendation list by beam search. This leads to more stringent verification in SD, where all the top-K sequences from the target LLM must be successfully drafted by the draft model at each decoding step. To alleviate this, we consider 1) boosting top-K sequence alignment between the draft model and the target LLM, and 2) relaxing the verification strategy to reduce trivial LLM calls. To this end, we propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification. Moreover, we introduce a relaxed sampling verification strategy that allows high-probability non-top-K drafted sequences to be accepted, significantly reducing LLM calls. Correspondingly, we propose AtSpeed-R for top-K alignment under this relaxed sampling verification. Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., near 2x speedup under strict top-K verification and up to 2.5x speedup under relaxed sampling verification. The code and datasets will be released in the near future.

What problem does this paper attempt to address?

This paper addresses how to speed up inference in generative recommendation systems built on large language models (LLMs), in particular how to reduce the excessive latency caused by autoregressive decoding. Specifically:

1. **Problem Background**:
   - LLM-based generative recommendation systems have achieved remarkable success, but they are costly to deploy, largely because of the inference latency of autoregressive decoding.
   - Speculative Decoding (SD) is a promising solution that accelerates decoding without sacrificing accuracy. However, applying SD to generative recommendation poses unique challenges: the recommendation task must produce a list of the top-K items (K distinct token sequences) through beam search, which makes verification more stringent.
2. **Specific Challenges**:
   - In traditional NLP tasks, SD usually needs to produce only one response (N-to-1 verification), whereas recommendation requires K distinct sequences (N-to-K verification). At each decoding step, all K top sequences of the target LLM must appear among the N candidate sequences proposed by the draft model.
   - If any of the K top sequences is missing from the draft, the call to the target LLM cannot be skipped, which limits the speedup.
3. **Solutions**:
   - The authors propose an alignment framework named AtSpeed, which improves top-K alignment between the draft model and the target LLM and reduces unnecessary LLM calls by relaxing the verification strategy.
   - AtSpeed consists of two main components:
     1. **Strict top-K alignment (AtSpeed-S)**: Optimize the draft model to better align with the top-K sequences generated by the target LLM by minimizing the reverse Kullback-Leibler divergence (RKLD) together with a probability density regularization term.
     2. **Relaxed sampling verification (AtSpeed-R)**: Accept high-probability non-top-K drafted sequences, which significantly reduces the number of LLM calls while maintaining recommendation accuracy. AtSpeed-R is the corresponding alignment objective for the draft model under this relaxed verification.
4. **Experimental Results**:
   - AtSpeed significantly accelerates decoding for LLM-based recommendation on two real-world datasets. For example, it achieves nearly a 2-fold speedup under strict top-K verification and up to a 2.5-fold speedup under relaxed sampling verification.

In summary, the paper tackles the slow inference of LLM-based generative recommendation by introducing the AtSpeed framework, with the goal of efficient and accurate recommendation.
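The strict N-to-K check described above can be illustrated with a small sketch. This is not the authors' implementation: the function name `strict_topk_accept`, the toy token sequences, and the idea of passing the target's scores as a plain dictionary are all assumptions made for illustration. The sketch only encodes the acceptance rule stated in the abstract, namely that a drafted step is accepted when every one of the target LLM's top-K sequences is among the N drafted sequences.

```python
from typing import Dict, Iterable, Tuple

Seq = Tuple[int, ...]  # a token sequence, represented as a tuple of token ids


def strict_topk_accept(drafted: Iterable[Seq],
                       target_scores: Dict[Seq, float],
                       k: int) -> bool:
    """Hypothetical strict top-K check for one decoding step.

    drafted:       the N candidate sequences proposed by the draft model
    target_scores: target-LLM log-probabilities for candidate sequences
                   (toy stand-in for a real verification forward pass)
    k:             beam size, i.e. the number of sequences that must match
    """
    # The target's top-K sequences, ranked by log-probability.
    target_topk = sorted(target_scores, key=target_scores.get, reverse=True)[:k]
    drafted_set = set(drafted)
    # Accept only if ALL K target sequences were drafted.
    return all(seq in drafted_set for seq in target_topk)


# Toy scores: the target's top-2 sequences are (1, 2) and (1, 3).
scores = {(1, 2): -0.1, (1, 3): -0.5, (2, 4): -0.9, (2, 5): -2.0}

hit = strict_topk_accept([(1, 2), (1, 3), (2, 5)], scores, k=2)
miss = strict_topk_accept([(1, 2), (2, 4), (2, 5)], scores, k=2)
single = strict_topk_accept([(1, 2), (2, 4), (2, 5)], scores, k=1)
```

In this toy setup, the second draft fails for `k=2` because `(1, 3)` was not drafted, yet the same draft passes for `k=1`. That contrast is the gap between N-to-1 and N-to-K verification that the summary describes.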

Efficient Inference for Large Language Model-based Generative Recommendation

A Decoding Acceleration Framework for Industrial Deployable LLM-based Recommender Systems

Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference

Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

Rethinking Large Language Model Architectures for Sequential Recommendations

SDSAT: Accelerating LLM Inference through Speculative Decoding with Semantic Adaptive Tokens

Graph-Structured Speculative Decoding

SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models

Cascade Speculative Drafting for Even Faster LLM Inference

S2D: Sorted Speculative Decoding For More Efficient Deployment of Nested Large Language Models

Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement

Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding

Improving Multi-candidate Speculative Decoding

Direct Alignment of Draft Model for Speculative Decoding with Chat-Fine-Tuned LLMs

Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

Speculative Streaming: Fast LLM Inference without Auxiliary Models