Efficient Inference for Large Language Model-based Generative Recommendation

Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua
2024-10-08
Abstract: Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly, particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding acceleration, Speculative Decoding (SD) has emerged as a promising solution. However, applying SD to generative recommendation presents unique challenges due to the requirement of generating top-K items (i.e., K distinct token sequences) as a recommendation list by beam search. This leads to more stringent verification in SD, where all the top-K sequences from the target LLM must be successfully drafted by the draft model at each decoding step. To alleviate this, we consider 1) boosting top-K sequence alignment between the draft model and the target LLM, and 2) relaxing the verification strategy to reduce trivial LLM calls. To this end, we propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification. Moreover, we introduce a relaxed sampling verification strategy that allows high-probability non-top-K drafted sequences to be accepted, significantly reducing LLM calls. Correspondingly, we propose AtSpeed-R for top-K alignment under this relaxed sampling verification. Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., near 2x speedup under strict top-K verification and up to 2.5x speedup under relaxed sampling verification. The codes and datasets will be released in the near future.
Information Retrieval, Computation and Language
What problem does this paper attempt to address?
The problem this paper addresses is how to accelerate inference in large language model (LLM)-based generative recommendation, in particular how to reduce the excessive latency caused by autoregressive decoding. Specifically:

1. **Problem Background**:
   - LLM-based generative recommendation has achieved notable success, but it is costly to deploy in practice, largely because of the inference latency caused by autoregressive decoding.
   - Speculative Decoding (SD) is a promising way to accelerate decoding losslessly, that is, without changing the output of the target LLM. Applying SD to generative recommendation raises a unique challenge, however: the recommendation task must generate a list of top-K items (K distinct token sequences) through beam search, which makes verification more stringent.
2. **Specific Challenges**:
   - In traditional NLP tasks, SD usually only needs to produce one response (N-to-1 verification), whereas recommendation needs K distinct sequences (N-to-K verification). At each decoding step, all of the target LLM's top-K sequences must therefore appear among the N drafted candidate sequences.
   - If any of the top-K sequences is missing from the draft, the call to the target LLM cannot be skipped, which limits the speedup.
3. **Solutions**:
   - The authors propose an alignment framework named AtSpeed, which improves top-K alignment between the draft model and the target LLM and reduces unnecessary LLM calls by relaxing the verification strategy.
   - AtSpeed consists of two main components:
     1. **Strict top-K alignment (AtSpeed-S)**: Optimizes the draft model to align with the target LLM's top-K sequences by minimizing the reverse Kullback-Leibler divergence (RKLD) together with a probability density regularization term.
     2. **Relaxed sampling verification (AtSpeed-R)**: Allows high-probability non-top-K drafted sequences to be accepted, which significantly reduces the number of LLM calls while maintaining recommendation accuracy. AtSpeed-R is the corresponding alignment objective for this relaxed verification.
4. **Experimental Results**:
   - On two real-world datasets, AtSpeed significantly accelerates decoding for LLM-based generative recommendation: nearly 2x speedup under strict top-K verification and up to 2.5x speedup under relaxed sampling verification.

In summary, the paper tackles the slow inference of LLM-based generative recommendation by introducing the AtSpeed framework, which speeds up decoding while preserving recommendation accuracy.
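
To make the N-to-K verification described above concrete, here is a minimal Python sketch of the two acceptance checks for a single decoding step. It is an illustration under stated assumptions, not the authors' implementation: the function names (`top_k`, `strict_top_k_accept`, `relaxed_accept`) and the toy scores are hypothetical, and the relaxed check uses the standard speculative-sampling test (accept with probability min(1, p/q)) as a stand-in, which may differ from the paper's exact rule.

```python
from typing import Dict, List, Sequence, Tuple

Seq = Tuple[int, ...]  # a token sequence (one beam), hypothetical representation


def top_k(scores: Dict[Seq, float], k: int) -> List[Seq]:
    """Return the k highest-scoring sequences (ties broken by sequence order)."""
    return sorted(scores, key=lambda s: (-scores[s], s))[:k]


def strict_top_k_accept(
    drafted: Sequence[Seq], target_scores: Dict[Seq, float], k: int
) -> bool:
    """Strict N-to-K check: every one of the target model's top-K
    sequences must appear among the N drafted sequences."""
    return set(top_k(target_scores, k)).issubset(set(drafted))


def relaxed_accept(p_target: float, q_draft: float, u: float) -> bool:
    """Assumed stand-in for relaxed verification: the standard
    speculative-sampling test, accepting when u < min(1, p/q),
    where u is a uniform random draw from [0, 1)."""
    return u < min(1.0, p_target / q_draft)


# Toy target-model scores for four candidate sequences (made-up numbers).
target_scores = {(1, 2): 0.5, (1, 3): 0.3, (2, 1): 0.15, (2, 4): 0.05}

# N = 3 drafted sequences, K = 2.
covers_top_k = strict_top_k_accept([(1, 2), (1, 3), (2, 4)], target_scores, k=2)
misses_top_k = strict_top_k_accept([(1, 2), (2, 1), (2, 4)], target_scores, k=2)
```

In the second call the draft misses `(1, 3)`, one of the target's top-2 sequences, so the strict check fails for that step even though two of the three drafted sequences are plausible. The relaxed check, by contrast, evaluates each drafted sequence on its own probability ratio, which is why a high-probability sequence outside the top-K can still be accepted.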