Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao
2024-04-18
Abstract: Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly evident with autoregressive decoding methods, which generate one token per forward pass and therefore do not fully exploit the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely hidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the pseudo hidden states of the future tokens to be generated; the pseudo hidden states then pass through the following transformer layers, thereby assimilating more semantic information and achieving superior predictive accuracy for the future tokens.
Computation and Language
What problem does this paper attempt to address?
The paper proposes a new parallel decoding method called hidden transfer to reduce inference latency in large language models (LLMs). Traditional autoregressive decoding generates one token per forward pass, which does not fully utilize the parallel computing capability of GPUs. The paper instead decodes multiple successive tokens in a single forward pass by transferring the intermediate hidden states of the existing context to pseudo hidden states for the tokens still to be generated. These pseudo hidden states then pass through the remaining transformer layers, where they absorb more semantic information and yield more accurate predictions of the future tokens. Additionally, a tree-like attention mechanism generates and verifies multiple candidate output sequences concurrently, which keeps generation lossless while improving efficiency. According to the paper's experiments, the method outperforms other single-model acceleration techniques on acceleration metrics.
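The lossless property described above can be illustrated with a minimal sketch of the verification step, assuming greedy decoding. This is not the paper's implementation: `verify_step`, `longest_accepted_prefix`, and the toy `greedy_next` model are hypothetical names introduced here, and the draft candidates are simply given as input rather than produced by hidden transfer. The paper checks all candidates in one forward pass with tree-like attention; the sketch calls the model once per position only for readability.

```python
from typing import Callable, List, Sequence

Token = int
NextFn = Callable[[List[Token]], Token]  # hypothetical: base model's greedy next token


def longest_accepted_prefix(draft: Sequence[Token], verified: Sequence[Token]) -> int:
    """Count leading draft tokens that equal the base model's own greedy choices."""
    accepted = 0
    for drafted, expected in zip(draft, verified):
        if drafted != expected:
            break
        accepted += 1
    return accepted


def verify_step(
    context: Sequence[Token],
    candidates: Sequence[Sequence[Token]],
    greedy_next: NextFn,
) -> List[Token]:
    """Keep the longest verified candidate prefix, then add one base-model token."""
    best: List[Token] = []
    for cand in candidates:
        # Token the base model would emit at each draft position, given the
        # context plus the draft tokens before it (assumption: a real system
        # obtains these in a single batched pass, not one call per position).
        verified = [
            greedy_next(list(context) + list(cand[:i])) for i in range(len(cand))
        ]
        k = longest_accepted_prefix(cand, verified)
        if k > len(best):
            best = list(cand[:k])
    # The base model's token after the accepted prefix is always correct,
    # so every step emits at least one token.
    bonus = greedy_next(list(context) + best)
    return best + [bonus]


def toy_next(tokens: List[Token]) -> Token:
    """Toy stand-in for a language model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


if __name__ == "__main__":
    context = [1, 2, 3]
    drafts = [[4, 5, 9], [4, 7, 8]]  # made-up candidate continuations
    print(verify_step(context, drafts, toy_next))
```

Because a draft token is kept only when it equals the token the base model would have chosen anyway, the emitted sequence matches plain autoregressive greedy decoding; better drafts only change how many tokens are accepted per step.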