Adaptive Skeleton Graph Decoding
Shuowei Jin, Yongji Wu, Haizhong Zheng, Qingzhao Zhang, Matthew Lentz, Z. Morley Mao, Atul Prakash, Feng Qian, Danyang Zhuo
2024-02-20
Abstract: Large language models (LLMs) have seen significant adoption for natural
language tasks, owing their success to massive numbers of model parameters
(e.g., 70B+); however, LLM inference incurs significant computation and memory
costs. Recent approaches propose parallel decoding strategies, such as
Skeleton-of-Thought (SoT), to improve performance by breaking prompts down into
sub-problems that can be decoded in parallel; however, they often suffer from
reduced response quality. Our key insight is that we can request additional
information, specifically dependencies and difficulty, when generating the
sub-problems to improve both response quality and performance. In this paper,
we propose Skeleton Graph Decoding (SGD), which uses dependencies exposed
between sub-problems to support information forwarding between dependent
sub-problems for improved quality while exposing parallelization opportunities
for decoding independent sub-problems. Additionally, we leverage difficulty
estimates for each sub-problem to select an appropriately sized model,
improving performance without significantly reducing quality. Compared to
standard autoregressive generation and SoT, SGD achieves a 1.69x speedup while
improving quality by up to 51%.
Computation and Language, Artificial Intelligence
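
The scheduling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `SubProblem` type, the `schedule_waves` and `pick_model` helpers, the 0.5 difficulty threshold, and the model labels are all hypothetical, and the sketch leaves out the prompting step that produces the skeleton graph as well as the forwarding of decoded text between dependent sub-problems. It only shows how dependency information exposes groups of sub-problems that could be decoded in parallel, and how a difficulty estimate could route each sub-problem to a smaller or larger model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class SubProblem:
    name: str
    difficulty: float  # hypothetical score in [0, 1]; higher means harder
    deps: List[str] = field(default_factory=list)  # names this node depends on


def schedule_waves(nodes: List[SubProblem]) -> List[List[str]]:
    """Group sub-problems into waves (layered topological order).

    Every sub-problem in a wave has all of its dependencies in earlier
    waves, so the members of one wave are independent of each other and
    could be decoded in parallel.
    """
    remaining: Dict[str, Set[str]] = {n.name: set(n.deps) for n in nodes}
    done: Set[str] = set()
    waves: List[List[str]] = []
    while remaining:
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves


def pick_model(difficulty: float, threshold: float = 0.5) -> str:
    """Route a sub-problem to a model label (assumed threshold and labels)."""
    return "large-model" if difficulty >= threshold else "small-model"


# Toy skeleton graph with made-up names and difficulty scores.
nodes = [
    SubProblem("intro", 0.2),
    SubProblem("background", 0.3),
    SubProblem("method", 0.8, deps=["intro"]),
    SubProblem("summary", 0.4, deps=["method", "background"]),
]

waves = schedule_waves(nodes)
difficulty = {n.name: n.difficulty for n in nodes}
for i, wave in enumerate(waves):
    for name in wave:
        print(f"wave {i}: {name} -> {pick_model(difficulty[name])}")
```

In this toy graph, `intro` and `background` have no dependencies and land in the same wave, while `summary` must wait for both `method` and `background`. Only `method` crosses the assumed threshold and is routed to the larger model.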