SpecPIM: Accelerating Speculative Inference on PIM-Enabled System Via Architecture-Dataflow Co-Exploration

Cong Li, Zhe Zhou, Size Zheng, Jiaxi Zhang, Yun Liang, Guangyu Sun
DOI: https://doi.org/10.1145/3620666.3651352
2024-01-01
Abstract: Inference with generative large language models (LLMs) is inefficient because autoregressive decoding makes each token depend on the tokens before it. Speculative inference has recently been proposed to alleviate this problem: small language models generate draft tokens, and the original large language model verifies them. Although speculative inference makes decoding more efficient, we find that its resource demands vary because the models it uses have distinct computation patterns. This variability prevents current systems from fully realizing the acceleration potential of speculative inference.
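The draft-then-verify loop described in the abstract can be illustrated with a minimal greedy sketch. This is not the paper's implementation: `target_model` and `draft_model` are hypothetical toy functions standing in for the large and small language models, `k` (the number of draft tokens per round) is an assumed parameter, and verification is written as a loop where a real system would use one batched forward pass of the large model.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)


def target_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the large model.
    return (sum(prefix) * 31 + 7) % 50


def draft_model(prefix: List[int]) -> int:
    # Hypothetical stand-in for the small model: it agrees with the target
    # except when the prefix length is a multiple of 4.
    guess = target_model(prefix)
    return (guess + 1) % 50 if len(prefix) % 4 == 0 else guess


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    # Plain autoregressive decoding: one model call per generated token.
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(draft: Model, target: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    tokens = list(prompt)
    goal = len(prompt) + n_new
    verify_rounds = 0
    while len(tokens) < goal:
        # Draft phase: the small model proposes k tokens sequentially.
        drafted: List[int] = []
        for _ in range(k):
            drafted.append(draft(tokens + drafted))

        # Verification phase: the large model checks every drafted position.
        verify_rounds += 1
        accepted: List[int] = []
        for tok in drafted:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the large model's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # All drafts accepted: the large model contributes one extra token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], verify_rounds


if __name__ == "__main__":
    out, rounds = speculative_decode(draft_model, target_model, [1, 2, 3], 12)
    print(out == greedy_decode(target_model, [1, 2, 3], 12), rounds)
```

In this greedy variant the output matches what the large model alone would produce, while the number of verification rounds is smaller than the number of generated tokens. The two phases are kept separate in the code to make visible that the small model runs many short sequential steps and the large model runs fewer, wider checks.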