Inference acceleration for large language models using "stairs" assisted greedy generation

Domas Grigaliūnas, Mantas Lukoševičius
2024-07-29
Abstract: Large Language Models (LLMs) with billions of parameters are known for their impressive predictive capabilities but require substantial resources to run. With their massive rise in popularity, even a small reduction in required resources could have an impact on the environment. Smaller models, on the other hand, require fewer resources but may sacrifice accuracy. In this work, we propose an implementation of "stairs" assisted greedy generation. It is a modified assisted generation methodology that makes use of a smaller model's fast generation, a large model's batch prediction, and "stairs" validation to achieve a speedup in prediction generation. Results show an inference time reduction of between 9.58 and 17.24 percent compared to stand-alone large LLM prediction in a text generation task, without a loss in accuracy.
Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the high resource cost of running large language models (LLMs). These models are popular for their strong predictive capabilities, but running them demands significant hardware, computation time, and energy, which is costly and raises environmental concerns. The paper proposes a method called "stairs" assisted greedy generation, which combines the fast generation of a small model with the batch prediction of a large model, together with a "stairs" validation mechanism, to generate predictions faster without sacrificing accuracy. The goal is to reduce the inference time of large language models, and with it their resource consumption, while keeping their output unchanged. Experimental results show that, compared to using the large model alone, the method reduces inference time by 9.58% to 17.24% in a text generation task.
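The summary above does not spell out how the "stairs" validation works, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the small model drafts a few tokens, the large model then predicts the next token for each successively longer prefix (the "stairs"), and drafted tokens are kept up to the first disagreement. The names `stairs_assisted_generate`, `draft_len`, `toy_large`, and `toy_small` are hypothetical, and the two toy functions merely stand in for real models.

```python
from typing import Callable, List, Sequence

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def greedy_generate(large_next: NextTokenFn, prompt: Sequence[Token],
                    max_new_tokens: int) -> List[Token]:
    """Baseline: one large-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(large_next(tokens))
    return tokens


def stairs_assisted_generate(large_next: NextTokenFn, small_next: NextTokenFn,
                             prompt: Sequence[Token], max_new_tokens: int,
                             draft_len: int = 4) -> List[Token]:
    """Assumed scheme: draft with the small model, verify with the large one."""
    tokens = list(prompt)
    target_len = len(tokens) + max_new_tokens
    while len(tokens) < target_len:
        k = min(draft_len, target_len - len(tokens))

        # 1. The small model drafts k tokens greedily.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(small_next(tokens + draft))

        # 2. Build the "stairs": each step is one drafted token longer.
        stairs = [tokens + draft[:i] for i in range(k)]

        # 3. The large model predicts the next token for every step.
        #    A real implementation would do this as one padded batch;
        #    the loop here only keeps the sketch simple.
        verified = [large_next(step) for step in stairs]

        # 4. Keep the large model's tokens up to and including the first
        #    position where the draft disagrees, then start a new round.
        for drafted, checked in zip(draft, verified):
            tokens.append(checked)
            if drafted != checked:
                break
    return tokens


def toy_large(tokens: Sequence[Token]) -> Token:
    """Hypothetical stand-in for the large model (deterministic)."""
    return (31 * sum(tokens) + len(tokens)) % 50


def toy_small(tokens: Sequence[Token]) -> Token:
    """Hypothetical stand-in for the small model: wrong on every third length."""
    guess = toy_large(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 50
```

Because every appended token comes from the large model's prediction on a prefix that has already been verified, the output of this sketch matches plain greedy decoding with the large model, which is consistent with the "no loss in accuracy" claim. Any speedup would come from replacing several sequential large-model calls with one batched call per round.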