Optimized Speculative Sampling for GPU Hardware Accelerators

Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet
2024-10-03
Abstract: In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
Machine Learning, Computation and Language
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **how to speed up speculative sampling on GPU hardware accelerators, and thereby accelerate inference for autoregressive models, particularly in automatic speech recognition (ASR) and text summarization tasks**.

### Problem Background

As autoregressive Transformer models (such as the architecture proposed by Vaswani et al.) are used in more and more downstream tasks, their scale keeps growing. Larger models need more memory and compute, which is a problem in scenarios such as dialogue systems, where strict real-time constraints demand fast inference even when generating long sequences. Because autoregressive decoding is sequential, inference latency grows with both sequence length and model size, which has become a major obstacle to wide deployment. In many cases, however, smaller models can generate accurate tokens with fewer resources. Speculative sampling techniques build on this observation to accelerate autoregressive sampling.

### Paper Objectives

This paper aims to optimize the verification part of speculative sampling to further improve inference speed. Specifically, the authors propose two methods:

1. **Exact Optimization Method**: Exploiting the parallel processing capabilities of modern GPUs, the computation of the intermediate matrices required by speculative sampling is distributed across multiple GPU threads, and matrix segments are computed simultaneously within thread blocks.
2. **Approximate Optimization Method**: The sigmoid function is used as an element-wise approximation of softmax to further accelerate speculative sampling. This costs some accuracy but significantly improves inference speed.

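
For orientation, the verification step that both methods target can be sketched in plain Python. This is a minimal, sequential sketch of the standard speculative sampling acceptance rule, not the authors' GPU kernel. The function name `verify_draft_tokens` and the list-of-lists layout for the draft distributions `q_probs` and target distributions `p_probs` are assumptions made for illustration. The element-wise ratio `p / q` and the residual `max(0, p - q)` are presumably the kind of intermediate quantities that can be computed concurrently across positions and vocabulary entries.

```python
import random


def verify_draft_tokens(draft_tokens, q_probs, p_probs, rng=random):
    """Verify draft tokens against the target model (illustrative sketch).

    draft_tokens: token ids proposed by the draft model
    q_probs[i]:   draft model distribution over the vocabulary at position i
    p_probs[i]:   target model distribution over the vocabulary at position i
    Returns the accepted tokens; on the first rejection, a replacement
    token is appended and verification stops.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the draft token with probability min(1, p(x) / q(x)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the renormalized residual max(0, p - q).
        residual = [max(0.0, pv - qv) for pv, qv in zip(p, q)]
        total = sum(residual)
        weights = [r / total for r in residual]
        accepted.append(rng.choices(range(len(p)), weights=weights)[0])
        break
    return accepted


# When draft and target agree, every draft token is accepted.
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(verify_draft_tokens([0, 1], uniform, uniform))
```

The sketch omits the extra token that is normally sampled from the target model when every draft token is accepted. The loop itself is sequential, but the ratio and the residual for each position do not depend on the outcome at other positions, so they can be precomputed for all positions at once.
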
### Main Contributions

- Implemented an exact and faster variant of speculative sampling optimized for GPU hardware accelerators.
- Explored the use of sigmoid as an element-wise approximation of softmax to achieve faster but non-exact speculative sampling.
- Conducted a comprehensive evaluation on multiple tasks, covering a wide range of draft model and target model combinations.

With these optimizations, the authors reduce profiling time on ASR and text summarization tasks by 6% to 13% with the exact method, with no loss in accuracy, and by 37% to 94% with the sigmoid approximation, with a minor decline in accuracy.
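
To illustrate why an element-wise sigmoid can be cheaper than softmax, the sketch below compares the two on a small vector of logits. It is a plain-Python illustration under stated assumptions, not the paper's formulation: the logits are passed to sigmoid unscaled, and any scaling or normalization the authors apply is not reproduced here. The function names `softmax` and `sigmoid` are local helpers defined for this example.

```python
import math


def softmax(logits):
    """Softmax needs two reductions over the whole vocabulary."""
    m = max(logits)                            # reduction 1: maximum
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)                              # reduction 2: sum
    return [e / s for e in exps]


def sigmoid(logits):
    """Element-wise sigmoid: each output depends only on its own logit."""
    out = []
    for z in logits:
        if z >= 0:
            out.append(1.0 / (1.0 + math.exp(-z)))
        else:
            # Equivalent form that avoids overflow for large negative z.
            ez = math.exp(z)
            out.append(ez / (1.0 + ez))
    return out


logits = [1.0, 2.0, 3.0]
print(sum(softmax(logits)))   # sums to 1 up to rounding
print(sum(sigmoid(logits)))   # generally does not sum to 1
```

Because sigmoid involves no maximum or sum over the vocabulary, each thread can process its own entries without synchronizing with the others. The outputs are no longer a normalized probability distribution, which is consistent with the method being described as non-exact.
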