Abstract: In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.

What problem does this paper attempt to address?

The problem this paper addresses is: **how to speed up speculative sampling on GPU hardware accelerators, and thereby accelerate inference for autoregressive models, particularly in automatic speech recognition (ASR) and text summarization tasks**.

### Problem Background

As autoregressive Transformer models (the architecture proposed by Vaswani et al.) are adopted for more and more downstream tasks, they also keep growing in size. Larger models need more memory and compute, which matters most in applications such as dialogue systems, where strict real-time constraints demand fast inference even when generating long sequences. Because autoregressive decoding is sequential, inference latency grows with both sequence length and model size, which has become a major obstacle to wider deployment. At the same time, smaller models can often generate accurate tokens with fewer resources. Speculative sampling builds on this observation: a small draft model proposes several tokens, and the larger target model verifies them.

### Paper Objectives

The paper optimizes the verification step of speculative sampling to improve inference speed further. The authors propose two methods:

1. **Exact optimization method**: Exploiting the parallel processing capabilities of modern GPUs, the computation of the intermediate matrices required by speculative sampling is distributed across multiple GPU threads, with matrix segments computed simultaneously within thread blocks.
2. **Approximate optimization method**: The sigmoid function is used as an element-wise approximation of softmax to accelerate speculative sampling further. This costs some accuracy but yields a much larger speedup.

### Main Contributions

- Implemented an exact and faster variant of speculative sampling optimized for GPU hardware accelerators.
- Explored sigmoid as an element-wise approximation of softmax to obtain faster but non-exact speculative sampling.
- Conducted a comprehensive evaluation on multiple tasks, covering a wide range of draft model and target model combinations.

With these optimizations, the authors report profiling time improvements of 6% to 13% for the exact method without compromising accuracy, and of 37% to 94% for the sigmoid approximation with a minor decline in accuracy, on ASR and text summarization tasks.
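The verification step that both methods target can be illustrated with a minimal, sequential Python sketch. It follows the standard speculative sampling acceptance rule (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q)); it is not the authors' GPU kernel. The function names, the `to_prob` switch, and the fallback for an all-zero residual are assumptions made for illustration, and the extra token normally sampled from the target model when every draft token is accepted is omitted. Passing `to_prob=sigmoid` only mimics the idea of the approximate method; any logit scaling the paper may apply is not reproduced here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; needs a row-wide max and sum."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Element-wise sigmoid; each entry is independent of the others."""
    out = []
    for z in logits:
        if z >= 0:
            out.append(1.0 / (1.0 + math.exp(-z)))
        else:
            e = math.exp(z)
            out.append(e / (1.0 + e))
    return out


def verify(draft_tokens, draft_logits, target_logits, to_prob=softmax, rng=random):
    """Return the accepted draft tokens, plus one resampled token on rejection.

    draft_logits[i] and target_logits[i] are the logit rows at position i
    (hypothetical layout chosen for this sketch).
    """
    accepted = []
    for tok, q_row, p_row in zip(draft_tokens, draft_logits, target_logits):
        q = to_prob(q_row)
        p = to_prob(p_row)
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q), normalized
        # implicitly by passing it as sampling weights.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        if sum(residual) == 0.0:
            # Assumption: fall back to p if the residual vanishes, which can
            # happen with the unnormalized sigmoid scores.
            residual = p
        accepted.append(rng.choices(range(len(residual)), weights=residual)[0])
        break
    return accepted
```

The ratio and residual are computed independently for each vocabulary entry, which is the kind of element-wise work that can be spread across GPU threads; the sigmoid variant additionally avoids the row-wide max and sum that softmax requires.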

Optimized Speculative Sampling for GPU Hardware Accelerators
