Abstract: Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data imparts speech recognition (ASR) abilities to the decoder-only model, hence called Speech-LLaMA. Nevertheless, due to the sequential nature of auto-regressive inference and the relatively large decoder, Speech-LLaMA models require relatively high inference time. In this work, we propose to speed up Speech-LLaMA inference by predicting multiple tokens in the same decoding step. We explore several model architectures that enable this, and investigate their performance using threshold-based and verification-based inference strategies. We also propose a prefix-based beam search decoding method that allows efficient minimum word error rate (MWER) training for such models. We evaluate our models on a variety of public benchmarks, where they reduce the number of decoder calls by ~3.2x while maintaining or improving WER performance.

What problem does this paper attempt to address?

This paper addresses the slow inference speed of Speech-LLaMA models in automatic speech recognition (ASR). Because autoregressive decoding is sequential and the decoder is relatively large, inference time is high. To address this, the authors propose accelerating Speech-LLaMA inference through multi-token prediction.

### Specific description of the problem

1. **Long inference time**: The autoregressive decoder of Speech-LLaMA generates tokens one at a time, which leads to long inference times.
2. **Memory bandwidth limitation**: A large decoder (such as an LLM) must be loaded into compute memory every time a token is generated, so inference is limited by memory bandwidth.
3. **Low utilization of computational resources**: Conventional autoregressive decoding cannot fully use the available compute, which lowers inference efficiency.

### Solutions

To accelerate Speech-LLaMA inference, the authors propose the following methods:

1. **Multi-token prediction**: Predict multiple tokens in a single decoding step, reducing the number of decoding steps required. A sequence of length \( U \), which originally takes \( U \) steps to generate, takes about \( \frac{U}{K} \) steps when all predictions are accepted, where \( K \) is the number of tokens predicted per step.
2. **Model architectures**:
   - **Independent projection heads**: Use multiple independent projection heads to compute the probabilities of multiple tokens in parallel.
   - **Latent space expansion**: Decompose each projection head into a full-rank matrix and a shared unembedding matrix, which reduces the number of additional parameters and makes the model more compact.
3. **Inference strategies**:
   - **Threshold-based selection**: Set a threshold \( \tau \) and accept the predicted tokens that satisfy it.
   - **Verification-based selection**: Combine prediction with a verification step so that the generated sequence is consistent with the result of autoregressive decoding.
4. **Training objective**:
   - Extend the cross-entropy loss to cover all \( K \) predictions.
   - Use minimum word error rate (MWER) sequence-discriminative training to improve the robustness of the model.

### Experimental results

The authors evaluate the proposed method on multiple public benchmark datasets. The results show that multi-token prediction reduces the number of decoder calls by about 3.2x while maintaining or improving ASR performance, thereby accelerating inference.

### Summary

The main contribution of this paper is the introduction of multi-token prediction to Speech-LLaMA, which reduces the cost of slow autoregressive inference in ASR while maintaining good recognition performance.
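The two inference strategies summarized above can be illustrated with a small Python sketch. It is not the authors' implementation: the function names, the "stop at the first token below \( \tau \)" rule, and the "accept the longest prefix that matches greedy autoregressive predictions" rule are assumptions made for illustration, and the token ids and probabilities are invented toy values rather than model outputs.

```python
import math
from typing import List, Sequence


def min_decoder_calls(num_tokens: int, k: int) -> int:
    """Best-case number of decoder calls when all k predictions are accepted."""
    return math.ceil(num_tokens / k)


def accept_by_threshold(
    head_tokens: Sequence[int], head_probs: Sequence[float], tau: float
) -> List[int]:
    """Threshold-based selection (assumed rule, for illustration).

    head_tokens[i] and head_probs[i] are the top token and its probability
    from the (i+1)-th prediction head in one decoding step. The first head
    is the ordinary next-token prediction, so its token is always kept.
    Later tokens are kept until one falls below tau.
    """
    if not head_tokens:
        return []
    accepted = [head_tokens[0]]
    for token, prob in zip(head_tokens[1:], head_probs[1:]):
        if prob < tau:
            break  # everything after an uncertain token is discarded
        accepted.append(token)
    return accepted


def accept_by_verification(
    draft_tokens: Sequence[int], autoregressive_next: Sequence[int]
) -> List[int]:
    """Verification-based selection (assumed rule, for illustration).

    draft_tokens are the K tokens proposed in the previous step.
    autoregressive_next[i] is the token the first head predicts after
    the prefix ending in draft_tokens[i]. A draft token is kept only if
    it equals what greedy autoregressive decoding would have produced.
    """
    if not draft_tokens:
        return []
    accepted = [draft_tokens[0]]
    for i in range(1, len(draft_tokens)):
        if autoregressive_next[i - 1] != draft_tokens[i]:
            break  # first mismatch ends the accepted prefix
        accepted.append(draft_tokens[i])
    return accepted


if __name__ == "__main__":
    tokens = [5, 9, 2, 7]            # toy token ids from K = 4 heads
    probs = [0.99, 0.90, 0.40, 0.95]  # toy probabilities
    print(accept_by_threshold(tokens, probs, tau=0.8))
    print(accept_by_verification(tokens, [9, 2, 3]))
    print(min_decoder_calls(10, 4))
```

In this sketch, threshold-based selection needs no extra checking but may accept a token that greedy decoding would not have produced, whereas verification-based selection only keeps tokens that agree with the first head's own predictions. The fewer tokens accepted per step, the further the actual number of decoder calls falls short of the best case computed by `min_decoder_calls`.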

Faster Speech-LLaMA Inference with Multi-token Prediction
