Abstract: With the rise of Speech Large Language Models (Speech LLMs), there has been growing interest in discrete speech tokens for their ability to integrate with text-based tokens seamlessly. Compared to most studies that focus on continuous speech features, although discrete-token based LLMs have shown promising results on certain tasks, the performance gap between these two paradigms is rarely explored. In this paper, we present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks using a light-weight LLM (Qwen1.5-0.5B). Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding. Moreover, this study goes beyond surface-level comparison by identifying key factors behind the under-performance of discrete tokens, such as limited token granularity and inefficient information retention. To enhance the performance of discrete tokens, we explore potential aspects based on our analysis. We hope our results can offer new insights into the opportunities for advancing discrete speech tokens in Speech LLMs.

What problem does this paper attempt to address?

This paper examines the performance gap between discrete speech tokens and continuous features when they are used with large language models (LLMs) for semantic-related tasks, and the reasons behind that gap. Specifically, the researchers compare the two speech representations across automatic speech recognition (ASR), phoneme recognition (PR), speech translation (ST), keyword spotting (KS), spoken intent classification (IC), and emotion recognition (ER), and explore directions for improving discrete speech tokens on these tasks.

### Main problem points:

1. **Performance gap**: Discrete speech tokens show potential in some tasks, but how do they perform overall compared to continuous features? In tasks requiring fine-grained semantic understanding in particular, how large is the gap between the two?
2. **Influencing factors**: Why do discrete speech tokens perform worse than continuous features in some tasks? The analysis aims to identify key factors, such as limited token granularity and inefficient information retention.
3. **Improvement directions**: Based on that analysis, how can discrete speech tokens be improved so that they reach or approach the level of continuous features in more tasks?

### Research methods:

- **Data sets**: The researchers used multiple data sets, including LibriSpeech, GigaSpeech, Speech Commands v2, SLURP, and IEMOCAP, covering a variety of semantic-related tasks.
- **Models**: The lightweight Qwen1.5-0.5B model served as the main decoder. Experiments were also run on the larger LLaMA3.1-8B model to evaluate how model scale affects the performance of discrete speech tokens.
- **Processing flow**:
  - **Continuous features**: High-dimensional embeddings are mapped to the input space of the LLM through down-sampling and linear adapters.
  - **Discrete tokens**: Discrete tokens are generated with K-means clustering, and the token sequence is then further optimized through deduplication and byte-pair encoding (BPE).

### Key findings:

- **Overall performance**: Continuous features perform better in most tasks, especially in tasks requiring fine-grained semantic understanding.
- **Influencing factors**: The performance of discrete tokens is restricted by limited token granularity, inefficient information retention, and an unbalanced token distribution.
- **Improvement directions**: By using larger-scale LLMs, optimizing layer selection, and improving token-generation methods, the performance of discrete tokens can be significantly improved.

### Conclusion:

This research comprehensively compares discrete speech tokens and continuous features across a variety of semantic-related tasks, reveals the performance gap between the two and the reasons behind it, and proposes potential improvement directions. It provides a useful reference for making better use of discrete speech tokens in large language models.
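The processing flow described in the summary (K-means tokens followed by deduplication on the discrete side, down-sampling on the continuous side) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the function names `assign_tokens`, `deduplicate`, and `stack_frames` are hypothetical, the centroids are hand-picked toy values rather than learned by K-means, frame stacking is only one common way to down-sample and may not be what the paper uses, and the BPE step and the linear adapter are omitted.

```python
from itertools import groupby


def assign_tokens(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    This is the assignment step of K-means; the centroids are assumed
    to have been learned beforehand.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def deduplicate(tokens):
    """Collapse each run of consecutive identical tokens into one token."""
    return [tok for tok, _ in groupby(tokens)]


def stack_frames(frames, factor):
    """Down-sample by concatenating every `factor` consecutive frames.

    Trailing frames that do not fill a complete group are dropped.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [x for frame in frames[i:i + factor] for x in frame]
        for i in range(0, usable, factor)
    ]


# Toy data: six 2-dimensional "speech" frames and three toy centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
frames = [[0.1, -0.1], [0.2, 0.0], [0.9, 1.2],
          [1.1, 0.8], [3.8, 4.1], [0.0, 0.1]]

tokens = assign_tokens(frames, centroids)   # discrete path: one token per frame
units = deduplicate(tokens)                 # shorter sequence after deduplication
stacked = stack_frames(frames, 2)           # continuous path: half the frame rate

print(tokens)
print(units)
print(len(stacked), len(stacked[0]))
```

Deduplication shortens the sequence but discards duration information, since a run of repeated tokens becomes a single token. Stacking lowers the frame rate while keeping every feature value, at the cost of a wider input to the adapter.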

A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models
