Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations, which significantly compresses the size of speech data. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length. Hence, training time is significantly reduced while retaining notable performance. In this study, we undertake a comprehensive and systematic exploration into the application of discrete units within end-to-end speech processing models. Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora demonstrate that discrete units achieve reasonably good results in almost all the settings. We intend to release our configurations and trained models to foster future research efforts.

What problem does this paper attempt to address?

This paper aims to improve the efficiency and performance of models for automatic speech recognition (ASR), speech translation (ST), and spoken language understanding (SLU) by using discrete speech units. Specifically, it explores how discrete speech units extracted from self-supervised learning (SSL) models can replace traditional high-dimensional continuous speech features, reducing the cost of data storage and transmission while maintaining or improving the model's predictive performance.

### Main research questions:

1. **Improving computational efficiency**: By using discrete speech units, the paper aims to significantly reduce training and inference time without degrading model performance.
2. **Reducing redundancy**: Traditional high-dimensional speech features (such as Mel-spectrograms) contain a large amount of redundant information, which makes model training and inference inefficient. The paper proposes to further compress the sequence length through methods such as de-duplication and subword modeling.
3. **Verifying wide applicability**: The paper conducts experiments on a variety of speech datasets, including clean speech, noisy speech, telephone speech, and multilingual data, to verify the effectiveness and robustness of discrete speech units in different scenarios.
4. **Exploring different discretization methods**: The paper not only uses clustering-based methods but also tries other discretization techniques, such as vector quantization from neural codec models, to compare their effects.

### Experimental setup and results:

- **Datasets**: The paper uses multiple standard datasets, including LibriSpeech, CHiME4, SWBD, Gigaspeech, and TEDLIUM3, covering multiple languages and scenarios.
- **Model architectures**: The experiments adopt multiple end-to-end (E2E) models, including Connectionist Temporal Classification (CTC), Attention-based Encoder-Decoder (AED), and RNN-Transducer.
- **Performance comparison**: Models using discrete speech units perform well in most cases. Their performance generally falls between that of traditional FBank features and that of SSL features, and on some datasets it approaches or even exceeds the performance of SSL features.
- **Efficiency improvement**: Discrete speech units significantly shorten the input sequence, which greatly improves training and inference efficiency. For example, on the LibriSpeech dataset, training with discrete units takes less than half the time of training with FBank features.

### Conclusion:

Through extensive experiments, the paper verifies that discrete speech units are both effective and efficient across a range of speech processing tasks. The findings indicate that discrete speech units can improve the computational efficiency of the model and reduce data storage and transmission requirements while maintaining or even improving performance. Future research can explore further discretization techniques and application scenarios.
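To make the clustering-based discretization and the de-duplication step mentioned above more concrete, here is a minimal Python sketch. It is not the paper's implementation: the function names `assign_units` and `deduplicate`, the 2-D toy "features", and the three centroids are all assumptions chosen for illustration. In practice the features would be SSL representations and the centroids would come from a clustering step such as k-means.

```python
from itertools import groupby


def assign_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Distance is squared Euclidean. The centroids are assumed to be given,
    for example from a k-means run that is not shown here.
    """
    units = []
    for frame in frames:
        distances = [
            sum((x - c) ** 2 for x, c in zip(frame, centroid))
            for centroid in centroids
        ]
        units.append(distances.index(min(distances)))
    return units


def deduplicate(units):
    """Collapse each run of consecutive identical units into one unit."""
    return [unit for unit, _ in groupby(units)]


# Hypothetical toy data: 7 frames of 2-D features and 3 cluster centroids.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [
    (0.1, 0.0), (0.0, 0.1),              # close to centroid 0
    (0.9, 1.1), (1.0, 0.9), (1.1, 1.0),  # close to centroid 1
    (0.1, 1.9),                          # close to centroid 2
    (0.2, 0.1),                          # close to centroid 0 again
]

units = assign_units(frames, centroids)
short_units = deduplicate(units)

print(units)        # [0, 0, 1, 1, 1, 2, 0]
print(short_units)  # [0, 1, 2, 0]
```

In this toy case, de-duplication shortens the sequence from 7 units to 4. Subword modeling, which the summary also mentions, would then merge frequent unit patterns into larger tokens to shorten the sequence further; it is not shown here.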

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study
