Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

Xuankai Chang, Brian Yan, Kwanghee Choi, Jeeweon Jung, Yichen Lu, Soumi Maiti, Roshan Sharma, Jiatong Shi, Jinchuan Tian, Shinji Watanabe, Yuya Fujita, Takashi Maekaku, Pengcheng Guo, Yao-Fei Cheng, Pavel Denisov, Kohei Saijo, Hsiu-Hsuan Wang
2023-09-28
Abstract: Speech signals, typically sampled at rates in the tens of thousands per second, contain redundancies, evoking inefficiencies in sequence modeling. High-dimensional speech features such as spectrograms are often used as the input for the subsequent model. However, they can still be redundant. Recent investigations proposed the use of discrete speech units derived from self-supervised learning representations, which significantly compresses the size of speech data. Applying various methods, such as de-duplication and subword modeling, can further compress the speech sequence length. Hence, training time is significantly reduced while retaining notable performance. In this study, we undertake a comprehensive and systematic exploration into the application of discrete units within end-to-end speech processing models. Experiments on 12 automatic speech recognition, 3 speech translation, and 1 spoken language understanding corpora demonstrate that discrete units achieve reasonably good results in almost all the settings. We intend to release our configurations and trained models to foster future research efforts.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper aims to improve the efficiency and performance of models for Automatic Speech Recognition (ASR), Speech Translation (ST), and Spoken Language Understanding (SLU) by using discrete speech units. Specifically, it explores replacing traditional high-dimensional continuous speech features with discrete speech units extracted from self-supervised learning (SSL) models, in order to reduce data storage and transmission costs while maintaining or improving the model's predictive performance.

### Main research questions:

1. **Improving computational efficiency**: By using discrete speech units, the paper aims to significantly reduce training and inference time without degrading model performance.
2. **Reducing redundancy**: Traditional high-dimensional speech features (such as Mel-spectrograms) contain a large amount of redundant information, which makes model training and inference inefficient. The paper proposes to further compress the sequence length through methods such as de-duplication and subword modeling.
3. **Verifying wide applicability**: The paper conducts experiments on a variety of speech datasets, including clean read speech, noisy data, telephone data, and multilingual data, to verify the effectiveness and robustness of discrete speech units in different scenarios.
4. **Exploring different discretization methods**: The paper not only uses clustering-based methods but also tries other discretization techniques, such as vector quantization based on neural codec models, to compare their effects.

### Experimental setup and results:

- **Datasets**: The paper uses multiple standard datasets, including LibriSpeech, CHiME4, SWBD, Gigaspeech, and TEDLIUM3, covering multiple languages and scenarios.
- **Model architectures**: The experiments adopt multiple end-to-end (E2E) models, including Connectionist Temporal Classification (CTC), Attention-based Encoder-Decoder (AED), and RNN-Transducer.
- **Performance comparison**: Models using discrete speech units perform well in most cases. Their performance generally falls between that of traditional FBank features and that of SSL features, and on some datasets it approaches or even exceeds that of SSL features.
- **Efficiency improvement**: Discrete speech units significantly shorten the input sequence, which greatly improves training and inference efficiency. For example, on the LibriSpeech dataset, training with discrete units takes less than half the time of training with FBank features.

### Conclusion:

Through extensive experiments, the paper verifies the effectiveness and efficiency of discrete speech units across a range of speech processing tasks. The findings indicate that discrete speech units can improve computational efficiency and reduce data storage and transmission requirements while maintaining or even improving performance. Future research can explore further discretization techniques and application scenarios.
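
To make the compression steps mentioned above more concrete (clustering-based discretization, de-duplication, and subword modeling), here is a minimal, hedged sketch in plain Python. It is not the authors' implementation: the function names `assign_units`, `deduplicate`, and `bpe_merge_once` are hypothetical, the one-dimensional "features" and centroids are toy values standing in for SSL features and learned cluster centers, and the single merge step only illustrates the idea behind BPE-style subword modeling.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def assign_units(frames: Sequence[Sequence[float]],
                 centroids: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature frame to the index of its nearest centroid.

    This mirrors the assignment step of k-means; the centroids are
    assumed to have been learned beforehand.
    """
    units = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every centroid.
        dists = [sum((x - c) ** 2 for x, c in zip(frame, centroid))
                 for centroid in centroids]
        units.append(dists.index(min(dists)))
    return units


def deduplicate(units: Sequence[int]) -> List[int]:
    """Collapse runs of identical consecutive units into a single unit."""
    out: List[int] = []
    for u in units:
        if not out or out[-1] != u:
            out.append(u)
    return out


def bpe_merge_once(tokens: Sequence[Tuple[int, ...]]) -> List[Tuple[int, ...]]:
    """Apply one BPE-style merge: fuse the most frequent adjacent pair."""
    if len(tokens) < 2:
        return list(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    (left, right), _ = pair_counts.most_common(1)[0]
    merged: List[Tuple[int, ...]] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)  # concatenate the two unit tuples
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy 1-D "features" and centroids (illustrative values only).
frames = [[0.1], [0.2], [0.9], [1.1], [1.0], [0.0], [2.1]]
centroids = [[0.0], [1.0], [2.0]]

units = assign_units(frames, centroids)
print(units)               # [0, 0, 1, 1, 1, 0, 2]
print(deduplicate(units))  # [0, 1, 0, 2]

tokens = [(u,) for u in [3, 7, 3, 7, 5]]
print(bpe_merge_once(tokens))  # [(3, 7), (3, 7), (5,)]
```

In this toy run, seven frames become four units after de-duplication, and a single merge shortens a five-unit sequence to three tokens. Repeated merges over a real corpus of unit sequences would shorten the input further, which is the source of the efficiency gains the summary describes.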