UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

Wataru Nakata, Kazuki Yamauchi, Dong Yang, Hiroaki Hyodo, Yuki Saito
2024-03-21
Abstract: We present UTDUSS, the UTokyo-SaruLab system submitted to the Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech units learned from large speech corpora for downstream speech processing tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates a neural audio codec (NAC) pre-trained only on speech corpora, so that the learned codec represents the rich acoustic features necessary for high-fidelity speech reconstruction. For the Acoustic+Vocoder track, we trained an acoustic model based on a Transformer encoder-decoder that predicts the pre-trained NAC tokens from text input. We describe our strategies for building these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked second and first in the Vocoder and Acoustic+Vocoder tracks, respectively.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses how to use discrete speech units for high-quality speech processing, in particular text-to-speech (TTS) synthesis. It focuses on two challenge tracks:

1. **Vocoder Track**:
   - The task is to build a vocoder model that converts discrete speech units into the corresponding waveforms.
   - The challenge is to reduce the bitrate while preserving the naturalness of the speech (measured by UTMOS), so that synthesis is both efficient and high quality.
2. **Acoustic + Vocoder Track**:
   - The task is to build a combined system of an acoustic model and a vocoder model: the acoustic model predicts discrete speech units from the input text, and the vocoder converts those units into waveforms.
   - The challenge is to design an efficient text-to-waveform TTS pipeline whose output is natural and accurately reflects the content of the input text.

### Specific Problems and Challenges

- **Application of Discrete Speech Units**: How to use discrete speech units learned from large-scale speech corpora effectively in speech processing tasks, especially text-to-speech synthesis.
- **High-Fidelity Speech Reconstruction**: How to ensure that the discrete units retain enough acoustic detail for high-fidelity speech reconstruction.
- **Data Selection and Pre-processing**: How to select and process training data to improve model performance, for example by excluding data with atypical speaking styles or adjusting the sampling rate.
- **Hyper-parameter Tuning**: How to optimize model performance through hyper-parameter tuning, especially the trade-off between UTMOS and bitrate.
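To make the bitrate side of that trade-off concrete, here is a minimal sketch of how the bitrate of a discrete token stream can be estimated. It assumes a codec that emits a fixed number of codebook indices per frame (as residual vector quantization codecs do). The function name and all numeric values are illustrative assumptions, not settings reported in the paper.

```python
import math


def codec_bitrate_bps(sample_rate_hz, hop_length, num_codebooks, codebook_size):
    """Estimate bits per second of a discrete token stream (hypothetical helper)."""
    # One frame of tokens is produced every `hop_length` audio samples.
    frames_per_second = sample_rate_hz / hop_length
    # Each frame carries one index per codebook; an index costs log2(size) bits.
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frames_per_second * bits_per_frame


# Illustrative values only: 16 kHz audio, hop of 320 samples,
# 2 codebooks with 1024 entries each.
print(codec_bitrate_bps(16000, 320, 2, 1024))
```

Under these assumptions, lowering the sampling rate, increasing the hop length, or using fewer or smaller codebooks all reduce the bitrate, typically at some cost in reconstruction quality.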
### Solutions

The paper proposes the UTDUSS system, which combines a neural audio codec (NAC) with a Transformer-based acoustic model and addresses the problems above as follows:

- **Vocoder Track**:
  - Use the DAC model both to extract discrete speech representations and as the vocoder.
  - Improve the UTMOS score through hyper-parameter tuning, matching the sampling rate, and excluding data with atypical speaking styles.
- **Acoustic + Vocoder Track**:
  - Use the DAC model to extract discrete speech units, and train a Transformer encoder-decoder acoustic model to predict them from text.
  - Adopt a novel sampling strategy that combines top-k and top-p sampling to diversify the output and reduce repeated sequences.
  - Optimize the quality of the generated speech through hyper-parameter tuning.

The UTDUSS system ranked second in the Vocoder track and first in the Acoustic + Vocoder track.
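The combination of top-k and top-p sampling mentioned for the Acoustic + Vocoder track can be illustrated with a minimal sketch. The summary does not say how the two filters are combined, so the order used here (top-k first, then top-p on the renormalized probabilities) is an assumption, as are the function names and default values.

```python
import math
import random


def top_k_top_p_filter(logits, k, p):
    """Return {token_id: probability} after top-k, then top-p filtering."""
    if k < 1:
        raise ValueError("k must be at least 1")

    # Top-k: keep the indices of the k largest logits, best first.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Softmax over the surviving logits only (shifted by the max for stability).
    peak = logits[ranked[0]]
    weights = [math.exp(logits[i] - peak) for i in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the shortest prefix whose cumulative probability reaches p.
    kept, mass = [], 0.0
    for token_id, prob in zip(ranked, probs):
        kept.append((token_id, prob))
        mass += prob
        if mass >= p:
            break

    # Renormalize so the kept probabilities sum to 1.
    return {token_id: prob / mass for token_id, prob in kept}


def sample_token(logits, k=50, p=0.9, rng=random):
    """Draw one token id from the filtered distribution (illustrative defaults)."""
    dist = top_k_top_p_filter(logits, k, p)
    token_ids = list(dist)
    return rng.choices(token_ids, weights=[dist[t] for t in token_ids], k=1)[0]
```

In an autoregressive decoder, `sample_token` would be called once per step on the logits for the next codec token. Smaller `k` and `p` make the output more deterministic, while larger values allow more diverse token sequences.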