Abstract: We present UTDUSS, the UTokyo-SaruLab system submitted to the Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge. The challenge focuses on using discrete speech units learned from large speech corpora for some tasks. We submitted our UTDUSS system to two text-to-speech tracks: Vocoder and Acoustic+Vocoder. Our system incorporates a neural audio codec (NAC) pre-trained only on speech corpora, which makes the learned codec represent the rich acoustic features that are necessary for high-fidelity speech reconstruction. For the Acoustic+Vocoder track, we trained an acoustic model based on a Transformer encoder-decoder that predicts the pre-trained NAC tokens from text input. We describe our strategies for building these models, such as data selection, downsampling, and hyper-parameter tuning. Our system ranked second and first in the Vocoder and Acoustic+Vocoder tracks, respectively.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is to use discrete speech units for high-quality speech processing tasks, especially in text-to-speech (TTS) synthesis applications. Specifically, the paper mainly focuses on two aspects:

1. **Vocoder Track**:
   - The task is to build a vocoder model that can convert discrete speech units into corresponding waveforms.
   - The challenge lies in how to reduce the bitrate while ensuring the naturalness of speech (measured by UTMOS), so as to achieve efficient and high-quality speech synthesis.
2. **Acoustic + Vocoder Track**:
   - The task is to build a combined system of an acoustic model and a vocoder model, where the acoustic model predicts discrete speech units from the input text, and the vocoder then converts these discrete units into waveforms.
   - The challenge lies in how to design an efficient end-to-end TTS system so that the generated speech is both natural and accurately reflects the content of the input text.

### Specific Problems and Challenges

- **Application of Discrete Speech Units**: How to effectively use discrete speech units learned from large-scale speech corpora for speech processing tasks, especially text-to-speech synthesis.
- **High-Fidelity Speech Reconstruction**: How to ensure that the generated speech has rich acoustic features to achieve high-fidelity speech reconstruction.
- **Data Selection and Pre-processing**: How to select and process training data to improve model performance. For example, excluding data with untypical speaking styles, adjusting the sampling rate, etc.
- **Hyper-parameter Tuning**: How to optimize model performance through hyper-parameter tuning, especially the trade-off between UTMOS and bitrate.
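
To make the bitrate side of this trade-off concrete, here is a minimal sketch of how the nominal bitrate of a discrete-token codec is commonly estimated: tokens per second times bits per token. The function name and the example values (75 frames per second, 2 codebooks, 1024 entries each) are illustrative assumptions, not settings reported by the paper, and the challenge's official bitrate definition may differ.

```python
import math

def codec_bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Nominal bitrate in bits per second for a codec with fixed-size codebooks.

    Each frame emits one index per codebook, and each index carries
    log2(codebook_size) bits when every entry is equally likely.
    """
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token

# Hypothetical configuration, chosen only for illustration.
bitrate = codec_bitrate_bps(frame_rate_hz=75, num_codebooks=2, codebook_size=1024)
print(bitrate)  # 1500.0
```

Under this estimate, lowering the frame rate (for example by downsampling), using fewer codebooks, or shrinking each codebook all reduce the bitrate, and each is expected to cost some reconstruction quality.
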
### Solutions

The paper proposes the UTDUSS system, which adopts a Neural Audio Codec (NAC) and a Transformer-based acoustic model, and addresses the above problems through the following strategies:

- **Vocoder Track**:
  - Use the DAC model for discrete speech representation extraction and vocoder tasks.
  - Improve the UTMOS score through hyper-parameter tuning, matching the sampling rate, and excluding data with untypical speaking styles.
- **Acoustic + Vocoder Track**:
  - Use the DAC model for discrete speech unit extraction, and train a Transformer encoder-decoder-based acoustic model.
  - Adopt a novel sampling strategy combining top-k and top-p sampling to diversify the output and reduce the problem of repeated sequences.
  - Optimize the quality of the generated speech through hyper-parameter tuning.

Finally, the UTDUSS system ranked second in the Vocoder track and first in the Acoustic + Vocoder track.
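
The summary does not spell out how the paper combines top-k and top-p sampling, so the following is only a sketch of one common composition: keep the k most probable tokens, then keep the smallest prefix of those whose cumulative probability reaches p, renormalize, and sample. The function names and the default values of `k` and `p` are assumptions made for illustration and are not taken from the paper.

```python
import math
import random

def top_k_top_p_filter(logits, k, p):
    """Return {token_id: probability} after top-k, then top-p filtering."""
    # Softmax, subtracting the maximum logit for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: indices of the k most probable tokens, most probable first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Top-p: shortest prefix of the ranked tokens whose mass reaches p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}

def sample_token(logits, k=50, p=0.9, rng=random):
    """Draw one token id from the filtered, renormalized distribution."""
    dist = top_k_top_p_filter(logits, k, p)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

# Toy distribution over four tokens with probabilities 0.5, 0.3, 0.15, 0.05.
logits = [math.log(q) for q in (0.5, 0.3, 0.15, 0.05)]
dist = top_k_top_p_filter(logits, k=3, p=0.7)  # keeps tokens 0 and 1 only
```

In an autoregressive decoder, `sample_token` would be called once per step on that step's logits. Sampling from a truncated distribution rather than always taking the most probable token is the general mechanism by which such strategies diversify output and discourage repeated sequences.
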

UTDUSS: UTokyo-SaruLab System for Interspeech2024 Speech Processing Using Discrete Speech Unit Challenge

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

The X-LANCE Technical Report for Interspeech 2024 Speech Processing Using Discrete Speech Unit Challenge

Transformer VQ-VAE for Unsupervised Unit Discovery and Speech Synthesis: ZeroSpeech 2020 Challenge

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

The DKU-DukeECE-Lenovo System for the Diarization Task of the 2021 VoxCeleb Speaker Recognition Challenge

The huya multi-speaker and multi-style speech synthesis system for m2voc challenge 2020

The USTC System for Blizzard Challenge 2012

Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE

Direct Text to Speech Translation System using Acoustic Units

The DKU-MSXF Diarization System for the VoxCeleb Speaker Recognition Challenge 2023

Transsion TSUP's speech recognition system for ASRU 2023 MADASR Challenge

VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023

Exploring Speech Recognition, Translation, and Understanding with Discrete Speech Units: A Comparative Study

USTC-KXDIGIT System Description for ASVspoof5 Challenge

U-DiT TTS: U-Diffusion Vision Transformer for Text-to-Speech

The NPU System for the 2020 Personalized Voice Trigger Challenge

The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech

The USTC System for Cadenza 2024 Challenge

The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge

The NTU-AISG Text-to-speech System for Blizzard Challenge 2020