Diffsound: Discrete Diffusion Model for Text-to-Sound Generation
Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, Dong Yu
DOI: https://doi.org/10.1109/taslp.2023.3268730
2023-01-01
Abstract: Generating the sound effects that people want is an important topic, yet studies on sound generation remain limited. In this study, we investigate generating sound conditioned on a text prompt and propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a token-decoder, and a vocoder. The framework first uses the token-decoder to map the text features extracted by the text encoder to a mel-spectrogram with the help of the VQ-VAE, and the vocoder then transforms the generated mel-spectrogram into a waveform. We found that the token-decoder significantly influences generation performance, so this study focuses on designing a good token-decoder. We begin with the traditional autoregressive (AR) token-decoder. However, the AR token-decoder always predicts the mel-spectrogram tokens one by one in order, which may introduce unidirectional bias and error accumulation. Moreover, with the AR token-decoder, the sound generation time increases linearly with the sound duration. To overcome these shortcomings of AR token-decoders, we propose a non-autoregressive token-decoder based on the discrete diffusion model, named Diffsound. Specifically, the Diffsound model predicts all of the mel-spectrogram tokens in one step and then refines the predicted tokens in the next step, so the best-predicted results can be obtained by iteration. Our experiments show that the proposed Diffsound model not only produces better generation results than the AR token-decoder (MOS: 3.56 vs. 2.786) but also generates faster.
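The contrast the abstract draws between the two decoding strategies can be illustrated with a toy sketch. This is not the paper's model: `toy_predictor`, `MASK`, `VOCAB`, and the confidence-based schedule are hypothetical stand-ins, and the refinement is simplified so that a token is frozen once it has been kept. The sketch only shows why an AR token-decoder needs one predictor call per token, while an iterative non-autoregressive decoder needs a fixed number of calls regardless of sequence length.

```python
import random

MASK = -1    # hypothetical id for a position that has not been decided yet
VOCAB = 256  # hypothetical codebook size (an assumption, not from the paper)


def toy_predictor(tokens, rng):
    """Stand-in for a learned token-decoder: one call scores every position."""
    out = []
    for tok in tokens:
        if tok == MASK:
            # Random (guess, confidence) pair; a real model would use the text.
            out.append((rng.randrange(VOCAB), rng.random()))
        else:
            out.append((tok, 1.0))  # decided positions are kept unchanged
    return out


def ar_decode(length, rng):
    """Autoregressive decoding: one predictor call per token, left to right."""
    tokens, calls = [MASK] * length, 0
    for i in range(length):
        guess, _ = toy_predictor(tokens, rng)[i]
        tokens[i] = guess
        calls += 1
    return tokens, calls


def iterative_decode(length, steps, rng):
    """Non-autoregressive decoding: predict all positions, keep the most
    confident ones, and re-predict the rest in the next step."""
    tokens, calls = [MASK] * length, 0
    for step in range(1, steps + 1):
        guesses = toy_predictor(tokens, rng)
        calls += 1
        # Integer ceiling of length * step / steps: all tokens kept at the end.
        n_keep = (length * step + steps - 1) // steps
        ranked = sorted(range(length), key=lambda i: guesses[i][1], reverse=True)
        keep = set(ranked[:n_keep])
        tokens = [guesses[i][0] if i in keep else MASK for i in range(length)]
    return tokens, calls


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 200):
        _, ar_calls = ar_decode(n, rng)
        _, it_calls = iterative_decode(n, steps=10, rng=rng)
        print(n, ar_calls, it_calls)
```

In this sketch, doubling the number of tokens doubles the AR call count, while the iterative decoder still uses 10 calls. The step count of 10 is an arbitrary choice for illustration.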
engineering, electrical & electronic; acoustics