A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Guoqiang Hu, Huaning Tan, Ruilai Li
2024-07-10
Abstract: Acoustic features play an important role in improving the quality of the synthesised speech. Currently, the Mel spectrogram is a widely employed acoustic feature in most acoustic models. However, due to the fine-grained loss caused by its Fourier transform process, the clarity of speech synthesised by Mel spectrogram is compromised in mutant signals. In order to obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). This paradigm introduces an additional task: a more detailed wavelet spectrogram, which, like the post-processing network, takes as input the Mel spectrogram output by the decoder. We choose Tacotron2 and Fastspeech2 for experimental validation in order to test autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results demonstrate that the speech synthesised using the model with the Mel spectrogram enhancement paradigm exhibits higher MOS, with improvements of 0.14 and 0.09 over the respective baseline models. These findings provide some validation for the universality of the enhancement paradigm, as they demonstrate the success of the paradigm in different architectures.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses a limitation of current speech synthesis systems based on the Mel spectrogram: the Fourier transform used to compute the Mel spectrogram loses fine-grained information, which reduces the clarity of the synthesized speech around abrupt signal changes. To obtain a more detailed Mel spectrogram and thus improve the quality of the synthesized speech, the authors propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). Specifically, the paper targets the following problems:

1. **Loss of fine-grained information**: Because the Fourier transform process uses a fixed window length and therefore has limited frequency resolution, the traditional Mel spectrogram captures transient changes and high-frequency details in the speech signal poorly.
2. **Adaptability to abrupt signal changes**: The basis functions of the Fourier transform adapt poorly to abrupt changes, which leads to weak performance on non-stationary signals.
3. **Improving the quality of synthesized speech**: Introducing a more detailed wavelet spectrogram as an auxiliary task pushes the model to learn a more detailed Mel spectrogram, thereby improving the clarity and expressiveness of the synthesized speech.

To address these problems, the authors propose an enhancement paradigm consisting of three main components:

- **Mel Spectrogram Decoder**: Generates the initial Mel spectrogram.
- **CWT-Net**: Takes the decoder's Mel spectrogram as input and generates a more detailed wavelet spectrogram based on the continuous wavelet transform.
- **Post-Net**: Reconstructs and enhances the Mel spectrogram to further improve its quality.

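The data flow between the three components listed above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the L1 loss, the unweighted sum of loss terms (`cwt_weight=1.0`), and the toy stand-in networks `toy_cwt_net` and `toy_post_net` are all assumptions made here, since the summary does not specify the loss functions or network layers. The one property taken from the source is that CWT-Net and Post-Net both read the decoder's Mel spectrogram.

```python
from typing import Callable, Dict, List

Frame = List[float]
Spectrogram = List[Frame]  # time-major: one list of frequency bins per frame


def l1_loss(pred: Spectrogram, target: Spectrogram) -> float:
    """Mean absolute error over all time-frequency bins (assumed loss)."""
    total, count = 0.0, 0
    for pred_frame, target_frame in zip(pred, target):
        for p, t in zip(pred_frame, target_frame):
            total += abs(p - t)
            count += 1
    return total / count


def enhancement_losses(
    decoder_mel: Spectrogram,
    cwt_net: Callable[[Spectrogram], Spectrogram],
    post_net: Callable[[Spectrogram], Spectrogram],
    target_mel: Spectrogram,
    target_cwt: Spectrogram,
    cwt_weight: float = 1.0,  # assumption: the weighting is not given in the summary
) -> Dict[str, float]:
    """Combine the main Mel losses with the auxiliary wavelet-spectrogram loss."""
    # Both heads take the same decoder output as input.
    pred_cwt = cwt_net(decoder_mel)
    refined_mel = post_net(decoder_mel)
    losses = {
        "decoder_mel": l1_loss(decoder_mel, target_mel),
        "postnet_mel": l1_loss(refined_mel, target_mel),
        "cwt": l1_loss(pred_cwt, target_cwt),
    }
    losses["total"] = (
        losses["decoder_mel"] + losses["postnet_mel"] + cwt_weight * losses["cwt"]
    )
    return losses


def toy_cwt_net(mel: Spectrogram) -> Spectrogram:
    """Hypothetical stand-in: doubles the frequency resolution by repeating bins."""
    return [[x for x in frame for _ in range(2)] for frame in mel]


def toy_post_net(mel: Spectrogram) -> Spectrogram:
    """Hypothetical stand-in: adds a constant residual to every bin."""
    return [[x + 0.25 for x in frame] for frame in mel]


decoder_mel = [[0.0, 1.0], [2.0, 3.0]]
target_mel = [[0.5, 1.0], [2.0, 3.5]]
target_cwt = [[0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 3.0, 3.0]]

losses = enhancement_losses(
    decoder_mel, toy_cwt_net, toy_post_net, target_mel, target_cwt
)
print(losses)
```

Because the auxiliary loss is computed from a prediction that depends on `decoder_mel`, its gradient in a real training setup would flow back into the decoder, which is how the auxiliary task can influence the Mel spectrogram itself.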
In experiments on two speech synthesis models with different architectures, Tacotron2 (autoregressive) and Fastspeech2 (non-autoregressive), the method raises MOS by 0.14 and 0.09 over the respective baselines, which offers some evidence that the enhancement paradigm is effective and carries over across architectures.
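The fixed-window limitation mentioned above can be made concrete with a small experiment. The sketch below computes a continuous wavelet transform of a single click by direct summation and measures how wide the response is in time at two scales. The complex Morlet wavelet with `omega0 = 6.0` is an assumption chosen for illustration, as the summary does not say which mother wavelet the paper uses, and the helper names `cwt_row` and `time_spread` are hypothetical.

```python
import cmath
import math
from typing import List


def morlet(t: float, omega0: float = 6.0) -> complex:
    """Complex Morlet mother wavelet (small admissibility correction omitted)."""
    return math.pi ** -0.25 * cmath.exp(1j * omega0 * t) * math.exp(-0.5 * t * t)


def cwt_row(signal: List[float], scale: float, omega0: float = 6.0) -> List[float]:
    """Magnitude of the CWT at one scale, by direct O(N^2) summation."""
    norm = 1.0 / math.sqrt(scale)
    row = []
    for b in range(len(signal)):  # b is the time shift of the wavelet
        acc = 0j
        for k, x in enumerate(signal):
            if x != 0.0:  # skip zero samples; they contribute nothing
                acc += x * morlet((k - b) / scale, omega0).conjugate()
        row.append(abs(acc) * norm)
    return row


def time_spread(row: List[float], threshold: float = 0.5) -> int:
    """Number of samples whose magnitude is at least `threshold` times the peak."""
    peak = max(row)
    return sum(1 for value in row if value >= threshold * peak)


# A single click (abrupt change) at sample 64.
signal = [0.0] * 128
signal[64] = 1.0

rows = {scale: cwt_row(signal, scale) for scale in (2.0, 8.0)}
spreads = {scale: time_spread(row) for scale, row in rows.items()}
print(spreads)  # {2.0: 5, 8.0: 19}
```

At the small scale (high frequency) the click is localized to about 5 samples, while at the larger scale (lower frequency) it spreads over about 19. A short-time Fourier transform would instead smear the click over the same window length in every frequency bin, which is the trade-off the enhancement paradigm tries to work around.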