Abstract: Acoustic features play an important role in improving the quality of synthesised speech. Currently, the Mel spectrogram is the acoustic feature used by most acoustic models. However, because its Fourier transform step loses fine-grained detail, the clarity of speech synthesised from Mel spectrograms is compromised on abruptly changing signals. In order to obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). This paradigm introduces an additional task: predicting a more detailed wavelet spectrogram with a network that, like the post-processing network, takes as input the Mel spectrogram output by the decoder. We choose Tacotron2 and Fastspeech2 for experimental validation in order to test autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results demonstrate that speech synthesised using the models with the Mel spectrogram enhancement paradigm achieves a higher MOS, with improvements of 0.14 and 0.09 over the respective baseline models. These findings provide some validation for the universality of the enhancement paradigm, as they demonstrate its success in different architectures.
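
The contrast with a fixed-window Fourier analysis can be made concrete with a minimal CWT sketch in plain Python. This is an illustration only, not the paper's implementation: the Ricker (Mexican-hat) wavelet, the scales, the toy signal, and the names `ricker` and `cwt` are all assumptions chosen for brevity.

```python
import math

def ricker(t, scale):
    """Ricker (Mexican-hat) wavelet at offset t, dilated by `scale` (assumed wavelet)."""
    x = t / scale
    return (1.0 - x * x) * math.exp(-0.5 * x * x) / math.sqrt(scale)

def cwt(signal, scales):
    """Naive CWT: correlate the signal with the wavelet at each scale.

    Returns one row of coefficients per scale. Samples outside the signal
    are treated as zero.
    """
    n = len(signal)
    rows = []
    for s in scales:
        half = int(math.ceil(5 * s))  # wavelet support widens with scale
        row = []
        for i in range(n):
            acc = 0.0
            for k in range(-half, half + 1):
                j = i + k
                if 0 <= j < n:
                    acc += signal[j] * ricker(k, s)
            row.append(acc)
        rows.append(row)
    return rows

# Toy signal (hypothetical): a slow sine plus a one-sample click at index 120.
N = 256
signal = [math.sin(2 * math.pi * 4 * i / N) for i in range(N)]
signal[120] += 5.0

scales = [1.0, 4.0, 16.0]
coeffs = cwt(signal, scales)

# At the smallest scale the strongest response sits exactly on the click.
peak_index = max(range(N), key=lambda i: abs(coeffs[0][i]))
print(peak_index)
```

Because the wavelet's support widens with scale, small scales give precise timing for abrupt events while larger scales trade that precision for sensitivity to slower variation. A single fixed analysis window cannot make this trade-off per frequency band.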

What problem does this paper attempt to address?

The problem this paper attempts to solve is that, in current speech synthesis systems based on the Mel spectrogram, the clarity of the synthesized speech suffers on abrupt-change signals because the Fourier transform process loses fine-grained information. To obtain a more detailed Mel spectrogram and thus improve the quality of the synthesized speech, the authors propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). Specifically, the paper addresses the following problems:

1. **Loss of fine-grained information**: Because of the fixed window length and limited frequency resolution of the Fourier transform process, the traditional Mel spectrogram cannot capture transient changes and high-frequency details in the speech signal well.
2. **Adaptability to abrupt-change signals**: The basis functions of the Fourier transform adapt poorly to abrupt changes, resulting in poor performance on non-stationary signals.
3. **Improving the quality of synthesized speech**: Introducing a more detailed wavelet spectrogram as an auxiliary task pushes the Mel spectrogram to learn finer detail, thereby improving the clarity and expressiveness of the synthesized speech.

To solve these problems, the authors propose an enhancement paradigm consisting of three main components:

- **Mel Spectrogram Decoder**: Generates the initial Mel spectrogram.
- **CWT-Net**: Uses the continuous wavelet transform to refine the Mel spectrogram and generate a more detailed wavelet spectrogram.
- **Post-Net**: Reconstructs and enhances the Mel spectrogram to further improve its quality.

In experiments on two speech synthesis models with different architectures, Tacotron2 (autoregressive) and Fastspeech2 (non-autoregressive), the method raises MOS by 0.14 and 0.09 respectively over the baselines, which provides some evidence for the effectiveness and universality of the enhancement paradigm.
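
The multi-task structure described above can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' training code: the summary only says that the Post-Net and CWT-Net both take the decoder's Mel spectrogram as input. The use of mean squared error, the `cwt_weight` parameter, the toy stand-in networks, and every function name below are hypothetical.

```python
def mse(pred, target):
    """Mean squared error over two equally shaped 2-D lists (frames x bins)."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += (p - t) ** 2
            count += 1
    return total / count

def forward(decoder_mel, post_net, cwt_net):
    """Both branches read the same decoder output."""
    return post_net(decoder_mel), cwt_net(decoder_mel)

def enhancement_loss(decoder_mel, post_mel, wavelet_pred,
                     mel_target, wavelet_target, cwt_weight=1.0):
    """Assumed objective: two Mel terms plus a weighted auxiliary wavelet term."""
    mel_terms = mse(decoder_mel, mel_target) + mse(post_mel, mel_target)
    return mel_terms + cwt_weight * mse(wavelet_pred, wavelet_target)

# Toy stand-ins for the two branches (hypothetical, for shape only).
def toy_post_net(mel):
    # Pretend the Post-Net removes part of the decoder's error.
    return [[v - 0.25 for v in row] for row in mel]

def toy_cwt_net(mel):
    # Pretend the CWT-Net outputs a grid with twice as many bins.
    return [[v for v in row for _ in range(2)] for row in mel]

mel_target = [[0.0, 1.0], [2.0, 3.0]]
wavelet_target = [[0.5, 0.5, 1.5, 1.5], [2.5, 2.5, 3.5, 4.5]]
decoder_mel = [[0.5, 1.5], [2.5, 3.5]]  # decoder output, off by 0.5 everywhere

post_mel, wavelet_pred = forward(decoder_mel, toy_post_net, toy_cwt_net)
total = enhancement_loss(decoder_mel, post_mel, wavelet_pred,
                         mel_target, wavelet_target, cwt_weight=0.5)
print(total)
```

Since the auxiliary term depends on the decoder output, its gradient would also flow back into the decoder in a real model. That is one plausible reading of how the extra task pushes the Mel spectrogram toward finer detail.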

A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

A multi-task learning speech synthesis optimization method based on CWT: a case study of Tacotron2

Autoregressive Speech Synthesis without Vector Quantization

Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation

Speech Enhancement Using Non-Negative Spectrogram Models With Mel-Generalized Cepstral Regularization

High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

Multi-Band Melgan: Faster Waveform Generation For High-Quality Text-To-Speech

MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis

A Speech Enhancement Algorithm Based on Computational Auditory Scene Analysis

Perceptually Weighted Mel-Cepstrum Analysis of Speech Based on Psychoacoustic Model

A Transformer-based Chinese Non-autoregressive Speech Synthesis Scheme

Cross-Attention-Guided Wavenet for Mel Spectrogram Reconstruction in the ICASSP 2024 Auditory EEG Challenge

Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

DelightfulTTS 2: End-to-End Speech Synthesis with Adversarial Vector-Quantized Auto-Encoders

dMel: Speech Tokenization made Simple

CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models

Neural Speech Synthesis with Transformer Network

R-MelNet: Reduced Mel-Spectral Modeling for Neural TTS

W2VC: WavLM representation based one-shot voice conversion with gradient reversal distillation and CTC supervision