Abstract: Text-to-speech (TTS) synthesis plays an essential role in human-computer interaction. Currently, the predominant approach in TTS acoustic models uses only the Mel spectrogram as the intermediate feature for converting text to speech. However, the resulting Mel spectrograms can be ambiguous in some respects, because the Fourier transform used to compute them has limited ability to capture abrupt changes in the signal. To improve the clarity of synthesized speech, this study proposes a multi-task learning optimization method and evaluates it on the Tacotron2 speech synthesis system. The method introduces an auxiliary task: predicting wavelet spectrograms. The continuous wavelet transform (CWT) is widely used in applications such as speech enhancement and speech recognition, primarily because it can adaptively vary its time-frequency resolution and performs well at capturing non-stationary signals. Through theoretical and experimental analysis, this study shows that the clarity of Tacotron2 synthesized speech can be improved by introducing the wavelet spectrogram as an auxiliary task: a feature extraction network is added, and wavelet-spectrogram features are extracted from the Mel spectrogram output by the decoder. Experiments show that the Mean Opinion Score (MOS) of speech synthesized by the multi-task model is 0.17 higher than that of the baseline model. Furthermore, by analyzing why the CWT-based multi-task learning method succeeds in the Tacotron2 model, and why multi-task learning is effective, the study conjectures that the proposed method could also improve the performance of other acoustic models.
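The auxiliary-task idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the Morlet wavelet, the mean-squared-error losses, the `aux_weight` value, and all function names are assumptions chosen for illustration, and the feature extraction network that would produce `wav_pred` from the decoder's Mel output is left out.

```python
import numpy as np

def morlet(scale, omega0=6.0):
    """Complex Morlet wavelet sampled on an integer grid for one scale (assumed wavelet)."""
    half = int(np.ceil(4 * scale))              # support of about +/- 4 scale units
    t = np.arange(-half, half + 1) / scale
    return np.exp(1j * omega0 * t) * np.exp(-0.5 * t ** 2) / np.sqrt(scale)

def cwt_magnitude(signal, scales):
    """Wavelet spectrogram |CWT| with shape (len(scales), len(signal)).

    Assumes the signal is longer than the widest wavelet, so that
    mode="same" keeps the output aligned with the input samples.
    """
    rows = [np.abs(np.convolve(signal, morlet(s), mode="same")) for s in scales]
    return np.stack(rows)

def multitask_loss(mel_pred, mel_target, wav_pred, wav_target, aux_weight=0.5):
    """Main Mel-spectrogram loss plus a weighted auxiliary wavelet-spectrogram loss.

    Both terms are plain MSE here; the loss form and weight are hypothetical.
    """
    mel_loss = np.mean((mel_pred - mel_target) ** 2)
    aux_loss = np.mean((wav_pred - wav_target) ** 2)
    return mel_loss + aux_weight * aux_loss

# Usage: a single click (an abrupt change) stays sharply localized at small scales.
click = np.zeros(512)
click[100] = 1.0
wavelet_target = cwt_magnitude(click, scales=[2, 4, 8])
```

In a training loop, `wavelet_target` would be computed from the reference audio, and the auxiliary term would only shape training; inference would still use the Mel spectrogram alone under this reading of the method.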

Perceptual Evaluation Weight Training for Text-to-Speech Synthesis

ViSPer: A Multilingual TTS Approach Based on VITS Using Deep Feature Loss

Perceptual Clustering Based Unit Selection Optimization for Concatenative Text-to-speech Synthesis

Improve Speech Enhancement Using Perception-High-Related Time-Frequency Loss

CARD Based Context Specified Weights Training Algorism for Unit Selection in Speech Synthesis

Improved unit selection speech synthesis method utilizing subjective evaluation results on synthetic speech

Optimization Method for Unit Selection Speech Synthesis Based on Synthesis Quality Predictions

Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models

Automatic Conversion from Lexical Words to Prosodic Words for Mandarin Text-to-speech System

Building HMM based unit-selection speech synthesis system using synthetic speech naturalness evaluation score

A novel unit selection method for concatenation speech system using similarity measure

Context features based pre-selection and weight prediction in concatenation speech synthesis system

Objective Evaluation Methods for Chinese Text-To-Speech Systems

HMM-based Unit Selection Speech Synthesis Using Log Likelihood Ratios Derived from Perceptual Data

A multi-task learning speech synthesis optimization method based on CWT: a case study of Tacotron2

NaturalSpeech: End-to-End Text-to-Speech Synthesis with Human-Level Quality

Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

Trainable Unit Selection Speech Synthesis under Statistical Framework

DIA-TTS: Deep-Inherited Attention-Based Text-to-Speech Synthesizer

Unit Selection Speech Synthesis Integrating Automatic Error Detection
