Boosting Multi-Speaker Expressive Speech Synthesis with Semi-supervised Contrastive Learning

Xinfa Zhu, Yuke Li, Yi Lei, Ning Jiang, Guoqing Zhao, Lei Xie
2024-04-25
Abstract: This paper aims to build a multi-speaker expressive TTS system, synthesizing a target speaker's speech with multiple styles and emotions. To this end, we propose a novel contrastive learning-based TTS approach to transfer style and emotion across speakers. Specifically, contrastive learning from different levels, i.e., utterance and category level, is leveraged to extract the disentangled style, emotion, and speaker representations from speech for style and emotion transfer. Furthermore, a semi-supervised training strategy is introduced to improve the data utilization efficiency by involving multi-domain data, including style-labeled data, emotion-labeled data, and abundant unlabeled data. To achieve expressive speech with diverse styles and emotions for a target speaker, the learned disentangled representations are integrated into an improved VITS model. Experiments on multi-domain data demonstrate the effectiveness of the proposed method.
Audio and Speech Processing, Sound
### What problems does this paper attempt to solve?

This paper aims to build a multi-speaker expressive text-to-speech (TTS) system that can synthesize speech with multiple styles and emotions for a target speaker. Specifically, it proposes a new method based on contrastive learning to achieve cross-speaker style and emotion transfer.

#### Main problems:

1. **Multi-speaker, multi-style, and multi-emotion speech synthesis**: Current TTS systems that generate speech with specific styles and emotions can often handle only a single speaker or a limited set of emotions and styles. This paper aims to extend synthesis to multiple speakers while flexibly generating speech with different styles and emotions.
2. **Decoupling speaker timbre, style, and emotion**: In speech, speaker timbre, style, and emotion are often intertwined, which makes it difficult to control these attributes separately. For example, changing the emotion of an utterance may inadvertently change the speaker's timbre. A method is therefore needed to decouple these attributes effectively.
3. **Using multi-domain data to improve data utilization efficiency**: Training a TTS system that can handle multiple styles and emotions requires a large amount of diverse data. However, data labeled with styles and emotions is usually scarce, while unlabeled data is abundant. Making effective use of this unlabeled data is a further challenge.

#### Solutions:

- **Contrastive learning**: Positive and negative sample pairs are constructed at the utterance level and the category level to extract decoupled style, emotion, and speaker representations. The contrastive objective pulls together representations that share an attribute and pushes apart those that do not, which keeps the three kinds of representation distinct.
- **Semi-supervised training strategy**: A semi-supervised training strategy draws on multi-domain data, including style-labeled data, emotion-labeled data, and a large amount of unlabeled data, to improve data utilization efficiency. This makes use of the abundant unlabeled data and improves the robustness and generalization ability of the model.
- **Improved VITS model**: The learned decoupled representations are integrated into an improved VITS model to achieve high-quality multi-speaker, multi-style, and multi-emotion speech synthesis.

With these methods, the paper shows that the proposed framework can generate diverse expressive speech in multi-language and multi-speaker scenarios, even when the training data contains no recordings of the target speaker in the requested style or emotion.

### Summary:

This paper mainly addresses style and emotion transfer in multi-speaker expressive speech synthesis. Through contrastive learning and a semi-supervised training strategy, it decouples style, emotion, and speaker timbre, thereby improving the quality and diversity of synthesized speech.
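
To make the idea of category-level contrastive learning more concrete, the following is a minimal NumPy sketch of a generic supervised contrastive loss, in which embeddings sharing a label (for example, the same emotion category) are treated as positives and all other embeddings in the batch as negatives. It is an illustration under stated assumptions, not the loss used in the paper: the function name `category_contrastive_loss`, the `temperature` value, and the toy embeddings are all hypothetical, and the paper's actual pair construction and encoders may differ.

```python
import numpy as np


def category_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss over one batch (illustrative only).

    embeddings: (N, D) float array, one vector per utterance.
    labels:     (N,) int array of category ids (e.g. emotion classes).
    Utterances with the same label are positives; all others are negatives.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Cosine similarity between all pairs, scaled by the temperature.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature

    # An utterance must not be contrasted with itself.
    self_mask = np.eye(len(labels), dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)

    # Row-wise log-softmax, computed stably by subtracting the row maximum.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(
        np.exp(sim - row_max).sum(axis=1, keepdims=True)
    )

    # Positives: same label, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # skip anchors that have no positive in the batch

    # Average log-probability assigned to the positives of each anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())


if __name__ == "__main__":
    # Toy batch: two tight clusters in a 2-D embedding space.
    toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    matched = category_contrastive_loss(toy, np.array([0, 0, 1, 1]))
    mismatched = category_contrastive_loss(toy, np.array([0, 1, 0, 1]))
    print(f"labels match clusters:    {matched:.4f}")
    print(f"labels cut across clusters: {mismatched:.4f}")
```

When the labels agree with the cluster structure of the embeddings the loss is small, and when they cut across it the loss is large. Minimizing such a loss therefore pushes an encoder to group utterances by the labeled attribute, which is the general mechanism behind extracting a representation that tracks emotion or style rather than speaker identity.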