Jinchuan Tian, Chunlei Zhang, Jiatong Shi, Hao Zhang, Jianwei Yu, Shinji Watanabe, Dong Yu
Abstract: Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based systems offer competitive performance to their counterparts. Further optimization can be achieved through preference alignment algorithms, which adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content. This study presents a thorough empirical evaluation of how preference alignment algorithms, particularly Direct Preference Optimization (DPO), enhance LM-based TTS. With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores, with the latter two metrics surpassing even human speech in certain evaluations. We also show preference alignment is applicable to low-resource scenarios and effectively generalized to out-of-domain applications.
What problem does this paper attempt to address?
The paper aims to improve the generation quality of language model (LM)-based text-to-speech (TTS) systems so that their output better matches human perception and preference. Specifically, the authors apply preference alignment (PA) algorithms, in particular Direct Preference Optimization (DPO), and empirically evaluate how much they improve an LM-based TTS system across multiple metrics.
### Main Problems and Solutions
1. **Problem Description**:
- Existing LM-based TTS systems can already generate high-quality speech, but their output may not fully match human subjective preferences.
- The conventional cross-entropy loss maximizes the posterior probability of the target sequence, but a higher likelihood does not necessarily mean the generated speech sounds more natural or is preferred by human listeners.
2. **Solution**:
- Apply preference alignment (PA) algorithms, especially DPO, to adjust the language model so that its output better matches the preferences expressed by reward models and, by extension, human listeners.
- Optimizing the language model with DPO improves the generated speech on multiple evaluation metrics: intelligibility, speaker similarity, and proxy subjective evaluation scores.
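
To make the contrast with cross-entropy concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair, as commonly formulated in the DPO literature. It is not taken from the paper's code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be sequence-level log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") speech-token sequence under the model being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All arguments are sequence-level log-probabilities (floats).
    `beta` scales how strongly the policy is pushed away from the
    reference model; 0.1 is an illustrative value, not the paper's.
    """
    # How much more (or less) likely each sequence is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Implicit reward margin between the chosen and rejected sequences.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage: the policy already slightly prefers the chosen sequence,
# so the loss falls below the neutral value log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
```

Unlike cross-entropy, which only raises the likelihood of a target sequence, this loss depends on the relative likelihood of two candidate outputs, so it directly encodes which of the two is preferred.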
### Experimental Results
- **Performance Improvement**: The authors show that DPO is effective across different data volumes and settings. For example, even with only 1 hour of data, DPO significantly improves the performance of the TTS system.
- **Surpassing Human Speech**: In certain evaluations, the DPO-optimized TTS system scores higher than real human speech on speaker similarity and proxy subjective evaluation scores.
- **Generalization Ability**: DPO is effective not only on in-domain data but also yields consistent improvements on out-of-domain data (such as VCTK).
### Summary
The core objective of this paper is to improve LM-based TTS through preference alignment, especially DPO, so that the generated speech better matches human subjective preferences. The aligned model improves consistently on intelligibility, speaker similarity, and proxy subjective evaluation scores, and in certain evaluations it matches or exceeds human speech on the latter two.