Abstract: Recent advancements in text-to-speech (TTS) have shown that language model (LM)-based systems offer competitive performance to their counterparts. Further optimization can be achieved through preference alignment algorithms, which adjust LMs to align with the preferences of reward models, enhancing the desirability of the generated content. This study presents a thorough empirical evaluation of how preference alignment algorithms, particularly Direct Preference Optimization (DPO), enhance LM-based TTS. With a 1.15B parameter LM-based TTS model, we demonstrate that preference alignment consistently improves intelligibility, speaker similarity, and proxy subjective evaluation scores, with the latter two metrics surpassing even human speech in certain evaluations. We also show preference alignment is applicable to low-resource scenarios and effectively generalized to out-of-domain applications.

What problem does this paper attempt to address?

The paper aims to improve the generation quality of language-model-based (LM-based) text-to-speech (TTS) systems so that their output better matches human perception and preference. To do so, the authors apply preference alignment (PA) algorithms, in particular Direct Preference Optimization (DPO), and evaluate their effect on LM-based TTS across multiple metrics.

### Main Problems and Solutions

1. **Problem Description**:
   - Existing LM-based TTS systems can already generate high-quality speech, but their output may not fully conform to human subjective preferences.
   - The traditional cross-entropy loss maximizes the posterior probability of the target sequence, but a higher likelihood does not necessarily mean the generated speech sounds more natural or is preferred by human listeners.
2. **Solution**:
   - Apply preference alignment (PA) algorithms, especially DPO, to adjust the language model so that its output is more in line with human preferences.
   - Optimizing the language model with DPO improves the generated speech on multiple evaluation metrics: intelligibility, speaker similarity, and proxy subjective evaluation scores.

### Experimental Results

- **Performance Improvement**: The authors demonstrate that DPO is effective under different data volumes and settings. For example, even with only 1 hour of data, DPO can significantly improve the performance of the TTS system.
- **Surpassing Human-Level**: In some evaluations, the DPO-optimized TTS system even surpasses real human speech in speaker similarity and proxy subjective scores.
- **Generalization Ability**: DPO is effective not only on in-domain data but also shows consistent improvement on out-of-domain data (such as VCTK).

### Summary

The core objective of this paper is to improve LM-based TTS through preference alignment, especially DPO, so that the generated speech is more in line with human subjective preferences, improving on the baseline across multiple key metrics and, in certain evaluations, even exceeding human speech.
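To make the contrast with cross-entropy training concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. It assumes that sequence-level log-probabilities of the preferred ("chosen") and dispreferred ("rejected") outputs are already available from both the model being trained and a frozen reference model. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from the paper, and the paper's actual training setup may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All inputs are sequence-level log-probabilities; beta is a
    hypothetical default that scales the implicit reward.
    """
    # How much more (or less) likely each output is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)) = log(1 + exp(-margin)),
    # evaluated in a numerically stable way for either sign of margin.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen output: loss drops.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(baseline, improved)
```

The loss decreases only when the policy raises the likelihood of the chosen output relative to the rejected one, measured against the reference model. Cross-entropy alone provides no such signal, since it rewards likelihood of the target sequence without comparing alternatives.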

Preference Alignment Improves Language Model-Based TTS

SpeechAlign: Aligning Speech Generation to Human Preferences

Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

Direct Preference Optimization Using Sparse Feature-Level Constraints

Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization

A Deep Dive into the Trade-Offs of Parameter-Efficient Preference Alignment Techniques

Mixed Preference Optimization: Reinforcement Learning with Data Selection and Better Reference Model

Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Statistical Rejection Sampling Improves Preference Optimization

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback

ULMA: Unified Language Model Alignment with Human Demonstration and Point-wise Preference

One TTS Alignment To Rule Them All

Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment

Token-level Direct Preference Optimization

Preference Ranking Optimization for Human Alignment

Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment

Modality-Fair Preference Optimization for Trustworthy MLLM Alignment

Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models

Adversarial Preference Optimization: Enhancing Your Alignment via RM-LLM Game

Parameter-Efficient Tuning Helps Language Model Alignment