Abstract: Recent advancements in generative models have sparked significant interest within the machine learning community. In particular, diffusion models have demonstrated remarkable capabilities in synthesizing images and speech. Studies such as those by Lee et al. (2023), Black et al. (2023), Wang et al. (2023), and Fan et al. (2024) illustrate that Reinforcement Learning with Human Feedback (RLHF) can enhance diffusion models for image synthesis. However, due to architectural differences between these models and those employed in speech synthesis, it remains uncertain whether RLHF could similarly benefit speech synthesis models. In this paper, we explore the practical application of RLHF to diffusion-based text-to-speech synthesis, leveraging the mean opinion score (MOS) predicted by the UTokyo-SaruLab MOS prediction system (Saeki et al., 2022) as a proxy loss. We introduce diffusion model loss-guided RL policy optimization (DLPO) and compare it against other RLHF approaches, employing the NISQA speech quality and naturalness assessment model (Mittag et al., 2021) and human preference experiments for further evaluation. Our results show that RLHF can enhance diffusion-based text-to-speech synthesis models and, moreover, that DLPO is better at improving diffusion models so that they generate natural, high-quality speech audio.

What problem does this paper attempt to address?

The main problem this paper attempts to solve is: **how to improve the naturalness and sound quality of diffusion-based text-to-speech (TTS) synthesis systems through reinforcement learning (RL) techniques**. Specifically, the authors explore applying reinforcement learning with human feedback (RLHF) to TTS diffusion models and propose a new method, Diffusion Model Loss-Guided Policy Optimization (DLPO), to improve the quality of the generated speech.

### Research Background

In recent years, generative models such as diffusion models have shown remarkable capabilities in image and speech synthesis. Although previous studies have shown that RLHF can enhance diffusion-based image synthesis, architectural differences make it unclear whether these techniques can also improve TTS models. In addition, TTS models process time-domain data and face temporal-resolution challenges that differ from those in image synthesis, so directly applying existing RLHF methods may not be effective.

### Main Problems

1. **Can RLHF be applied to TTS diffusion models?** The authors explore using RLHF to fine-tune a diffusion-based TTS system, in particular one in which a long diffusion chain acts directly on waveform data.
2. **How can performance degradation caused by over-optimization be prevented?** Basic policy gradient methods (such as RWR and DDPO) may drift far from the initial model, resulting in a decline in sound quality. Methods such as DPOK and KLinR were therefore proposed to control this drift by using KL divergence to limit the magnitude of model updates.
3. **Is there an RL method better suited to TTS diffusion models?** The paper introduces DLPO, which uses the diffusion model's own loss function as part of the reward. This keeps the fine-tuned model from deviating too far from the original while improving the naturalness and quality of the generated speech.

### Experimental Results

Through a series of experiments, the authors compare several RL methods (RWR, DDPO, DPOK, KLinR, and DLPO) on the WaveGrad2 TTS model. The results show that:

- **DLPO** outperforms the other methods, increasing the UTMOS score from 3.0 to 3.68 and the NISQA score from 3.85 to 4.12.
- **DPOK** and **KLinR** also yield some improvement, but less than DLPO.
- **RWR** and **DDPO** fail to improve sound quality and instead degrade it.

### Conclusion

This work applies reinforcement learning to improve the speech quality of a TTS diffusion model for the first time and proposes the DLPO method. DLPO improves the naturalness and sound quality of the generated speech while avoiding the over-optimization problem that other methods may introduce. This offers new ideas and technical paths for future TTS research.
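The idea attributed to DLPO above, using the diffusion model loss as part of the reward, can be illustrated with a small sketch. This is not the paper's implementation: the function names `shaped_reward` and `policy_gradient_loss`, the weights `alpha` and `beta`, and the exact form of the combination (predicted MOS minus a weighted noise-prediction error) are assumptions made here for illustration only.

```python
import numpy as np

def shaped_reward(mos_reward, eps_true, eps_pred, alpha=1.0, beta=1.0):
    """Hypothetical DLPO-style reward: predicted MOS minus a diffusion-loss penalty.

    mos_reward: shape (batch,), e.g. scores from a MOS predictor.
    eps_true, eps_pred: shape (batch, dim), true and predicted noise.
    alpha, beta: illustrative weights (assumptions, not values from the paper).
    """
    # Per-sample mean squared error between true and predicted noise,
    # i.e. the usual denoising (diffusion) training loss.
    diffusion_loss = np.mean((eps_true - eps_pred) ** 2, axis=-1)
    return alpha * mos_reward - beta * diffusion_loss

def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style surrogate: minimise the negative reward-weighted log-probability.

    Rewards are treated as constants with respect to the model parameters.
    """
    return -np.mean(rewards * log_probs)

# Two samples with the same predicted MOS; the second has a larger denoising error.
mos = np.array([3.0, 3.0])
eps_true = np.zeros((2, 4))
eps_pred = np.array([[0.0, 0.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0, 1.0]])
rewards = shaped_reward(mos, eps_true, eps_pred)
loss = policy_gradient_loss(np.array([-1.0, -2.0]), rewards)
```

In this toy example the second sample receives a lower shaped reward even though its predicted MOS is identical, which is the mechanism by which a loss-based penalty would discourage drifting away from the pretrained denoiser.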

DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

Diffusion Model Alignment Using Direct Preference Optimization

Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion

DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech

Training Diffusion Models with Reinforcement Learning

DiffVoice: Text-to-Speech with Latent Diffusion

DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

ResGrad: Residual Denoising Diffusion Probabilistic Models for Text to Speech

Scaling Diffusion Language Models via Adaptation from Autoregressive Models

Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding

Boosting Diffusion Model for Spectrogram Up-sampling in Text-to-speech: An Empirical Study

LatentSpeech: Latent Diffusion for Text-To-Speech Generation

A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI

Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization

Reward Fine-Tuning Two-Step Diffusion Models via Learning Differentiable Latent-Space Surrogate Reward

FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis