RL for Consistency Models: Faster Reward Guided Text-to-Image Generation

Owen Oertell, Jonathan D. Chang, Yiyi Zhang, Kianté Brantley, Wen Sun
2024-06-22
Abstract: Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction-following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task-specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. Compared to RL fine-tuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high-quality images with as few as two inference steps. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Our code is available at https://rlcm.owenoertell.com.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper aims to address two main issues:

1. **Difficulty in generating high-quality images for specific tasks**:
   - In text-to-image generation, diffusion models can produce realistic images, but downstream task goals normally have to be specified through prompts.
   - Many of these goals, such as image compressibility or aesthetic quality, are difficult to express through simple prompts, so prompting alone often fails to produce the images a task requires.
2. **Slow generation speed**:
   - Diffusion models generate images through many step-by-step denoising iterations, which makes generation slow. This increases inference time and also makes iterative prompt tuning computationally intensive.
   - Slow generation and training are obstacles in practice, especially in scenarios requiring quick responses or large-scale generation.

To overcome these issues, the authors propose a new framework, Reinforcement Learning for Consistency Models (RLCM). The framework models the multi-step inference process of a consistency model as a Markov Decision Process (MDP) and uses reinforcement learning to optimize a task-specific reward function. Compared to existing methods based on diffusion models, RLCM has the following advantages:

- **Faster training and inference**: Because the inference process of a consistency model is shorter, RLCM completes training and generates high-quality images in less time.
- **Higher generation quality**: RLCM performs better across multiple tasks (image compressibility, incompressibility, prompt-image alignment, and aesthetic scoring), as measured by the corresponding reward objectives.
- **Better generalization ability**: RLCM maintains good generation quality on unseen prompts.

In summary, RLCM provides a faster and higher-quality solution for text-to-image generation tasks by combining the efficient inference capabilities of consistency models with the optimization capabilities of reinforcement learning.
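
To make the MDP framing described above more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a two-dimensional "image", a hypothetical stand-in `policy_mean` for the consistency function, a made-up noise schedule `TAUS`, a toy reward that prefers samples near `TARGET`, and a plain REINFORCE update with a batch-mean baseline. As a further simplifying assumption, the last noise level is kept above `EPS` so that every step of the policy stays stochastic and the reward is computed on the final sample.

```python
import numpy as np

EPS = 0.002                              # minimum noise level (assumed value)
TAUS = np.array([2.0, 1.0, 0.5, 0.1])    # hypothetical decreasing noise schedule
W = 0.2                                  # hypothetical weight on the current state
TARGET = np.array([1.0, -1.0])           # toy "ideal image" used by the toy reward


def policy_mean(theta, x):
    """Toy stand-in for a consistency function: maps a noisy state to a clean guess."""
    return (1.0 - W) * theta + W * x


def reward(x):
    """Toy task reward: higher when the final sample is close to TARGET."""
    return -np.sum((x - TARGET) ** 2, axis=-1)


def rollout(theta, batch, rng):
    """Run one short episode per batch element.

    State: the current noisy sample. Action: the next noisy sample.
    Returns the terminal rewards and the summed score functions.
    """
    x = TAUS[0] * rng.standard_normal((batch, theta.size))  # initial state: noise
    score = np.zeros_like(x)                                # sum of grad log pi
    for tau_next in TAUS[1:]:
        sigma = np.sqrt(tau_next**2 - EPS**2)               # policy std at this step
        z = rng.standard_normal(x.shape)
        x = policy_mean(theta, x) + sigma * z               # Gaussian policy sample
        score += (1.0 - W) * z / sigma                      # grad_theta of its log-prob
    return reward(x), score


def train(steps=200, batch=256, lr=0.05, seed=0):
    """REINFORCE: ascend the estimated gradient of the expected terminal reward."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    history = []
    for _ in range(steps):
        r, score = rollout(theta, batch, rng)
        advantage = r - r.mean()                            # batch-mean baseline
        theta += lr * (advantage[:, None] * score).mean(axis=0)
        history.append(r.mean())
    return theta, history


theta, history = train()
print("mean reward: first iteration %.3f, last iteration %.3f"
      % (history[0], history[-1]))
```

Each episode here has only three policy steps, which mirrors the point made above: a short sampling chain means short RL trajectories, so both rollouts and updates are cheap. A real system would replace the toy mean with a text-conditioned consistency model and the toy reward with a task reward such as an aesthetic or compressibility score.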