Abstract: Reinforcement learning (RL) has improved guided image generation with diffusion models by directly optimizing rewards that capture image quality, aesthetics, and instruction-following capabilities. However, the resulting generative policies inherit the same iterative sampling process of diffusion models that causes slow generation. To overcome this limitation, consistency models proposed learning a new class of generative models that directly map noise to data, resulting in a model that can generate an image in as few as one sampling iteration. In this work, to optimize text-to-image generative models for task-specific rewards and enable fast training and inference, we propose a framework for fine-tuning consistency models via RL. Our framework, called Reinforcement Learning for Consistency Model (RLCM), frames the iterative inference process of a consistency model as an RL procedure. Compared to RL fine-tuned diffusion models, RLCM trains significantly faster, improves the quality of the generation measured under the reward objectives, and speeds up the inference procedure by generating high-quality images with as few as two inference steps. Experimentally, we show that RLCM can adapt text-to-image consistency models to objectives that are challenging to express with prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Our code is available at https://rlcm.owenoertell.com.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address two main issues:

1. **Difficulty in generating high-quality images**:
   - In text-to-image generation tasks, diffusion models can generate realistic images, but their generation process relies on step-by-step denoising, making it difficult to generate high-quality images, especially when downstream task goals need to be specified through prompts.
   - Diffusion models face challenges in generating images required for specific tasks because the goals of these tasks are difficult to express through simple prompts.
2. **Slow generation speed**:
   - The generation process of diffusion models requires multiple iterations, leading to slow generation speed. This not only increases inference time but also makes the iterative process of prompt tuning computationally intensive.
   - Slow generation and training times pose obstacles to practical applications, especially in scenarios requiring quick responses or large-scale generation.

To overcome these issues, the authors propose a new framework, Reinforcement Learning for Consistency Models (RLCM). This framework models the multi-step inference process of consistency models as a Markov Decision Process (MDP) and uses reinforcement learning methods to optimize the reward function for specific tasks. Compared to existing diffusion model-based methods, RLCM has the following advantages:

- **Faster training and inference speed**: Due to the shorter inference process of consistency models, RLCM can complete training and generate high-quality images in less time.
- **Higher generation quality**: RLCM demonstrates better performance across multiple tasks (such as image compressibility, incompressibility, prompt-image alignment, and aesthetic scoring), generating higher quality images.
- **Better generalization ability**: RLCM maintains good generation quality on unseen prompts, showing strong generalization capabilities.

In summary, RLCM provides a faster and higher-quality solution for text-to-image generation tasks by combining the efficient inference capabilities of consistency models with the optimization capabilities of reinforcement learning.
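The idea of treating multi-step consistency-model inference as an MDP can be illustrated with a small sketch. This is not the authors' implementation: it is a toy 1-D example under stated assumptions. The "model" is a single scalar `theta`, the noise schedule (`INIT_SIGMA`, `SIGMAS`), the `SKIP` weight, and the quadratic `reward` are all hypothetical, and plain REINFORCE with a mean-reward baseline stands in for whichever RL algorithm the paper actually uses. Each inference step is one MDP transition: the state is the current noisy sample, the action is the next, less noisy sample, and the reward arrives only at the end of the short episode.

```python
import random

# All constants below are hypothetical choices for this toy example.
INIT_SIGMA = 2.0            # std of the initial pure-noise sample
SIGMAS = [1.0, 0.5, 0.2]    # noise added after each of the three inference steps
SKIP = 0.1                  # fixed weight on the current noisy sample
TARGET = 2.0                # the toy reward prefers final samples near this value


def denoise(theta, x):
    """Toy stand-in for a consistency function f_theta(x_t, t)."""
    return theta + SKIP * x


def reward(x_final):
    """Toy terminal reward; a real reward would score a generated image."""
    return -(x_final - TARGET) ** 2


def rollout(theta, rng):
    """Run one episode; return the final sample and the summed score function."""
    x = rng.gauss(0.0, INIT_SIGMA)          # initial state: pure noise
    score = 0.0
    for sigma in SIGMAS:
        mean = denoise(theta, x)            # policy mean for this step
        action = rng.gauss(mean, sigma)     # action: next, less noisy sample
        # d/dtheta of log N(action; mean, sigma^2), since d(mean)/d(theta) = 1
        score += (action - mean) / sigma ** 2
        x = action                          # the action becomes the next state
    return x, score


def train(iterations=300, batch_size=64, lr=0.05, seed=0):
    """REINFORCE with a batch-mean baseline (a simple stand-in RL method)."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(iterations):
        batch = [rollout(theta, rng) for _ in range(batch_size)]
        rewards = [reward(x) for x, _ in batch]
        baseline = sum(rewards) / batch_size
        grad = sum((r - baseline) * s
                   for r, (_, s) in zip(rewards, batch)) / batch_size
        theta += lr * grad                  # gradient ascent on expected reward
    return theta


if __name__ == "__main__":
    theta = train()
    rng = random.Random(1)
    samples = [rollout(theta, rng)[0] for _ in range(2000)]
    print("theta:", theta, "mean sample:", sum(samples) / len(samples))
```

Because the episode has only as many transitions as there are inference steps, each rollout is short, which is the intuition behind the faster training and inference described above. A real setup would replace `denoise` with a text-conditioned consistency model and `reward` with an image-level score such as compressibility or an aesthetic predictor.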

RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
