Abstract: Consistency Models (CMs) have made significant progress in accelerating the generation of diffusion models. However, their application to high-resolution, text-conditioned image generation in the latent space remains unsatisfactory. In this paper, we identify three key flaws in the current design of Latent Consistency Models (LCMs). We investigate the reasons behind these limitations and propose Phased Consistency Models (PCMs), which generalize the design space and address the identified limitations. Our evaluations demonstrate that PCMs outperform LCMs across 1-16 step generation settings. While PCMs are specifically designed for multi-step refinement, they achieve 1-step generation results comparable to previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that the methodology of PCMs is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. Our code is available at https://github.com/G-U-N/Phased-Consistency-Model.

What problem does this paper attempt to address?

This paper addresses the poor performance of existing Latent Consistency Models (LCMs) in high-resolution, text-conditioned image generation. The authors identify three key flaws in the design of LCMs and propose Phased Consistency Models (PCMs) to address them:

1. **Consistency**: LCMs can only use a purely stochastic multi-step sampling algorithm, so samples generated with different numbers of inference steps are inconsistent. For example, the same seed can produce different results at different step counts.
2. **Controllability**: LCMs have low sensitivity to classifier-free guidance (CFG) and accept only small CFG values (less than 2); larger values lead to exposure problems. LCMs are also insensitive to negative prompts and cannot reliably avoid generating specific content. In the paper's figure, a black dog is still generated even when "black dog" is set as the negative prompt.
3. **Efficiency**: In the few-step setting (for example, fewer than 4 inference steps), the samples generated by LCMs are of poor quality. The authors attribute this to the traditional L2 or Huber loss providing insufficiently fine-grained supervision.

To overcome these problems, PCMs divide the ODE trajectory into multiple sub-trajectories and enforce the self-consistency property on each sub-trajectory, which yields more stable and efficient generation. PCMs improve multi-step inference and also achieve one-step generation results comparable to those of methods designed specifically for one-step generation. The PCM methodology also applies to video generation and can be used to train state-of-the-art few-step text-to-video generators.
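The idea of splitting the ODE trajectory into sub-trajectories can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy ODE, dx/dt = x, whose exact solution map stands in for a trained network, and it assumes evenly spaced phase edges. The names `phase_edges`, `toy_phase_map`, and `phased_sample` are hypothetical. The sketch shows only that, when each phase maps a point exactly to the end of its own sub-trajectory, deterministic sampling gives the same result for any number of phases.

```python
import math
from typing import List


def phase_edges(t_max: float, num_phases: int) -> List[float]:
    """Edge times from t_max down to 0 (evenly spaced: an illustrative assumption)."""
    return [t_max * (num_phases - k) / num_phases for k in range(num_phases + 1)]


def toy_phase_map(x_t: float, t: float, s: float) -> float:
    """Exact solution map of the toy ODE dx/dt = x from time t to time s <= t.

    In a real model this role would be played by a trained network; the
    closed form here is only a stand-in for illustration.
    """
    return x_t * math.exp(s - t)


def phased_sample(x_start: float, t_max: float, num_phases: int) -> float:
    """Deterministic multi-step sampling: one jump per sub-trajectory."""
    edges = phase_edges(t_max, num_phases)
    x = x_start
    for t, s in zip(edges[:-1], edges[1:]):
        # Jump from the start of this phase to its end point, with no
        # fresh noise injected between steps.
        x = toy_phase_map(x, t, s)
    return x


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(n, phased_sample(2.0, 1.0, n))
```

Because no noise is re-injected between phases in this sketch, the output depends only on the starting point, which is the kind of consistency across step counts that the first flaw above says purely stochastic sampling lacks.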
### Summary

The main contributions of this paper are:

- **Analyzing and solving three key flaws of LCMs**: consistency, controllability, and efficiency.
- **Proposing Phased Consistency Models (PCMs)**: improving the performance of the generation model by processing the ODE trajectory in phases.
- **Verifying the superiority of PCMs in image and video generation tasks**: especially in multi-step inference and few-step generation.

Through these improvements, PCMs can better handle high-resolution, text-conditioned image and video generation tasks, providing higher generation quality and better controllability.

Phased Consistency Models
