Abstract: Diffusion models have demonstrated strong performance in sampling and editing multi-modal data with high generation quality, yet they suffer from the iterative generation process which is computationally expensive and slow. In addition, most methods are constrained to generate data from Gaussian noise, which limits their sampling and editing flexibility. To overcome both disadvantages, we present Contrastive Optimal Transport Flow (COT Flow), a new method that achieves fast and high-quality generation with improved zero-shot editing flexibility compared to previous diffusion models. Benefiting from optimal transport (OT), our method has no limitation on the prior distribution, enabling unpaired image-to-image (I2I) translation and doubling the editable space (at both the start and end of the trajectory) compared to other zero-shot editing methods. In terms of quality, COT Flow can generate competitive results in merely one step compared to previous state-of-the-art unpaired I2I translation methods. To highlight the advantages of COT Flow through the introduction of OT, we introduce the COT Editor to perform user-guided editing with excellent flexibility and quality. The code will be released at https://github.com/zuxinrui/cot_flow.

What problem does this paper attempt to address?

The paper targets two major limitations of existing diffusion models in image generation and editing tasks:

1. **Low sampling efficiency**: Diffusion models usually require an iterative generation process, which is computationally expensive and slow.
2. **Limitations of the prior distribution**: Most methods can only generate data from Gaussian noise, which restricts the flexibility of their sampling and editing.

To solve these problems, the authors propose a new method named Contrastive Optimal Transport Flow (COT Flow). COT Flow overcomes these shortcomings by introducing optimal transport (OT) theory:

- **Improving sampling efficiency**: COT Flow can generate high-quality images in one step or a few steps, greatly reducing the computational cost compared with traditional diffusion models.
- **Enhancing editing flexibility**: COT Flow supports unpaired image-to-image translation and also allows flexible zero-shot image editing, that is, images can be edited without additional training.

COT Flow further extends its editing ability through the COT Editor, whose scenarios include but are not limited to:

- **COT Composition**: Users can combine elements to synthesize realistic images.
- **Shape-Texture Coupling**: Users can draw shapes and textures separately as dual inputs to generate high-quality images that integrate the characteristics of both.
- **COT Augmentation**: It supports application scenarios such as medical image synthesis.

In short, COT Flow aims to solve the inefficiency and lack of flexibility of existing diffusion models in generation and editing tasks, while providing a new framework for fast, high-quality image generation and flexible zero-shot editing.
### Summary of Mathematical Formulas

Several key formulas involved in the paper are as follows:

1. **Optimal transport cost**:
   \[ \text{Cost}(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} c(x, y) \, d\pi(x, y) \]
   where $\Pi(\mu, \nu)$ is the set of joint probability distributions on $X \times Y$ with marginals $\mu$ and $\nu$, and $c(x, y)$ is the cost function.
2. **Contrastive learning loss function**:
   \[ L(\theta, \theta^-) := d(q_\theta(E_\theta(x)), E_{\theta^-}(x^+)) \]
   where $E$ is the encoder network, $\theta^-$ is the exponential moving average (EMA) of the parameter $\theta$, and $d(\cdot, \cdot)$ is the distance function.
3. **Consistency model loss function**:
   \[ L_N(\theta, \theta^-) := \mathbb{E}\left[\lambda(t_i) \, d(f_\theta(x_{t_{i + 1}}, t_{i + 1}), f_{\theta^-}(x_{t_i}, t_i))\right], \quad i \sim U[1, N - 1] \]
4. **COT Pairs loss function**:
   \[ L_{\text{COT}}(\theta) = d(E_\theta(x_{t_1}, t_1), E_\theta(x_{t_2}, t_2)), \quad 0 \leq t_1 < t_2 \leq 1 \]

These formulas show how COT Flow combines the core ideas of optimal transport, contrastive learning, and consistency models to achieve efficient, high-quality image generation.
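To make the optimal transport cost in formula 1 concrete, the following is a minimal sketch for the simplest discrete case. It assumes two point clouds of equal size with uniform mass and a squared Euclidean cost; in that setting an optimal coupling can be taken to be a permutation, so a brute-force search over permutations recovers the cost. The names `squared_distance` and `discrete_ot_cost` and the toy points are illustrative assumptions, not taken from the paper or its code.

```python
from itertools import permutations


def squared_distance(x, y):
    """Cost c(x, y): squared Euclidean distance (an assumed choice of cost)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def discrete_ot_cost(source, target, cost=squared_distance):
    """Brute-force OT cost between two equal-size, uniformly weighted point sets.

    Each of the n source and n target points carries mass 1/n, so an optimal
    coupling can be chosen as a permutation. All n! permutations are tried,
    which is only practical for very small n.
    """
    n = len(source)
    if n == 0 or n != len(target):
        raise ValueError("this sketch assumes two non-empty sets of equal size")
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average cost of matching source[i] to target[perm[i]].
        total = sum(cost(source[i], target[perm[i]]) for i in range(n)) / n
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, best_perm


# Toy data: the target is a shuffled, slightly shifted copy of the source.
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target = [(1.0, 0.1), (0.0, 1.1), (0.0, 0.1)]
ot_cost, ot_perm = discrete_ot_cost(source, target)
print(ot_perm, ot_cost)
```

Here `ot_perm[i]` is the index of the target point matched to `source[i]`. For realistic batch sizes the factorial search would be replaced by a dedicated assignment or OT solver; the brute-force version is used only to keep the sketch dependency-free.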

COT Flow: Learning Optimal-Transport Image Sampling and Editing by Contrastive Pairs

FlowEdit: Inversion-Free Text-Based Editing Using Pre-Trained Flow Models

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

CoT-AMFlow: Adaptive Modulation Network with Co-Teaching Strategy for Unsupervised Optical Flow Estimation

FlowTurbo: Towards Real-time Flow-Based Image Generation with Velocity Refiner

FlowIE: Efficient Image Enhancement via Rectified Flow

Stable Flow: Vital Layers for Training-Free Image Editing

FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models

StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences

InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

Optical Flow as Spatial-Temporal Attention Learners

TransFlow: Transformer as Flow Learner

CLIP-FLOW: CONTRASTIVE LEARNING WITH ITERATIVE PSEUDO LABELING FOR OPTICAL FLOW

CLIP-FLow: Contrastive Learning by semi-supervised Iterative Pseudo labeling for Optical Flow Estimation

Self-Corrected Flow Distillation for Consistent One-Step and Few-Step Text-to-Image Generation

Exploring Straighter Trajectories of Flow Matching with Diffusion Guidance

Dynamic Conditional Optimal Transport through Simulation-Free Flows

Optimal Transport-Guided Conditional Score-Based Diffusion Models

Latent Space Editing in Transformer-Based Flow Matching

Improving and generalizing flow-based generative models with minibatch optimal transport