COT Flow: Learning Optimal-Transport Image Sampling and Editing by Contrastive Pairs

Xinrui Zu, Qian Tao
2024-06-18
Abstract: Diffusion models have demonstrated strong performance in sampling and editing multi-modal data with high generation quality, yet they suffer from the iterative generation process which is computationally expensive and slow. In addition, most methods are constrained to generate data from Gaussian noise, which limits their sampling and editing flexibility. To overcome both disadvantages, we present Contrastive Optimal Transport Flow (COT Flow), a new method that achieves fast and high-quality generation with improved zero-shot editing flexibility compared to previous diffusion models. Benefiting from optimal transport (OT), our method has no limitation on the prior distribution, enabling unpaired image-to-image (I2I) translation and doubling the editable space (at both the start and end of the trajectory) compared to other zero-shot editing methods. In terms of quality, COT Flow can generate competitive results in merely one step compared to previous state-of-the-art unpaired I2I translation methods. To highlight the advantages of COT Flow through the introduction of OT, we introduce the COT Editor to perform user-guided editing with excellent flexibility and quality. The code will be released at https://github.com/zuxinrui/cot_flow.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses two major limitations of existing diffusion models in image generation and editing tasks:

1. **Low sampling efficiency**: Diffusion models usually require an iterative generation process, which is computationally expensive and slow.
2. **Limitations of the prior distribution**: Most methods are limited to generating data from Gaussian noise, which restricts the flexibility of their sampling and editing.

To solve these problems, the authors propose a new method named Contrastive Optimal Transport Flow (COT Flow). COT Flow overcomes both shortcomings by introducing optimal transport (OT) theory, specifically:

- **Improving sampling efficiency**: COT Flow can achieve fast, high-quality image generation in one step or a few steps, reducing the computational cost compared with the iterative sampling of traditional diffusion models.
- **Enhancing editing flexibility**: COT Flow not only supports unpaired image-to-image translation, but also allows users to perform flexible zero-shot image editing, that is, images can be edited without additional training.

In addition, the paper introduces the COT Editor to further extend the editing ability, including but not limited to the following scenarios:

- **COT Composition**: Users can composite elements and generate realistic images.
- **Shape-Texture Coupling**: Users can draw shapes and textures separately as dual inputs to generate high-quality images that integrate the characteristics of both.
- **COT Augmentation**: It supports application scenarios such as medical image synthesis.

In short, COT Flow aims to solve the inefficiency and lack of flexibility of existing diffusion models in generation and editing tasks, and provides a new framework for fast, high-quality image generation and flexible zero-shot editing.
### Summary of Mathematical Formulas

Several key formulas involved in the paper are as follows:

1. **Optimal transport cost**:
   \[ \text{Cost}(\mu, \nu) := \inf_{\pi \in \Pi(\mu, \nu)} \int_{X \times Y} c(x, y) \, d\pi(x, y) \]
   where $\Pi(\mu, \nu)$ is the set of joint probability distributions on $X \times Y$ with marginals $\mu$ and $\nu$, and $c(x, y)$ is the cost function.
2. **Contrastive learning loss function**:
   \[ L(\theta, \theta^-) := d(q_\theta(E_\theta(x)), E_{\theta^-}(x^+)) \]
   where $E$ is the encoder network, $\theta^-$ is the exponential moving average (EMA) of the parameter $\theta$, and $d(\cdot, \cdot)$ is the distance function.
3. **Consistency model loss function**:
   \[ L_N(\theta, \theta^-) := \mathbb{E}\left[\lambda(t_i) \, d(f_\theta(x_{t_{i + 1}}, t_{i + 1}), f_{\theta^-}(x_{t_i}, t_i))\right], \quad i \sim U[1, N - 1] \]
4. **COT Pairs loss function**:
   \[ L_{\text{COT}}(\theta) = d(E_\theta(x_{t_1}, t_1), E_\theta(x_{t_2}, t_2)), \quad 0 \leq t_1 < t_2 \leq 1 \]

These formulas show how COT Flow combines the core ideas of optimal transport, contrastive learning, and consistency models to achieve efficient, high-quality image generation.
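
To make the optimal transport cost above concrete, here is a minimal sketch of its discrete special case. It is not code from the paper: the function names, the squared Euclidean cost, and the toy points are all assumptions chosen for illustration. When $\mu$ and $\nu$ are uniform empirical measures on the same number of points, an optimal coupling $\pi$ can be taken to be a one-to-one assignment, so for tiny inputs the infimum can be found by brute force over permutations.

```python
from itertools import permutations


def squared_distance(x, y):
    """Illustrative cost c(x, y): squared Euclidean distance (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def discrete_ot_cost(xs, ys, cost=squared_distance):
    """Brute-force OT cost between two uniform empirical measures.

    Each of the n source points and n target points carries mass 1/n.
    In this setting an optimal coupling can be chosen as a permutation,
    so we enumerate all n! assignments. Only practical for very small n.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally many points on both sides")
    n = len(xs)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        # Average cost of sending source point i to target point perm[i].
        total = sum(cost(xs[i], ys[j]) for i, j in enumerate(perm)) / n
        if total < best_cost:
            best_cost, best_perm = total, perm
    return best_cost, best_perm


# Hypothetical toy data: three 2-D points per distribution.
source = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target = [(0.0, 2.0), (1.0, 1.0), (2.0, 0.0)]
cost_value, assignment = discrete_ot_cost(source, target)
print(cost_value, assignment)
```

Here `assignment[i]` is the index of the target point matched to source point `i`. Enumerating permutations scales factorially, so practical pipelines would rely on a dedicated OT solver; the brute-force search is used only because it mirrors the infimum in the definition directly.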