Abstract: Recently, text-to-image generation with diffusion models has made significant advancements in both fidelity and generalization capabilities compared to previous baselines. However, generating holistic multi-view consistent images from prompts remains an important and challenging task. To address this challenge, we propose a diffusion process that attends to time-dependent spatial frequencies of features with a novel attention mechanism as well as a novel noise initialization technique and a cross-attention loss. This Fourier-based attention block focuses on features from non-overlapping regions of the generated scene in order to better align the global appearance. Our noise initialization technique incorporates shared noise and low spatial frequency information derived from pixel coordinates and depth maps to induce noise correlations across views. The cross-attention loss further aligns features sharing the same prompt across the scene. Our technique improves SOTA on several quantitative metrics with qualitatively better results when compared to other state-of-the-art approaches for multi-view consistency.

What problem does this paper attempt to address?

This paper attempts to solve the consistency problem in multi-view image generation, especially geometric and appearance consistency in non-overlapping regions. Although existing text-to-image generation methods based on diffusion models have made significant progress in single-view image generation, they still face challenges when generating multi-view consistent images. These problems are mainly reflected as follows:

1. **Consistency in non-overlapping regions**: Existing methods perform poorly when dealing with non-overlapping regions between views, resulting in inconsistent appearance and geometric structures in these regions.
2. **Differences in initialization noise**: There is an information gap between the noise used in training and the noise used in inference, which affects the consistency of multi-view image generation.
3. **Insufficient feature alignment**: Existing methods have limited effectiveness in aligning features between different views, especially in non-overlapping regions.

To solve these problems, the authors propose a new diffusion process that combines coordinate noise with a Fourier attention mechanism and improves the consistency of multi-view image generation by introducing a cross-attention loss. The specific technical contributions are as follows:

### 1. Coordinate noise initialization

The authors propose a new noise initialization technique that initializes noise samples from shared noise and low-spatial-frequency information. The specific formulas are as follows:

\[ \hat{\epsilon}_i = w\cdot c_i+(1 - w)\cdot\epsilon_{\text{shared}} \]

\[ \hat{z}_T^i=\sqrt{\bar{\alpha}_T}\,\hat{\epsilon}_i+\sqrt{1 - \bar{\alpha}_T}\,\epsilon_i \]

where \(c_i\) represents the low-frequency coordinate/depth map information of the \(i\)-th view, \(\epsilon_{\text{shared}}\) is the noise shared by all views, \(\epsilon_i\) is the Gaussian noise independently sampled for each view, and \(w\) is a weight parameter.

### 2. Fourier attention module

To align the features in non-overlapping regions, the authors introduce the Fourier attention module (FBA), which selects different spatial-frequency features according to the denoising time step. The specific steps include:

1. Perform a fast Fourier transform (FFT) on the feature map:

   \[ F(m, n)=\sum_{h, w}x(h, w)\exp\left(-j2\pi\left(\frac{h}{H}m+\frac{w}{W}n\right)\right) \]

   where \(j^2 = -1\).

2. Create a mask according to the time step to select features of a specific frequency:

   \[ r_t = 1-\frac{t}{T} \]

   \[ M_{r_t}^F=\begin{cases} 1 & \text{if }(h, w)\notin[-r_tH:r_tH, -r_tW:r_tW]\\ 0 & \text{otherwise} \end{cases} \]

3. Apply the inverse transform and combine with position encoding:

   \[ \bar{G}_j^t = F^{-1}(M_{r_t}^F\odot F(G_j^t))+\gamma(1 - r_t) \]

4. Finally, combine the features of overlapping and non-overlapping regions:

   \[ V_{i,j}^t = M_{\text{ovr}}^{i,j}\odot\bar{F}_j^t+(1 - M_{\text{ovr}}^{i,j})\odot\bar{G}_j^t \]

### 3. Cross-attention loss

To ensure consistency between different views, the authors introduce the cross-attention loss (XA Loss).
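The coordinate noise initialization and the time-dependent frequency mask summarized above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `coordinate_noise_init` and `fourier_highpass`, the array layouts, and the reading of the mask indices as centered integer frequency indices are all assumptions made here. The position encoding \(\gamma\) and the overlap mask \(M_{\text{ovr}}\) are omitted because the summary does not specify their form.

```python
import numpy as np


def coordinate_noise_init(c, alpha_bar_T, w, rng):
    """Sketch of the coordinate noise initialization.

    c           : (V, C, H, W) low-frequency coordinate/depth information,
                  one slice per view (layout is an assumption).
    alpha_bar_T : scalar cumulative noise-schedule term at the final step T.
    w           : scalar weight between coordinate info and shared noise.
    """
    # One noise sample shared by all views (broadcast over the view axis).
    eps_shared = rng.standard_normal(c.shape[1:])
    # eps_hat_i = w * c_i + (1 - w) * eps_shared
    eps_hat = w * c + (1.0 - w) * eps_shared
    # Independent Gaussian noise per view.
    eps_ind = rng.standard_normal(c.shape)
    # z_T^i = sqrt(alpha_bar_T) * eps_hat_i + sqrt(1 - alpha_bar_T) * eps_i
    return np.sqrt(alpha_bar_T) * eps_hat + np.sqrt(1.0 - alpha_bar_T) * eps_ind


def fourier_highpass(x, t, T):
    """Sketch of steps 1-3 of the Fourier attention module without gamma.

    x : (..., H, W) real feature map; the FFT runs over the last two axes.
    The mask is 1 outside the box [-r_t*H, r_t*H] x [-r_t*W, r_t*W] of
    centered frequency indices and 0 inside it (a literal reading of the
    formula; the exact box scaling is an assumption).
    """
    H, W = x.shape[-2:]
    r_t = 1.0 - t / T
    # Integer frequency indices in FFT order, e.g. [0, 1, ..., -2, -1].
    m = np.fft.fftfreq(H) * H
    n = np.fft.fftfreq(W) * W
    inside = (np.abs(m)[:, None] <= r_t * H) & (np.abs(n)[None, :] <= r_t * W)
    mask = (~inside).astype(x.dtype)
    # Forward FFT, mask, inverse FFT; the input is real, so keep the real part.
    return np.fft.ifft2(mask * np.fft.fft2(x)).real


# Usage with made-up shapes: 3 views, 4 channels, 8x8 latents.
rng = np.random.default_rng(0)
coords = rng.standard_normal((3, 4, 8, 8))
z_T = coordinate_noise_init(coords, alpha_bar_T=0.5, w=0.3, rng=rng)
filtered = fourier_highpass(z_T, t=500, T=1000)
```

Under this literal reading, the mask removes only the DC component at \(t = T\) (where \(r_t = 0\)) and removes progressively more low frequencies as \(t\) decreases, so a different box scaling in the original method would change which time steps pass which frequencies.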

Multi-view Image Diffusion via Coordinate Noise and Fourier Attention
