Multi-view Image Diffusion via Coordinate Noise and Fourier Attention

Justin Theiss, Norman Müller, Daeil Kim, Aayush Prakash
2024-12-05
Abstract: Recently, text-to-image generation with diffusion models has made significant advancements in both higher fidelity and generalization capabilities compared to previous baselines. However, generating holistic multi-view consistent images from prompts still remains an important and challenging task. To address this challenge, we propose a diffusion process that attends to time-dependent spatial frequencies of features with a novel attention mechanism as well as a novel noise initialization technique and cross-attention loss. This Fourier-based attention block focuses on features from non-overlapping regions of the generated scene in order to better align the global appearance. Our noise initialization technique incorporates shared noise and low spatial frequency information derived from pixel coordinates and depth maps to induce noise correlations across views. The cross-attention loss further aligns features sharing the same prompt across the scene. Our technique improves SOTA on several quantitative metrics with qualitatively better results when compared to other state-of-the-art approaches for multi-view consistency.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the consistency problem in multi-view image generation, in particular geometric and appearance consistency in non-overlapping regions. Although existing diffusion-based text-to-image methods have made significant progress on single-view generation, they still struggle to generate multi-view consistent images. The main difficulties are:

1. **Consistency in non-overlapping regions**: Existing methods perform poorly on regions that do not overlap between views, producing inconsistent appearance and geometric structure there.
2. **Differences in initialization noise**: There is an information gap between the noise used during training and the noise used at inference, which hurts the consistency of multi-view generation.
3. **Insufficient feature alignment**: Existing methods align features across views only to a limited extent, especially in non-overlapping regions.

To address these problems, the authors propose a new diffusion process that combines coordinate noise with a Fourier attention mechanism and adds a cross-attention loss to improve multi-view consistency. The specific technical contributions are as follows.

### 1. Coordinate noise initialization

The authors propose a noise initialization technique that builds each view's initial noise sample from shared noise and low-spatial-frequency information:

\[ \hat{\epsilon}_i = w\cdot c_i+(1 - w)\cdot\epsilon_{\text{shared}} \]

\[ \hat{z}_T^i=\sqrt{\bar{\alpha}_T}\hat{\epsilon}_i+\sqrt{1 - \bar{\alpha}_T}\epsilon_i \]

where \(c_i\) is the low-frequency coordinate/depth map information of the \(i\)-th view, \(\epsilon_{\text{shared}}\) is the noise shared by all views, \(\epsilon_i\) is Gaussian noise sampled independently for each view, and \(w\) is a weight parameter.

### 2. Fourier attention module

To align features in non-overlapping regions, the authors introduce the Fourier-based attention (FBA) module, which selects different spatial-frequency features depending on the denoising time step. The steps are:

1. Apply a fast Fourier transform (FFT) to the feature map:

   \[ F(m, n)=\sum_{h, w}x(h, w)\exp\left(-j2\pi\left(\frac{h}{H}m+\frac{w}{W}n\right)\right) \]

   where \(j^2 = -1\).

2. Build a mask from the time step to select features in a specific frequency range:

   \[ r_t = 1-\frac{t}{T} \]

   \[ M_{r_t}^F=\begin{cases} 1 & \text{if }(h, w)\notin[-r_tH:r_tH, -r_tW:r_tW]\\ 0 & \text{otherwise} \end{cases} \]

3. Apply the inverse transform and add the position encoding \(\gamma(1 - r_t)\):

   \[ \bar{G}_j^t = F^{-1}(M_{r_t}^F\odot F(G_j^t))+\gamma(1 - r_t) \]

4. Finally, combine the features of overlapping and non-overlapping regions:

   \[ V_{i,j}^t = M_{\text{ovr}}^{i,j}\odot\bar{F}_j^t+(1 - M_{\text{ovr}}^{i,j})\odot\bar{G}_j^t \]

### 3. Cross-attention loss

To ensure consistency between different views, the authors introduce the cross-attention loss (XA Loss).
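Returning to the coordinate noise initialization of Section 1, the two formulas there can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `coordinate_noise_init`, the `(views, channels, height, width)` array layout, and the example values of `w` and `alpha_bar_T` are hypothetical, and the coordinate maps in the usage lines are a placeholder ramp rather than real coordinate or depth features.

```python
import numpy as np


def coordinate_noise_init(coords, alpha_bar_T, w, rng):
    """Sketch of the per-view noise initialization described in Section 1.

    coords      : array (V, C, H, W), the low-frequency maps c_i (assumed layout)
    alpha_bar_T : scalar in [0, 1], cumulative noise-schedule term at step T
    w           : scalar weight between coordinate information and shared noise
    rng         : numpy.random.Generator
    """
    # epsilon_shared: one Gaussian sample reused by every view.
    eps_shared = rng.standard_normal(coords.shape[1:])
    # epsilon_i: Gaussian noise drawn independently for each view.
    eps_indep = rng.standard_normal(coords.shape)

    # eps_hat_i = w * c_i + (1 - w) * eps_shared (eps_shared broadcasts over views).
    eps_hat = w * coords + (1.0 - w) * eps_shared

    # z_T^i = sqrt(alpha_bar_T) * eps_hat_i + sqrt(1 - alpha_bar_T) * eps_i
    return np.sqrt(alpha_bar_T) * eps_hat + np.sqrt(1.0 - alpha_bar_T) * eps_indep


# Hypothetical usage: 4 views, 4 latent channels, 8x8 latents.
rng = np.random.default_rng(0)
coords = np.broadcast_to(np.linspace(-1.0, 1.0, 8), (4, 4, 8, 8))
z_T = coordinate_noise_init(coords, alpha_bar_T=0.05, w=0.3, rng=rng)
```

Because `eps_shared` and the coordinate term enter every view, the resulting samples are correlated across views, while the independent term keeps each view's noise distinct.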
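The time-dependent frequency masking of Section 2 (steps 1 to 3, without the position encoding \(\gamma\)) can likewise be sketched in NumPy. This is an illustrative reading of the formulas rather than the authors' code: the function names are hypothetical, the feature map is a single real-valued 2D array, and the mask is interpreted over signed integer frequency indices (as returned by `np.fft.fftfreq`) with a closed box, which is an assumption about the indexing convention.

```python
import numpy as np


def frequency_mask(height, width, t, T):
    """Mask that is 1 outside the box [-r_t*H, r_t*H] x [-r_t*W, r_t*W], else 0."""
    r_t = 1.0 - t / T
    # Signed integer frequency indices in NumPy's FFT ordering (assumed convention).
    m = np.fft.fftfreq(height) * height
    n = np.fft.fftfreq(width) * width
    inside = (np.abs(m)[:, None] <= r_t * height) & (np.abs(n)[None, :] <= r_t * width)
    return (~inside).astype(np.float64)


def filter_features(g, t, T):
    """Apply F^{-1}(M ⊙ F(g)) to a real (H, W) feature map g."""
    mask = frequency_mask(g.shape[0], g.shape[1], t, T)
    spectrum = np.fft.fft2(g)
    # The mask is symmetric in the sign of each frequency, so the result is
    # real up to floating-point error; keep only the real part.
    return np.fft.ifft2(mask * spectrum).real


# Hypothetical usage on a random 8x8 feature map with T = 1000 steps.
rng = np.random.default_rng(0)
g = rng.standard_normal((8, 8))
g_filtered = filter_features(g, t=900, T=1000)
```

Under this reading, the excluded low-frequency box grows as \(t\) decreases, so fewer and higher frequencies pass through the mask at later denoising steps.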