Diffusion with Forward Models: Solving Stochastic Inverse Problems Without Direct Supervision

Ayush Tewari, Tianwei Yin, George Cazenavette, Semon Rezchikov, Joshua B. Tenenbaum, Frédo Durand, William T. Freeman, Vincent Sitzmann
2023-11-17
Abstract: Denoising diffusion models are a powerful type of generative model used to capture complex distributions of real-world signals. However, their applicability is limited to scenarios where training samples are readily available, which is not always the case in real-world applications. For example, in inverse graphics, the goal is to generate samples from a distribution of 3D scenes that align with a given image, but ground-truth 3D scenes are unavailable and only 2D images are accessible. To address this limitation, we propose a novel class of denoising diffusion probabilistic models that learn to sample from distributions of signals that are never directly observed. Instead, these signals are measured indirectly through a known differentiable forward model, which produces partial observations of the unknown signal. Our approach involves integrating the forward model directly into the denoising process. This integration effectively connects the generative modeling of observations with the generative modeling of the underlying signals, allowing for end-to-end training of a conditional generative model over signals. During inference, our approach enables sampling from the distribution of underlying signals that are consistent with a given partial observation. We demonstrate the effectiveness of our method on three challenging computer vision tasks. For instance, in the context of inverse graphics, our model enables direct sampling from the distribution of 3D scenes that align with a single 2D input image.
Computer Vision and Pattern Recognition, Graphics, Machine Learning
What problem does this paper attempt to address?
The paper addresses a class of problems known as "stochastic inverse problems": recovering the distribution of underlying signals from partial observations, without direct supervision. In particular, it studies how to use diffusion models to model complex real-world signal distributions when the signals themselves are never available as training samples.

### Research Objectives

1. **Develop new conditional denoising diffusion probabilistic models**: These models sample from distributions of signals that are never directly observed and are only measured indirectly, through partial observations produced by a known differentiable forward model.
2. **Avoid two-stage methods**: Existing techniques often first reconstruct a large dataset of signals and then train a generative model on that dataset. The proposed method is trained end to end, avoiding this two-stage process.
3. **Directly generate diverse samples**: At test time, the method directly generates diverse signal samples that are consistent with a given partial observation.

### Key Contributions

1. A new approach that integrates any differentiable forward model with a conditional denoising diffusion model, replacing the previous two-stage pipeline with a conditional generative model trained end to end.
2. The framework is used to build the first conditional diffusion model that learns to sample from the distribution of 3D scenes using only 2D images for training. Unlike previous work, it directly learns image-conditioned 3D radiance field generation rather than only sampling from the conditional distribution of novel views.
3. A formal proof that, under certain assumptions, as the number of observations of each signal in the training set tends to infinity, the model maximizes not only the likelihood of the observations but also the likelihood of the unobserved signals.
4. The model's effectiveness is demonstrated on two additional downstream tasks: single-image motion prediction and GAN inversion, where the forward models are a deformation (warping) operation and a pretrained StyleGAN generator, respectively.

### Technical Details

- **Denoising Diffusion Models**: Denoising diffusion probabilistic models can generate highly diverse samples, but existing methods require direct access to the signal during training. The proposed method does not; it is trained by integrating a differentiable forward model into the denoising process.
- **Loss Function**: Training minimizes two loss terms: one on the target observation and one on observations produced under new forward-model parameters. Together these terms approximate the total observation loss, so the model is trained to maximize the likelihood of all possible observations of the signal.
- **3D Scene Generation**: For inverse graphics, a 3D-structured denoising operator based on pixel-aligned features is constructed, which can learn the distribution of 3D scenes from image observations alone. Given context images and their camera poses, a target pose is selected. An encoder extracts pixel-aligned features from the context views, and these are used to render a deterministic estimate of the target view. This estimate is then combined with the noisy target observation to extract features for the target view, from which the 3D scene is generated.

### Experimental Results

- The method was evaluated on the Co3D hydrants and RealEstate10K datasets, and the results show that it generates diverse and plausible 3D scene samples.
- Compared to existing methods such as pixelNeRF and SparseFusion, the method performs well in terms of visual quality and diversity, especially in uncertain regions.
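
The two-term loss described under Technical Details can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a toy linear forward model (a measurement matrix applied to a vector signal) in place of a differentiable renderer, and the names `forward_model`, `add_noise`, `two_term_loss`, and the `denoiser` callable are hypothetical, chosen only for illustration. The sketch shows the structure of one training step: noise the target observation, let the denoiser estimate the underlying signal, push that estimate through the forward model, and compare against observations only.

```python
import numpy as np

def forward_model(signal, phi):
    """Toy forward model (assumption): a linear partial observation.
    phi is an (m, d) measurement matrix with m < d."""
    return phi @ signal

def add_noise(observation, alpha_bar, rng):
    """DDPM-style noising of an observation at cumulative signal level alpha_bar."""
    eps = rng.standard_normal(observation.shape)
    return np.sqrt(alpha_bar) * observation + np.sqrt(1.0 - alpha_bar) * eps

def two_term_loss(denoiser, o_ctxt, phi_ctxt, o_trgt, phi_trgt,
                  o_novel, phi_novel, alpha_bar, rng):
    """Loss on the target observation plus loss on an observation under new
    forward-model parameters. The signal itself is never used as supervision."""
    noisy_trgt = add_noise(o_trgt, alpha_bar, rng)
    # The denoiser estimates the underlying signal, not the observation.
    s_hat = denoiser(noisy_trgt, alpha_bar, o_ctxt, phi_ctxt, phi_trgt)
    loss_trgt = np.mean((forward_model(s_hat, phi_trgt) - o_trgt) ** 2)
    loss_novel = np.mean((forward_model(s_hat, phi_novel) - o_novel) ** 2)
    return loss_trgt + loss_novel

# Toy data: one hidden signal seen through three different measurement matrices.
rng = np.random.default_rng(0)
d, m = 8, 3
signal = rng.standard_normal(d)  # used only to simulate observations
phi_ctxt, phi_trgt, phi_novel = (rng.standard_normal((m, d)) for _ in range(3))
o_ctxt = forward_model(signal, phi_ctxt)
o_trgt = forward_model(signal, phi_trgt)
o_novel = forward_model(signal, phi_novel)

# Two stand-in denoisers: an oracle that returns the true signal, and one that returns zeros.
oracle = lambda noisy, a, oc, pc, pt: signal
zeros = lambda noisy, a, oc, pc, pt: np.zeros(d)

loss_oracle = two_term_loss(oracle, o_ctxt, phi_ctxt, o_trgt, phi_trgt,
                            o_novel, phi_novel, 0.5, rng)
loss_zeros = two_term_loss(zeros, o_ctxt, phi_ctxt, o_trgt, phi_trgt,
                           o_novel, phi_novel, 0.5, rng)
```

In this toy setting the oracle denoiser attains zero loss while the zero denoiser does not, which reflects the idea that matching observations under several forward-model parameters constrains the signal estimate. In a real setup the denoiser would be a trained network and the loss would be minimized by gradient descent through the forward model.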