Taming Latent Diffusion Model for Neural Radiance Field Inpainting

Chieh Hubert Lin, Changil Kim, Jia-Bin Huang, Qinbo Li, Chih-Yao Ma, Johannes Kopf, Ming-Hsuan Yang, Hung-Yu Tseng
2024-04-16
Abstract: Neural Radiance Field (NeRF) is a representation for 3D reconstruction from multi-view images. Despite some recent work showing preliminary success in editing a reconstructed NeRF with diffusion prior, they remain struggling to synthesize reasonable geometry in completely uncovered regions. One major reason is the high diversity of synthetic contents from the diffusion model, which hinders the radiance field from converging to a crisp and deterministic geometry. Moreover, applying latent diffusion models on real data often yields a textural shift incoherent to the image condition due to auto-encoding errors. These two problems are further reinforced with the use of pixel-distance losses. To address these issues, we propose tempering the diffusion model's stochasticity with per-scene customization and mitigating the textural shift with masked adversarial training. During the analyses, we also found the commonly used pixel and perceptual losses are harmful in the NeRF inpainting task. Through rigorous experiments, our framework yields state-of-the-art NeRF inpainting results on various real-world scenes. Project page:
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses the shortcomings of existing approaches to Neural Radiance Field (NeRF) inpainting. Although existing methods achieve high-quality 3D reconstruction and novel view synthesis from multi-view images, they still struggle to generate reasonable geometry in completely uncovered regions. In addition, when a latent diffusion model (LDM) is used for 2D image inpainting, auto-encoding errors usually cause a texture shift, which introduces obvious artifacts in the final inpainted NeRF.

To solve these problems, the authors propose the following measures:

1. **Reduce the randomness of the diffusion model**: Customize the diffusion model for each scene so that its outputs better match the characteristics of that specific scene.
2. **Alleviate texture shift**: Use masked adversarial training to hide the inpainting boundaries, so that the discriminator cannot use these boundaries to identify real image patches. This reduces the texture difference between the inpainted region and the reconstructed region.
3. **Optimize the design of the loss function**: The commonly used pixel-level and perceptual losses are found to be harmful to the NeRF inpainting task. The authors therefore propose a new combination of loss functions that includes an adversarial loss and a feature-matching loss.

With these improvements, the proposed method (MALD-NeRF) achieves state-of-the-art NeRF inpainting results on multiple real-world scene datasets, especially in terms of high-frequency detail preservation and texture consistency.

### Specific Problem Summary

- **Problem Background**: NeRF inpainting tends to produce unreasonable geometry in completely uncovered regions, and texture shift is likely to occur when an LDM is used for 2D image inpainting.
- **Solutions**:
  - Use masked adversarial training to reduce the texture difference between the inpainted region and the reconstructed region.
  - Reduce the randomness of the diffusion model by customizing it for each scene.
  - Design a new combination of loss functions to avoid the negative impact of the commonly used pixel-level and perceptual losses.
- **Experimental Results**: MALD-NeRF achieves better results than existing methods on multiple datasets, both in visual quality and in quantitative evaluation metrics such as FID and KID.

### Mathematical Formula Representation

- **Adversarial Loss**:
  \[ L_{\text{adv}} = f(D(C_m(x_m))) + f(-D(C_r(\hat{x}_r))) \]
  where \( f(x) = -\log(1 + \exp(-x)) \), \( C_m \) and \( C_r \) are the mask functions of the inpainting region and the non-inpainting region respectively, and \( D \) is the discriminator.
- **Discriminator Feature Matching Loss**:
  \[ L_{\text{fm}} = \| F(C_m(x_m)) - F(C_m(\hat{x}_m)) \|_1 \]
  where \( F \) denotes the features extracted from an intermediate layer of the discriminator.

These improvements make MALD-NeRF more robust and efficient in handling complex NeRF inpainting tasks.
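
To make the two loss formulas above concrete, the following is a minimal NumPy sketch that evaluates them on toy data. It is an illustration under stated assumptions, not the authors' implementation: `ToyDiscriminator` is a hypothetical two-layer stand-in for \( D \) and \( F \), `apply_mask` is a hypothetical reading of \( C_m \) and \( C_r \) as element-wise multiplication by a binary mask, and \( x_m \), \( \hat{x}_m \), \( \hat{x}_r \) are simply treated as image patches of equal size.

```python
import numpy as np


def f(x):
    """f(x) = -log(1 + exp(-x)), evaluated stably with logaddexp."""
    return -np.logaddexp(0.0, -x)


def apply_mask(patch, mask):
    """Hypothetical C_m / C_r: keep pixels where mask == 1, zero the rest."""
    return patch * mask


class ToyDiscriminator:
    """Hypothetical two-layer discriminator (not the paper's architecture)."""

    def __init__(self, n_pixels, n_features=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(scale=0.1, size=(n_pixels, n_features))
        self.w2 = rng.normal(scale=0.1, size=n_features)

    def features(self, patch):
        # F: intermediate-layer activations (ReLU of a linear map).
        return np.maximum(patch.reshape(-1) @ self.w1, 0.0)

    def __call__(self, patch):
        # D: a single scalar logit computed from the features.
        return float(self.features(patch) @ self.w2)


def adversarial_loss(D, x_m, x_hat_r, mask_m, mask_r):
    # L_adv = f(D(C_m(x_m))) + f(-D(C_r(x_hat_r)))
    return f(D(apply_mask(x_m, mask_m))) + f(-D(apply_mask(x_hat_r, mask_r)))


def feature_matching_loss(D, x_m, x_hat_m, mask_m):
    # L_fm = || F(C_m(x_m)) - F(C_m(x_hat_m)) ||_1
    diff = D.features(apply_mask(x_m, mask_m)) - D.features(apply_mask(x_hat_m, mask_m))
    return float(np.abs(diff).sum())


# Toy usage: a 4x4 patch whose central 2x2 block is the inpainting region.
rng = np.random.default_rng(1)
mask_m = np.zeros((4, 4))
mask_m[1:3, 1:3] = 1.0
mask_r = 1.0 - mask_m  # complement: the non-inpainting region

x = rng.random((4, 4))
x_hat = rng.random((4, 4))
D = ToyDiscriminator(n_pixels=16)

l_adv = adversarial_loss(D, x, x_hat, mask_m, mask_r)
l_fm = feature_matching_loss(D, x, x_hat, mask_m)
```

Because \( f(x) \le 0 \) for every \( x \), the adversarial term in this sketch is never positive, and the feature-matching term is zero exactly when the two masked patches produce identical discriminator features. How each term enters the discriminator and NeRF objectives (signs and weights) is not specified in the summary above, so the sketch only evaluates the formulas as written.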