MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing

Chenjie Cao,Chaohui Yu,Yanwei Fu,Fan Wang,Xiangyang Xue

2024-08-15

Abstract: Novel View Synthesis (NVS) and 3D generation have recently achieved prominent improvements. However, these works mainly focus on confined categories or synthetic 3D assets, which hinders generalization to challenging in-the-wild scenes and prevents them from being used directly with 2D synthesis. Moreover, these methods depend heavily on camera poses, limiting their real-world applications. To overcome these issues, we propose MVInpainter, which re-formulates 3D editing as a multi-view 2D inpainting task. Specifically, MVInpainter partially inpaints multi-view images with reference guidance rather than intractably generating an entirely novel view from scratch, which greatly reduces the difficulty of in-the-wild NVS and leverages unmasked clues instead of explicit pose conditions. To ensure cross-view consistency, MVInpainter is enhanced by video priors from motion components and appearance guidance from concatenated reference key&value attention. Furthermore, MVInpainter incorporates slot attention to aggregate high-level optical flow features from unmasked regions to control the camera movement with pose-free training and inference. Extensive scene-level experiments on both object-centric and forward-facing datasets verify the effectiveness of MVInpainter across diverse tasks, such as multi-view object removal, synthesis, insertion, and replacement. The project page is https://ewrfcas.github.io/MVInpainter/.
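The "concatenated reference key&value attention" mentioned in the abstract can be illustrated with a minimal NumPy sketch. The abstract gives no implementation details, so everything here is an assumption made for illustration: the function name `ref_kv_attention`, the single-head layout, the shared projection matrices, and the token shapes are hypothetical and not taken from the authors' code. The idea shown is only that queries come from the target view, while keys and values are built from target tokens and reference-view tokens joined along the token axis, so each target token can attend to the reference appearance.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ref_kv_attention(target, reference, w_q, w_k, w_v):
    """Single-head attention with reference keys/values concatenated (illustrative).

    target:    (n_t, d) tokens of the view being inpainted
    reference: (n_r, d) tokens of the reference view
    w_q, w_k, w_v: (d, d) projection matrices (hypothetical, shared by both views)
    """
    q = target @ w_q
    # Keys and values from target and reference tokens, joined on the token axis.
    k = np.concatenate([target @ w_k, reference @ w_k], axis=0)
    v = np.concatenate([target @ w_v, reference @ w_v], axis=0)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 8
target = rng.normal(size=(4, d))      # 4 target-view tokens
reference = rng.normal(size=(6, d))   # 6 reference-view tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, weights = ref_kv_attention(target, reference, w_q, w_k, w_v)
```

In this sketch each row of `weights` spans both the target's own tokens and the reference tokens, which is what lets appearance information flow from the reference view into the inpainted view.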

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper introduces MVInpainter, a novel approach to bridge 2D and 3D scene editing by formulating 3D editing as a multi-view 2D inpainting task. The authors aim to solve several key problems:

1. **Generalization Across Categories and Real-World Scenes:** Existing methods often focus on specific categories or synthetic 3D assets and struggle to generalize to real-world, in-the-wild scenes with complex lighting and shadows.
2. **Seamless Integration of Foreground and Background:** Methods that integrate 3D assets into neural radiance fields (NeRFs) or 3D Gaussian splatting (3DGS) often fail to blend foreground and background elements seamlessly.
3. **Difficulty Generalizing Novel View Synthesis (NVS):** Current NVS methods, even those enhanced by diffusion models, tend to work only in specific scenarios and fail to generalize to diverse or unseen categories in scene data.
4. **Time-Consuming Instance-Level 3D Editing:** Approaches that perform instance-level 3D editing or integrate priors from single-view text-to-image (T2I) models require costly dataset updates to address multi-view inconsistency.
5. **Dependency on Camera Poses:** Many methods rely heavily on accurate camera poses for both training and inference, which limits their scalability and broader applicability, particularly in scenarios where detailed poses are unavailable, such as short video editing.

To address these issues, the authors propose MVInpainter, a multi-view consistent inpainting model that part

Single-Mask Inpainting for Voxel-Based Neural Radiance Fields

iNVS: Repurposing Diffusion Inpainters for Novel View Synthesis

DVI: Depth Guided Video Inpainting for Autonomous Driving

MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior

Harnessing Text-to-Image Attention Prior for Reference-based Multi-view Image Synthesis

UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts

MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes

Efficient MRF-based Disocclusion Inpainting in Multiview Video

Performance Optimizations for Patchmatch-Based Pixel-Level Multiview Inpainting

MVSM-CLP: Multi View Synthesis Method for Chinese Landscape Painting Based on Depth Estimation

MV-DUSt3R+: Single-Stage Scene Reconstruction from Sparse Views In 2 Seconds

Deep Face Video Inpainting via UV Mapping

Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model

Deep Interactive Video Inpainting: an Invisibility Cloak for Harry Potter

DiffMVR: Diffusion-based Automated Multi-Guidance Video Restoration

Rethinking the Multi-view Stereo from the Perspective of Rendering-based Augmentation

Image Inpainting by End-to-End Cascaded Refinement With Mask Awareness

MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Vision-Infused Deep Audio Inpainting