MVInpainter: Learning Multi-View Consistent Inpainting to Bridge 2D and 3D Editing

Chenjie Cao, Chaohui Yu, Yanwei Fu, Fan Wang, Xiangyang Xue
2024-08-15
Abstract: Novel View Synthesis (NVS) and 3D generation have recently achieved prominent improvements. However, these works mainly focus on confined categories or synthetic 3D assets, which are discouraged from generalizing to challenging in-the-wild scenes and fail to be employed with 2D synthesis directly. Moreover, these methods heavily depended on camera poses, limiting their real-world applications. To overcome these issues, we propose MVInpainter, re-formulating the 3D editing as a multi-view 2D inpainting task. Specifically, MVInpainter partially inpaints multi-view images with the reference guidance rather than intractably generating an entirely novel view from scratch, which largely simplifies the difficulty of in-the-wild NVS and leverages unmasked clues instead of explicit pose conditions. To ensure cross-view consistency, MVInpainter is enhanced by video priors from motion components and appearance guidance from concatenated reference key&value attention. Furthermore, MVInpainter incorporates slot attention to aggregate high-level optical flow features from unmasked regions to control the camera movement with pose-free training and inference. Sufficient scene-level experiments on both object-centric and forward-facing datasets verify the effectiveness of MVInpainter, including diverse tasks, such as multi-view object removal, synthesis, insertion, and replacement. The project page is https://ewrfcas.github.io/MVInpainter/.
Computer Vision and Pattern Recognition
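The abstract's "concatenated reference key&value attention" can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `ref_kv_attention`, the single-head layout, the projection matrices, and all tensor shapes are assumptions chosen for illustration. The idea shown is only that each view's queries attend over its own tokens plus the reference view's tokens, so appearance from the reference can flow into every view.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ref_kv_attention(views, ref, w_q, w_k, w_v):
    """Single-head attention with reference tokens appended to keys/values.

    Hypothetical shapes (assumptions, not taken from the paper):
      views: (V, N, C)  tokens of V target views
      ref:   (N_ref, C) tokens of the reference view
      w_q, w_k, w_v: (C, D) projection matrices
    Returns the attended features (V, N, D) and the attention
    weights (V, N, N + N_ref).
    """
    num_views = views.shape[0]
    # Share the same reference tokens with every view.
    ref_b = np.broadcast_to(ref, (num_views,) + ref.shape)
    # Keys/values see the view's own tokens followed by the reference tokens.
    kv_in = np.concatenate([views, ref_b], axis=1)  # (V, N + N_ref, C)

    q = views @ w_q   # (V, N, D): queries come from the view itself only
    k = kv_in @ w_k   # (V, N + N_ref, D)
    v = kv_in @ w_v   # (V, N + N_ref, D)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    views = rng.normal(size=(3, 4, 8))   # 3 views, 4 tokens each, 8 channels
    ref = rng.normal(size=(5, 8))        # 5 reference tokens
    w_q, w_k, w_v = (rng.normal(size=(8, 6)) for _ in range(3))
    out, weights = ref_kv_attention(views, ref, w_q, w_k, w_v)
    print(out.shape, weights.shape)
```

In this sketch the last `N_ref` columns of `weights` show how strongly each view token draws on the reference. A real model would use multi-head attention inside a diffusion U-Net, which is omitted here.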
What problem does this paper attempt to address?
The paper introduces MVInpainter, a novel approach to bridge 2D and 3D scene editing by formulating 3D editing as a multi-view 2D inpainting task. The authors aim to solve several key problems:

1. **Generalization Across Categories and Real-World Scenes:** Existing methods often focus on specific categories or synthetic 3D assets and struggle to generalize to real-world, in-the-wild scenes with complex lighting and shadows.
2. **Seamless Integration of Foreground and Background:** Methods that integrate 3D assets into neural radiance fields (NeRFs) or 3D Gaussian splatting (3DGS) often fail to blend foreground and background elements seamlessly.
3. **Difficulty Generalizing Novel View Synthesis (NVS):** Current NVS methods, even those enhanced by diffusion models, tend to work only in specific scenarios and fail to generalize to diverse or unseen categories in scene data.
4. **Time-Consuming Instance-Level 3D Editing:** Approaches that perform instance-level 3D editing or integrate priors from single-view text-to-image (T2I) models require costly dataset updates to address multi-view inconsistency.
5. **Dependency on Camera Poses:** Many methods rely heavily on accurate camera poses for both training and inference, which limits their scalability and broader applicability, particularly in scenarios where detailed poses are unavailable, such as short video editing.

To address these issues, the authors propose MVInpainter, a multi-view consistent inpainting model that part