Abstract: This paper studies multi-view image synthesis: generating specific image elements or entire scenes that remain visually consistent with reference images. We divide the task into two settings: local synthesis, guided by structural cues from reference images (reference-based inpainting, Ref-inpainting), and global synthesis, which generates entirely new images based solely on reference examples (novel view synthesis, NVS). Text-to-Image (T2I) generative models have recently gained attention in various domains, but adapting them to multi-view synthesis is challenging because of the intricate correlations between reference and target images. To address these challenges efficiently, we introduce Attention Reactivated Contextual Inpainting (ARCI), a unified approach that reformulates both local and global reference-based multi-view synthesis as contextual inpainting, enhanced by the attention mechanisms already present in T2I models. Specifically, self-attention is leveraged to learn feature correlations across different reference views, while cross-attention is used to control the generation through prompt tuning. ARCI is built upon Stable Diffusion fine-tuned for text-guided inpainting, and its contributions include handling difficult multi-view synthesis tasks with off-the-shelf T2I models, introducing task- and view-specific prompt tuning for generative control, achieving end-to-end Ref-inpainting, and implementing block causal masking for autoregressive NVS. We also show the versatility of ARCI by extending it, with the same architecture, to multi-view generation with superior consistency, as validated through extensive experiments. Code and models will be released at \url{https://github.com/ewrfcas/ARCI}.
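
As an illustration of the block causal masking mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that the tokens of all views are concatenated view by view into one sequence, and that a query token may attend to every token in its own view and in earlier views but to none in later views. The function name `block_causal_mask` and its parameters are hypothetical.

```python
import numpy as np

def block_causal_mask(num_views: int, tokens_per_view: int) -> np.ndarray:
    """Return a boolean mask of shape (N*T, N*T); True means 'query may attend to key'.

    Assumption: tokens are laid out view-major, i.e. all tokens of view 0,
    then all tokens of view 1, and so on.
    """
    # View index of every token in the concatenated sequence.
    view_ids = np.repeat(np.arange(num_views), tokens_per_view)
    # A query in view i sees keys in views j <= i (full attention within a view).
    return view_ids[:, None] >= view_ids[None, :]

# Example: 3 views with 2 tokens each gives a 6x6 block lower-triangular mask.
mask = block_causal_mask(num_views=3, tokens_per_view=2)
print(mask.astype(int))
```

Under this assumption, attention within a view stays bidirectional, while each view depends only on earlier views, which is the property that autoregressive generation over views relies on.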

Multi-View Unsupervised Image Generation with Cross Attention Guidance

MPS-NeRF: Generalizable 3D Human Rendering from Multiview Images

Multi-task View Synthesis with Neural Radiance Fields

GM-NeRF: Learning Generalizable Model-based Neural Radiance Fields from Multi-view Images

ViewFusion: Towards Multi-View Consistency via Interpolated Denoising

MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model

Harnessing Text-to-Image Attention Prior for Reference-based Multi-view Image Synthesis

ViewFusion: Learning Composable Diffusion Models for Novel View Synthesis

Pixel-Aligned Multi-View Generation with Depth Guided Decoder

Multi-View Consistent Generative Adversarial Networks for 3D-Aware Image Synthesis

Consistent-1-to-3: Consistent Image to 3D View Synthesis via Geometry-aware Diffusion Models

Human Multi-View Synthesis from a Single-View Model: Transferred Body and Face Representations

RC-MVSNet: Unsupervised Multi-View Stereo with Neural Rendering

SyncDreamer: Generating Multiview-consistent Images from a Single-view Image

MultiDiff: Consistent Novel View Synthesis from a Single Image

Multi-Channel Attention Selection GAN With Cascaded Semantic Guidance for Cross-View Image Translation

VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

Multi-view Consistent Generative Adversarial Networks for Compositional 3D-Aware Image Synthesis

Neural Radiance Fields Approach to Deep Multi-View Photometric Stereo

Consolidating Attention Features for Multi-view Image Editing

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion