Abstract: Diffusion-based methods have achieved remarkable results in 2D image and 3D object generation. However, the generation of 3D scenes and even $360^{\circ}$ images remains constrained by the limited number of scene datasets, the complexity of 3D scenes themselves, and the difficulty of generating consistent multi-view images. To address these issues, we first establish a large-scale panoramic video-text dataset containing millions of consecutive panoramic keyframes with corresponding panoramic depths, camera poses, and text descriptions. Then, we propose a novel text-driven panoramic generation framework, termed DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation. Specifically, benefiting from the powerful generative capabilities of Stable Diffusion, we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset. We further design a spherical epipolar-aware multi-view diffusion model to ensure the multi-view consistency of the generated panoramic images. Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images from unseen text descriptions and camera poses.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to generate scalable and multi-view consistent panoramic images from text descriptions and camera poses (see Figure 1). Specifically, existing methods face the following issues when generating 360° panoramic images:

1. **Insufficient Datasets**: Rich and diverse panoramic datasets are lacking, so the task of generating multi-view panoramic images from text cannot be adequately supported.
2. **Poor Consistency in Generation**: Existing methods struggle to keep images consistent across viewpoints when generating multi-view panoramic images.
3. **Limited Application Scenarios**: Existing single-view panoramic generation methods mainly support 3 degrees of freedom (3DoF) roaming and cannot generate multi-view panoramic images that support 6 degrees of freedom (6DoF) roaming.

To address these issues, the paper makes the following key contributions:

1. **Proposed a New Task**: For the first time, the task of generating scalable and multi-view consistent panoramic images from text descriptions and camera poses is proposed.
2. **Constructed a Large-Scale Dataset**: A large-scale panoramic video-text dataset is established, containing millions of panoramic keyframes with their corresponding panoramic depths, camera poses, and text descriptions.
3. **Designed a New Generation Framework**: A text-driven panoramic generation framework, DiffPano, is proposed. It consists of a single-view text-to-panorama diffusion model and an epipolar-aware multi-view diffusion model, and it can generate scalable and consistent panoramic images.

### Specific Methods

1. **Single-View Panoramic Stable Diffusion Model**:
   - Uses LoRA (Low-Rank Adaptation) to fine-tune the pre-trained perspective-image diffusion model so that it generates single-view panoramic images.
   - Augments the training data by randomly stitching the left and right parts of panoramic images, which improves the left-right continuity of the generated images.
2. **Epipolar-Aware Multi-View Diffusion Model**:
   - Introduces epipolar constraints to keep the generated multi-view panoramic images consistent across viewpoints.
   - Designs an epipolar-aware attention module that computes the epipolar relationship between the target view and the reference view to achieve multi-view consistency.

### Experimental Results

1. **Single-View Panoramic Generation**:
   - Compared to baseline methods such as Text2Light and PanFusion, the proposed Pano-SD method performs well in generation quality, diversity, and inference time.
   - Quantitative evaluation shows that Pano-SD achieves a slightly higher CS value than the other methods and a significantly lower FID than Text2Light, approaching that of PanFusion.
2. **Multi-View Panoramic Generation**:
   - User studies and quantitative evaluations verify the effectiveness of the epipolar-aware attention module.
   - Experimental results show that DiffPano outperforms MVDream and PanFusion in image quality, text-image consistency, and multi-view consistency.

In summary, this paper addresses the problem of generating scalable and multi-view consistent panoramic images from text descriptions and camera poses by constructing a large-scale dataset and designing a new generation framework. It has broad application prospects, such as immersive VR roaming and interior home design previews.
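The left-right stitching augmentation listed under Specific Methods can be illustrated with a small sketch. It assumes the augmentation amounts to cutting an equirectangular panorama at a random column and swapping the two parts, so that the original left and right borders end up adjacent in the interior of the image. The summary does not give the authors' exact procedure, so the function names `swap_left_right` and `random_swap`, the uniform choice of cut column, and the toy list-of-rows image are all illustrative assumptions rather than the paper's implementation.

```python
import random


def swap_left_right(pano, cut):
    """Cut every row at column `cut` and swap the two parts.

    `pano` is a list of rows, each row a list of pixels. Because an
    equirectangular panorama wraps around horizontally, this is a
    circular shift: no pixels are lost, only the seam position moves.
    """
    return [row[cut:] + row[:cut] for row in pano]


def random_swap(pano, rng):
    """Apply swap_left_right at a uniformly random column (an assumed policy)."""
    width = len(pano[0])
    cut = rng.randrange(width)  # cut == 0 leaves the image unchanged
    return swap_left_right(pano, cut), cut


# Toy 2 x 8 "panorama" whose pixels record their own (row, column) position.
rng = random.Random(0)
pano = [[(r, c) for c in range(8)] for r in range(2)]
augmented, cut = random_swap(pano, rng)

# The shift is invertible: rolling by the remaining width restores the input.
restored = swap_left_right(augmented, (8 - cut) % 8)
```

With real image arrays the same operation would normally be a single array roll along the width axis. The point of the sketch is only that the model sees the wrap-around seam at arbitrary horizontal positions during training, which is one plausible reading of how the augmentation encourages left-right continuity.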

DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion

Taming Stable Diffusion for Text to 360° Panorama Image Generation

HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions

Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models

SceneDreamer360: Text-Driven 3D-Consistent Scene Generation with Panoramic Gaussian Splatting

Customizing 360-Degree Panoramas through Text-to-Image Diffusion Models

360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model

LayerPano3D: Layered 3D Panorama for Hyper-Immersive Scene Generation

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion

DeepPanoContext: Panoramic 3D Scene Understanding with Holistic Scene Context Graph and Relation-based Optimization

360-Degree Panorama Generation from Few Unregistered NFoV Images

PanoDiffusion: 360-degree Panorama Outpainting via Diffusion

PGDM: Multimodal Panoramic Image Generation with Diffusion Models

CamFreeDiff: Camera-free Image to Panorama Generation with Diffusion Model

Text2Light: Zero-Shot Text-Driven HDR Panorama Generation

4K4DGen: Panoramic 4D Generation at 4K Resolution

TwinDiffusion: Enhancing Coherence and Efficiency in Panoramic Image Generation with Diffusion Models

Multi-Scale Diffusion: Enhancing Spatial Layout in High-Resolution Panoramic Image Generation

VidPanos: Generative Panoramic Videos from Casual Panning Videos

360PanT: Training-Free Text-Driven 360-Degree Panorama-to-Panorama Translation