DiffPano: Scalable and Consistent Text to Panorama Generation with Spherical Epipolar-Aware Diffusion

Weicai Ye, Chenhao Ji, Zheng Chen, Junyao Gao, Xiaoshui Huang, Song-Hai Zhang, Wanli Ouyang, Tong He, Cairong Zhao, Guofeng Zhang
2024-11-01
Abstract: Diffusion-based methods have achieved remarkable achievements in 2D image or 3D object generation, however, the generation of 3D scenes and even $360^{\circ}$ images remains constrained, due to the limited number of scene datasets, the complexity of 3D scenes themselves, and the difficulty of generating consistent multi-view images. To address these issues, we first establish a large-scale panoramic video-text dataset containing millions of consecutive panoramic keyframes with corresponding panoramic depths, camera poses, and text descriptions. Then, we propose a novel text-driven panoramic generation framework, termed DiffPano, to achieve scalable, consistent, and diverse panoramic scene generation. Specifically, benefiting from the powerful generative capabilities of stable diffusion, we fine-tune a single-view text-to-panorama diffusion model with LoRA on the established panoramic video-text dataset. We further design a spherical epipolar-aware multi-view diffusion model to ensure the multi-view consistency of the generated panoramic images. Extensive experiments demonstrate that DiffPano can generate scalable, consistent, and diverse panoramic images with given unseen text descriptions and camera poses.
Computer Vision and Pattern Recognition, Artificial Intelligence, Graphics, Robotics
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to generate scalable, multi-view consistent panoramic images from text descriptions and camera poses (see Figure 1). Specifically, existing methods face the following issues when generating 360° panoramic images:

1. **Insufficient Datasets**: Rich and diverse panoramic datasets are lacking, so existing data cannot support the task of generating multi-view panoramic images from text.
2. **Poor Consistency in Generation**: Existing methods struggle to keep images consistent across viewpoints when generating multi-view panoramic images.
3. **Limited Application Scenarios**: Existing single-view panoramic generation methods mainly support roaming with 3 degrees of freedom (3DoF) and cannot generate the multi-view panoramic images needed for roaming with 6 degrees of freedom (6DoF).

To address these issues, the paper makes the following key contributions:

1. **Proposed a New Task**: It proposes, for the first time, the task of generating scalable and multi-view consistent panoramic images from text descriptions and camera poses.
2. **Constructed a Large-Scale Dataset**: It establishes a large-scale panoramic video-text dataset containing millions of panoramic keyframes with their corresponding panoramic depths, camera poses, and text descriptions.
3. **Designed a New Generation Framework**: It proposes DiffPano, a text-driven panoramic generation framework. The framework combines a single-view text-to-panorama diffusion model with an epipolar-aware multi-view diffusion model and can generate scalable and consistent panoramic images.

### Specific Methods

1. **Single-View Panoramic Stable Diffusion Model**:
   - Uses LoRA (Low-Rank Adaptation) to fine-tune the pre-trained perspective image diffusion model so that it generates single-view panoramic images.
   - Augments the training data by randomly stitching the left and right parts of panoramic images, which improves the left-right continuity of the generated images.
2. **Epipolar-Aware Multi-View Diffusion Model**:
   - Introduces epipolar constraints so that the generated multi-view panoramic images stay consistent across viewpoints.
   - Designs an epipolar-aware attention module that computes the epipolar relationship between the target view and the reference view to achieve multi-view consistency.

### Experimental Results

1. **Single-View Panoramic Generation**:
   - Compared with baseline methods such as Text2Light and PanFusion, the proposed Pano-SD method excels in generation quality, diversity, and inference time.
   - Quantitative evaluation shows that Pano-SD slightly outperforms the other methods on CS and achieves a significantly lower FID than Text2Light, approaching that of PanFusion.
2. **Multi-View Panoramic Generation**:
   - User studies and quantitative evaluations verify the effectiveness of the epipolar-aware attention module.
   - Experimental results show that DiffPano outperforms MVDream and PanFusion in image quality, text-image consistency, and multi-view consistency.

In summary, this paper addresses the problem of generating scalable and multi-view consistent panoramic images from text descriptions and camera poses by constructing a large-scale dataset and designing a new generation framework. It has broad application prospects, such as immersive VR roaming and interior home design previews.
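
The left-right stitching augmentation listed under Specific Methods can be illustrated with a short sketch. It assumes that "randomly stitching the left and right parts" means cutting an equirectangular panorama at a random column and swapping the two pieces, which is the same as a circular horizontal shift. The function name `random_left_right_stitch` and the array layout (height, width, channels) are hypothetical and are not taken from the paper.

```python
import numpy as np


def random_left_right_stitch(pano, rng):
    """Cut a panorama at a random column and swap the left and right parts.

    Assumption: `pano` is an equirectangular image shaped (height, width, channels).
    Because the left and right borders of such an image are adjacent on the
    sphere, the swapped image shows the same scene with a different seam.
    """
    width = pano.shape[1]
    shift = int(rng.integers(0, width))  # number of columns moved to the front
    cut = width - shift                  # column where the image is cut
    # Stitch the right part in front of the left part.
    stitched = np.concatenate([pano[:, cut:], pano[:, :cut]], axis=1)
    return stitched, shift


# Tiny example: a 2 x 8 "panorama" with 3 channels.
rng = np.random.default_rng(0)
pano = np.arange(2 * 8 * 3).reshape(2, 8, 3)
rolled, shift = random_left_right_stitch(pano, rng)
```

Training on such shifted copies exposes the model to the original seam in the middle of the image, which is one plausible reason the augmentation would help left-right continuity.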
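
The epipolar relationship between a reference view and a target view, mentioned under Specific Methods, can also be sketched for panoramas. The summary does not describe how the paper's attention module is implemented, so the code below only shows the standard geometry under stated assumptions: an equirectangular pixel-to-ray convention chosen for illustration, and a relative pose that maps reference coordinates to target coordinates as `X_t = R @ X_r + t`. Under those assumptions, the match of a reference ray must lie on a great circle in the target view whose normal is `t x (R d_r)`. All function names are hypothetical.

```python
import numpy as np


def pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel to a unit ray (assumed convention)."""
    lon = (u / width) * 2.0 * np.pi - np.pi    # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v / height) * np.pi   # latitude in [-pi/2, pi/2]
    return np.array([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)])


def ray_to_pixel(ray, width, height):
    """Inverse of pixel_to_ray for the same assumed convention."""
    x, y, z = ray / np.linalg.norm(ray)
    lon = np.arctan2(x, z)
    lat = np.arcsin(np.clip(y, -1.0, 1.0))
    u = (lon + np.pi) / (2.0 * np.pi) * width
    v = (np.pi / 2.0 - lat) / np.pi * height
    return u, v


def epipolar_normal(ref_ray, rotation, translation):
    """Normal of the great circle in the target view that contains the match."""
    return np.cross(translation, rotation @ ref_ray)


def epipolar_residual(target_ray, ref_ray, rotation, translation):
    """Absolute cosine between a unit target ray and the epipolar normal.

    Zero means the target ray lies exactly on the epipolar great circle.
    """
    normal = epipolar_normal(ref_ray, rotation, translation)
    norm = np.linalg.norm(normal)
    if norm < 1e-12:  # degenerate case: no baseline, or ray along the baseline
        return 0.0
    return float(abs(target_ray @ normal) / norm)


# Example pose: 30 degree rotation about the vertical axis plus a translation.
angle = np.deg2rad(30.0)
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([0.5, 0.0, 0.2])

# One 3D point seen from both views.
point_ref = np.array([1.0, 0.4, 2.0])
point_tgt = R @ point_ref + t
ref_ray = point_ref / np.linalg.norm(point_ref)
tgt_ray = point_tgt / np.linalg.norm(point_tgt)

residual = epipolar_residual(tgt_ray, ref_ray, R, t)
```

A residual like this could be thresholded to decide which target pixels a reference pixel is allowed to attend to. Whether DiffPano uses a hard mask, a soft weighting, or sampling along the curve is not stated in this summary.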