Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion

Zuoyue Li, Zhenqiang Li, Zhaopeng Cui, Marc Pollefeys, Martin R. Oswald
2024-04-01
Abstract: Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However, challenges arise from significant view changes and scene scale. Previous efforts mainly focused on image or video generation, lacking exploration into the adaptability of scene generation for arbitrary views. Existing 3D generation works either operate at the object level or are difficult to utilize the geometry obtained from satellite imagery. To overcome these limitations, we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques. Specifically, our approach generates texture colors at the point level for a given geometry using a 3D diffusion model first, which is then transformed into a scene representation in a feed-forward manner. The representation can be utilized to render arbitrary views which would excel in both single-frame quality and inter-frame consistency. Experiments in two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to **generate realistic 3D urban scenes directly from satellite images for seamless integration into applications such as games, movies, and map services**. The main challenges of this task are:

1. **Significant differences in viewing angles**: Satellite images and ground-level street-view images are captured from very different viewpoints.
2. **Large-scale scenes**: Urban scenes are very large, so processing them consumes substantial computing resources.
3. **Multi-view consistency**: Existing methods have difficulty maintaining consistency between frames when generating images from different viewing angles.

To solve these problems, the authors propose a new architecture, Sat2Scene, which combines diffusion models with neural rendering techniques to generate realistic 3D urban scenes directly from satellite images, so that images rendered from any viewing angle are consistent and of high quality.

### Main contributions

1. **Propose the diffusion-model-based framework Sat2Scene**, which can generate 3D urban scenes directly from satellite images.
2. **Introduce the diffusion model with sparse representation** for generating scene features closely associated with geometry in 3D space, ensuring that consistent images are generated from any viewing angle. To the best of the authors' knowledge, this is the first time that a diffusion model has been combined with a 3D sparse representation.
3. **Demonstrate the ability to generate realistic sequences of street-view images** with strong temporal consistency. Experimental results show that the model outperforms existing methods in overall video quality and inter-frame consistency, and that it can be applied to cross-view urban scene generation.

### Method overview

The Sat2Scene method is divided into three main stages:

1. **Generation stage**: Use a 3D sparse diffusion model to color the foreground point cloud, and use a 2D diffusion model to generate a background sky panorama.
2. **Feature extraction stage**: Extract foreground features from the initially colored scene through a 3D encoder.
3. **Rendering stage**: Use neural rendering techniques to generate multi-view-consistent images for a given camera pose.

Through these stages, Sat2Scene can efficiently process large-scale outdoor scenes and generate high-quality, consistent images.
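The three stages above can be illustrated with a toy data-flow sketch. This is not the authors' implementation: every function name and parameter below is hypothetical, the diffusion model is replaced by random colors, the 3D encoder by per-voxel averaging, and the neural renderer by a simple ray march. The sketch only shows how a sparse representation (storing occupied voxels only) ties the generated colors to the geometry, so that any ray hitting the same voxel sees the same feature.

```python
import math
import random


def voxelize(points, voxel_size):
    """Sparse grid: map each occupied voxel to the indices of its points."""
    grid = {}
    for i, (x, y, z) in enumerate(points):
        key = (math.floor(x / voxel_size),
               math.floor(y / voxel_size),
               math.floor(z / voxel_size))
        grid.setdefault(key, []).append(i)
    return grid


def generate_colors(points, seed=0):
    """Stage 1 stand-in (assumption): random RGB per point.
    The paper uses a 3D sparse diffusion model here instead."""
    rng = random.Random(seed)
    return [tuple(rng.random() for _ in range(3)) for _ in points]


def encode_features(grid, colors):
    """Stage 2 stand-in (assumption): mean color per occupied voxel.
    The paper uses a learned 3D encoder here instead."""
    feats = {}
    for key, idxs in grid.items():
        n = len(idxs)
        feats[key] = tuple(sum(colors[i][c] for i in idxs) / n for c in range(3))
    return feats


def render(features, voxel_size, origin, direction, max_dist=50.0, step=0.25):
    """Stage 3 stand-in (assumption): march along a ray and return the
    feature of the first occupied voxel, or None if nothing is hit
    (the region a sky panorama would fill)."""
    t = 0.0
    while t <= max_dist:
        p = tuple(o + t * d for o, d in zip(origin, direction))
        key = tuple(math.floor(c / voxel_size) for c in p)
        if key in features:
            return features[key]
        t += step
    return None


# Toy foreground point cloud: two points share a voxel, one is isolated.
points = [(0.2, 0.1, 5.0), (0.4, 0.3, 5.2), (3.0, 0.0, 9.0)]
grid = voxelize(points, voxel_size=1.0)
colors = generate_colors(points)
features = encode_features(grid, colors)

# A ray looking along +z hits the shared voxel; a ray along +y hits nothing.
pixel = render(features, 1.0, origin=(0.5, 0.5, 0.0), direction=(0.0, 0.0, 1.0))
sky = render(features, 1.0, origin=(0.5, 0.5, 0.0), direction=(0.0, 1.0, 0.0))
```

Because the features live on the geometry rather than in image space, two different camera rays that reach the same voxel read the same value, which is the intuition behind the multi-view consistency described above.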