Abstract: Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However, challenges arise from significant view changes and scene scale. Previous efforts mainly focused on image or video generation and did not explore how scene generation adapts to arbitrary views. Existing 3D generation works either operate at the object level or struggle to utilize the geometry obtained from satellite imagery. To overcome these limitations, we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques. Specifically, our approach first uses a 3D diffusion model to generate texture colors at the point level for a given geometry; the result is then transformed into a scene representation in a feed-forward manner. The representation can be used to render arbitrary views with both high single-frame quality and inter-frame consistency. Experiments on two city-scale datasets show that our model generates photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is **generating realistic 3D urban scenes directly from satellite images for seamless integration into applications such as games, movies, and map services**. The main challenges of this task are:

1. **Significant differences in viewing angles**: Satellite images and ground-level street-view images are captured from very different viewing angles.
2. **Large-scale scenes**: The urban scenes to be processed are very large, which leads to heavy consumption of computing resources.
3. **Multi-view consistency**: Existing methods have difficulty maintaining consistency between frames when generating images from different viewing angles.

To solve these problems, the authors propose a new architecture, Sat2Scene, which combines diffusion models with neural rendering techniques to generate realistic 3D urban scenes directly from satellite images and to keep images rendered from any viewing angle consistent and of high quality.

### Main contributions

1. **Propose the diffusion-model-based framework Sat2Scene**, which can generate 3D urban scenes directly from satellite images.
2. **Introduce the diffusion model with sparse representation** for generating scene features closely associated with geometry in 3D space, ensuring consistent images are generated from any viewing angle. To the best of the authors' knowledge, this is the first time that a diffusion model has been combined with 3D sparse representation.
3. **Demonstrate the ability to generate realistic sequences of street-view images** with strong temporal consistency. Experimental results show that this model is superior to existing methods in terms of overall video quality and inter-frame consistency, and can be applied to cross-view urban scene generation.

### Method overview

The Sat2Scene method is divided into three main steps:

1. **Generation stage**: Use a 3D sparse diffusion model to color the foreground point cloud and use a 2D diffusion model to generate a background sky panorama.
2. **Feature extraction stage**: Extract foreground features from the initially colored scene through a 3D encoder.
3. **Rendering stage**: Use neural rendering techniques to generate images with multi-view consistency according to a given pose.

Through these steps, Sat2Scene can efficiently process large-scale outdoor scenes and generate high-quality and consistent images.
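
Two ideas in the summary above can be illustrated with a small toy example: a sparse 3D representation stores data only where geometry exists, and colors attached to 3D geometry stay identical when the scene is viewed from different poses. The sketch below is not the authors' implementation. It contains no diffusion model, no 3D encoder, and no neural renderer, and it ignores occlusion; the function names (`voxelize`, `project`, `render_view`), the dictionary-based voxel grid, the voxel size, and the unrotated pinhole camera are all assumptions made for illustration.

```python
from collections import defaultdict
from math import floor


def voxelize(points, colors, voxel_size=1.0):
    """Store colored points sparsely: only occupied voxels get an entry."""
    buckets = defaultdict(list)
    for (x, y, z), rgb in zip(points, colors):
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        buckets[key].append(rgb)
    # Average the colors of all points that fall into the same voxel.
    return {
        key: tuple(sum(rgb[i] for rgb in rgbs) / len(rgbs) for i in range(3))
        for key, rgbs in buckets.items()
    }


def project(point, cam_pos, focal=1.0):
    """Pinhole projection for an unrotated camera at cam_pos looking along +z."""
    x, y, z = (p - c for p, c in zip(point, cam_pos))
    if z <= 0:
        return None  # the point lies behind the camera
    return (focal * x / z, focal * y / z)


def render_view(grid, cam_pos, voxel_size=1.0):
    """Map each voxel in front of the camera to (image coordinates, color)."""
    view = {}
    for key, rgb in grid.items():
        center = tuple((k + 0.5) * voxel_size for k in key)
        pixel = project(center, cam_pos)
        if pixel is not None:
            view[key] = (pixel, rgb)
    return view


# Toy "scene": three colored points; the first two share a voxel.
points = [(0.2, 0.1, 5.3), (0.4, 0.3, 5.1), (3.7, 0.0, 9.9)]
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
grid = voxelize(points, colors)

# Two hypothetical camera positions, one unit apart along x.
view_a = render_view(grid, cam_pos=(0.0, 0.0, 0.0))
view_b = render_view(grid, cam_pos=(1.0, 0.0, 0.0))
for key in grid:
    (pixel_a, rgb_a), (pixel_b, rgb_b) = view_a[key], view_b[key]
    print(key, "moves from", pixel_a, "to", pixel_b, "same color:", rgb_a == rgb_b)
```

In this toy setup each voxel lands at different image coordinates in the two views but keeps its color, because the color is stored once in 3D rather than generated separately per frame. Memory also scales with the number of occupied voxels rather than with the volume of the scene's bounding box, which is the general motivation for sparse representations at city scale.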

Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion

DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis

Urban Scene Diffusion through Semantic Occupancy Map

Diffusion-based Generation, Optimization, and Planning in 3D Scenes

SemCity: Semantic Scene Generation with Triplane Diffusion

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion

SceneDreamer: Unbounded 3D Scene Generation from 2D Image Collections

Denoising Diffusion via Image-Based Rendering

Enhanced 3D Generation by 2D Editing

DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis

Pyramid Diffusion for Fine 3D Large Scene Generation

Baking Gaussian Splatting into Diffusion Denoiser for Fast and Scalable Single-stage Image-to-3D Generation

DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation

SSEditor: Controllable Mask-to-Scene Generation with Diffusion Model

3D Scene Diffusion Guidance using Scene Graphs

LT3SD: Latent Trees for 3D Scene Diffusion

Urban Architect: Steerable 3D Urban Scene Generation with Layout Prior

OccScene: Semantic Occupancy-based Cross-task Mutual Learning for 3D Scene Generation

DiffusionSat: A Generative Foundation Model for Satellite Imagery

Wonderland: Navigating 3D Scenes from a Single Image