Abstract: Diffusion-based generative models have recently achieved impressive general-purpose text-to-image generation, owing to their accurate distribution modeling and stable training. However, generating diverse remote sensing (RS) images, which differ substantially from general images in scale and perspective, remains a formidable challenge because no comprehensive RS image generation dataset covers various modalities, ground sample distances (GSDs), and scenes. In this paper, we propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios. Specifically, we first collect nine publicly available RS datasets and standardize all samples. To link RS images to textual semantic information, we use a large-scale pretrained vision-language model to automatically generate text prompts and then rectify them by hand, resulting in information-rich text-image pairs (including multi-modal images). In particular, we design methods to obtain images with different GSDs and under various environments (e.g., low-light, foggy) from a single sample. After extensive manual screening and annotation refinement, we ultimately obtain the MMM-RS dataset, which comprises approximately 2.1 million text-image pairs. Extensive experimental results verify that the proposed MMM-RS dataset allows off-the-shelf diffusion models to generate diverse RS images across various modalities, scenes, weather conditions, and GSDs. The dataset is available at https://github.com/ljl5261/MMM-RS.

MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation
