Abstract: Recently, the diffusion-based generative paradigm has achieved impressive general image generation capabilities with text prompts due to its accurate distribution modeling and stable training process. However, generating diverse remote sensing (RS) images that are tremendously different from general images in terms of scale and perspective remains a formidable challenge due to the lack of a comprehensive remote sensing image generation dataset with various modalities, ground sample distances (GSD), and scenes. In this paper, we propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios. Specifically, we first collect nine publicly available RS datasets and conduct standardization for all samples. To bridge RS images to textual semantic information, we utilize a large-scale pretrained vision-language model to automatically output text prompts and perform hand-crafted rectification, resulting in information-rich text-image pairs (including multi-modal images). In particular, we design some methods to obtain the images with different GSD and various environments (e.g., low-light, foggy) in a single sample. With extensive manual screening and refining annotations, we ultimately obtain an MMM-RS dataset that comprises approximately 2.1 million text-image pairs. Extensive experimental results verify that our proposed MMM-RS dataset allows off-the-shelf diffusion models to generate diverse RS images across various modalities, scenes, weather conditions, and GSD. The dataset is available at https://github.com/ljl5261/MMM-RS.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper addresses the challenge of generating diverse remote sensing (RS) images. Existing RS image generation datasets lack diversity in modality, ground sample distance (GSD), and scene, so models trained on them cannot generate high-quality, diverse RS images. To solve this problem, the authors propose a multi-modal, multi-GSD, multi-scene remote sensing dataset (MMM-RS) and use it for text-to-image generation tasks.

### Main Contributions

1. **Construction of a Large-Scale Multi-Modal, Multi-GSD, Multi-Scene Remote Sensing Dataset**:
   - The authors collected 9 publicly available remote sensing datasets and standardized all samples, ultimately constructing a dataset of approximately 2.1 million information-rich text-image pairs.
   - Each sample in the dataset includes not only multi-modal images (such as RGB, SAR, and NIR images) but also a detailed text prompt describing the image content, GSD level, weather type, and satellite type.
2. **Design of GSD Sample Extraction Strategy**:
   - To provide samples at different GSD levels, the authors designed a GSD sample extraction strategy that extracts images with different GSD levels from each sample, and defined text prompts describing those GSD levels.
3. **Synthesis of Multi-Scene Remote Sensing Images**:
   - Because real-world multi-scene samples are scarce, the authors selected some RGB samples and used existing techniques to synthesize samples of different scenes, including fog, snow, and low-light environments.
4. **Validation of the Dataset's Effectiveness**:
   - The authors fine-tuned pre-trained text-to-image diffusion models such as Stable Diffusion on the proposed MMM-RS dataset and validated the dataset's effectiveness through extensive quantitative and qualitative experiments.
   - Experimental results show that models trained with the MMM-RS dataset can generate high-quality remote sensing images that are multi-modal, multi-GSD, and multi-scene.

### Experimental Results

1. **Quantitative Comparison**:
   - The authors compared different generation models quantitatively using the FID and IS metrics. Models trained with the MMM-RS dataset outperform the other models on both metrics, indicating higher image quality and diversity.
2. **Qualitative Comparison**:
   - By generating remote sensing images of multiple scenes and multiple GSDs, the authors demonstrated the model's generation capabilities under different weather conditions and resolutions. Models trained with the MMM-RS dataset can accurately generate images with complex conditions such as snow, fog, and night scenes, and images generated at different GSD levels exhibit significant resolution changes.
3. **Cross-Modal Generation Experiments**:
   - The authors further conducted cross-modal generation experiments using ControlNet to verify the effectiveness and rationality of the multi-modal data. The results show that the model can effectively perform cross-modal generation between RGB and SAR, and between RGB and NIR.

In summary, by constructing a comprehensive and diverse remote sensing dataset, this paper significantly enhances the ability of generation models to produce high-quality, multi-modal, multi-GSD, and multi-scene remote sensing images, providing an important resource for research in the field of remote sensing.
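The summary does not spell out how the GSD sample extraction strategy works. As a rough illustration only, the sketch below assumes one plausible approach: block-averaging an image by an integer factor, which multiplies its GSD by that factor, and attaching a matching prompt fragment. The function names, the factors, and the prompt wording are all hypothetical and are not taken from the paper.

```python
import numpy as np


def downsample_mean(image: np.ndarray, factor: int) -> np.ndarray:
    """Average non-overlapping factor x factor blocks of an H x W x C image."""
    h, w, c = image.shape
    h2, w2 = h // factor, w // factor
    # Crop so both sides divide evenly by the factor, then average each block.
    cropped = image[: h2 * factor, : w2 * factor].astype(np.float64)
    blocks = cropped.reshape(h2, factor, w2, factor, c)
    return blocks.mean(axis=(1, 3))


def gsd_variants(image: np.ndarray, base_gsd_m: float, factors=(1, 2, 4)):
    """Return (image, gsd_in_metres, prompt_fragment) for each factor.

    Assumption: downsampling by `factor` makes each pixel cover `factor`
    times more ground, so the GSD scales linearly with the factor.
    """
    variants = []
    for factor in factors:
        resized = downsample_mean(image, factor)
        gsd = base_gsd_m * factor
        # Hypothetical prompt wording; the dataset's real prompts may differ.
        variants.append((resized, gsd, f"ground sample distance of {gsd:g} m"))
    return variants


# Usage: a random 256 x 256 RGB tile with an assumed native GSD of 0.5 m.
tile = np.random.default_rng(0).random((256, 256, 3))
for resized, gsd, prompt in gsd_variants(tile, base_gsd_m=0.5):
    print(resized.shape, prompt)
```

Block averaging is used here instead of plain subsampling because it roughly mimics a sensor integrating light over a larger ground footprint. A real pipeline might instead crop regions at several scales or use a different resampling filter.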
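The summary says only that "existing techniques" were used to synthesize fog, snow, and low-light scenes, without naming them. The sketch below shows two common textbook approaches as stand-ins: the atmospheric scattering model for fog, I = J·t + A·(1 − t), with a spatially constant transmission t, and a gamma-plus-gain darkening for low light. The function names and parameter values are assumptions for illustration, and snow synthesis is not covered.

```python
import numpy as np


def add_fog(image: np.ndarray, transmission: float = 0.6,
            airlight: float = 0.9) -> np.ndarray:
    """Apply I = J * t + A * (1 - t) to a float image with values in [0, 1].

    Assumption: a single transmission value for the whole image, which is a
    simplification for near-nadir views where scene depth varies little.
    """
    foggy = image * transmission + airlight * (1.0 - transmission)
    return np.clip(foggy, 0.0, 1.0)


def to_low_light(image: np.ndarray, gamma: float = 2.0,
                 gain: float = 0.5) -> np.ndarray:
    """Darken a float image in [0, 1] with a gamma curve followed by a gain."""
    dark = gain * np.power(image, gamma)
    return np.clip(dark, 0.0, 1.0)


# Usage: derive two extra "scenes" from one clean RGB tile.
clean = np.random.default_rng(0).random((64, 64, 3))
foggy = add_fog(clean)
night = to_low_light(clean)
print(clean.mean(), foggy.mean(), night.mean())
```

Lower `transmission` gives denser fog, and larger `gamma` or smaller `gain` gives a darker scene. Each synthesized image would then be paired with a prompt naming the corresponding weather type.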
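For readers unfamiliar with the two metrics named in the quantitative comparison, their standard definitions are given below. These are the general definitions, not formulas quoted from the paper. Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception-v3 features for real and generated images, $p(y \mid x)$ is the Inception class posterior for a generated image $x$, and $p(y)$ is its marginal over generated images.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

$$
\mathrm{IS} = \exp\!\left( \mathbb{E}_{x \sim p_g} \left[ D_{\mathrm{KL}}\!\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right)
$$

A lower FID means the generated feature distribution is closer to the real one, while a higher IS means the generated images are individually confident and collectively varied. "Outperforming on both metrics" therefore means a lower FID and a higher IS.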

MMM-RS: A Multi-modal, Multi-GSD, Multi-scene Remote Sensing Dataset and Benchmark for Text-to-Image Generation
