Jialin Luo, Yuanzhi Wang, Ziqi Gu, Yide Qiu, Shuaizhen Yao, Fuyun Wang, Chunyan Xu, Wenhua Zhang, Dan Wang, Zhen Cui
Abstract: Recently, the diffusion-based generative paradigm has achieved impressive general image generation capabilities with text prompts due to its accurate distribution modeling and stable training process. However, generating diverse remote sensing (RS) images that are tremendously different from general images in terms of scale and perspective remains a formidable challenge due to the lack of a comprehensive remote sensing image generation dataset with various modalities, ground sample distances (GSD), and scenes. In this paper, we propose a Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and benchmark for text-to-image generation in diverse remote sensing scenarios. Specifically, we first collect nine publicly available RS datasets and conduct standardization for all samples. To bridge RS images to textual semantic information, we utilize a large-scale pretrained vision-language model to automatically output text prompts and perform hand-crafted rectification, resulting in information-rich text-image pairs (including multi-modal images). In particular, we design some methods to obtain the images with different GSD and various environments (e.g., low-light, foggy) in a single sample. With extensive manual screening and refining annotations, we ultimately obtain a MMM-RS dataset that comprises approximately 2.1 million text-image pairs. Extensive experimental results verify that our proposed MMM-RS dataset allows off-the-shelf diffusion models to generate diverse RS images across various modalities, scenes, weather conditions, and GSD. The dataset is available at https://github.com/ljl5261/MMM-RS.
Image and Video Processing, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve
This paper addresses the challenge of generating diverse remote sensing (RS) images. Existing RS image generation datasets lack diversity in modality, ground sample distance (GSD), and scene, so models trained on them cannot generate high-quality, diverse RS images. To solve this problem, the authors propose the Multi-modal, Multi-GSD, Multi-scene Remote Sensing (MMM-RS) dataset and use it as a benchmark for text-to-image generation.
### Main Contributions
1. **Construction of a Large-Scale Multi-Modal, Multi-GSD, Multi-Scene Remote Sensing Dataset**:
- The authors collected 9 publicly available remote sensing datasets and standardized all samples, ultimately constructing a dataset of approximately 2.1 million information-rich text-image pairs.
- Each sample in the dataset includes not only multi-modal images (such as RGB, SAR, NIR images) but also detailed, information-rich text prompts describing the image content, GSD level, weather type, and satellite type.
2. **Design of GSD Sample Extraction Strategy**:
- To provide samples with different GSD levels, the authors designed a GSD sample extraction strategy to extract images with different GSD levels from each sample and defined text prompts describing different GSD levels.
3. **Synthesis of Multi-Scene Remote Sensing Images**:
- Due to the lack of real-world multi-scene samples, the authors selected some RGB samples and used existing techniques to synthesize samples of different scenes, including fog, snow, and low-light environments.
4. **Validation of the Dataset's Effectiveness**:
- The authors fine-tuned pre-trained text-to-image diffusion models such as Stable Diffusion using the proposed MMM-RS dataset and validated the dataset's effectiveness through extensive quantitative and qualitative experiments.
- Experimental results show that models trained with the MMM-RS dataset can generate high-quality remote sensing images that are multi-modal, multi-GSD, and multi-scene.
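
The multi-GSD idea and the structured text prompts can be illustrated with a minimal Python sketch. It is not the authors' extraction strategy, which this summary does not detail. It assumes that coarser-GSD variants can be approximated by block-averaging a single-band image (averaging `k x k` pixels makes each output pixel cover `k` times the ground distance), and the names `downsample`, `gsd_variants`, and `build_prompt`, as well as the prompt wording, are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]  # single-band image stored as rows of pixel values


def downsample(image: Image, factor: int) -> Image:
    """Block-average the image; each output pixel covers factor x factor input pixels."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    rows = len(image) // factor
    cols = len(image[0]) // factor
    out: Image = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [
                image[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def gsd_variants(
    image: Image, base_gsd_m: float, factors: Sequence[int] = (1, 2, 4)
) -> List[Tuple[Image, float]]:
    """Return (image, GSD in metres) pairs; downsampling by k makes the GSD k times coarser."""
    return [(downsample(image, k), base_gsd_m * k) for k in factors]


def build_prompt(content: str, gsd_m: float, weather: str, satellite: str) -> str:
    """Compose a prompt from the four fields the summary lists (hypothetical wording)."""
    return (
        f"A satellite image of {content}, GSD {gsd_m:g} m, "
        f"{weather} weather, captured by {satellite}."
    )


if __name__ == "__main__":
    # Toy 4x4 image with an assumed base GSD of 0.5 m per pixel.
    toy = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    for img, gsd in gsd_variants(toy, base_gsd_m=0.5):
        print(len(img), "x", len(img[0]), build_prompt("an airport", gsd, "clear", "an example satellite"))
```

Block averaging is only one plausible way to simulate a coarser sensor; the point of the sketch is that every derived image keeps a GSD value that can be written into its prompt.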
### Experimental Results
1. **Quantitative Comparison**:
- The authors compared different generation models quantitatively using the Fréchet Inception Distance (FID) and Inception Score (IS). Models trained with the MMM-RS dataset score better than the other models on both metrics (lower FID and higher IS are better), indicating higher image quality and diversity.
2. **Qualitative Comparison**:
- By generating remote sensing images across multiple scenes and GSDs, the authors demonstrated the model's generation capabilities under different weather conditions and resolutions. The results show that models trained with the MMM-RS dataset can accurately generate images under challenging conditions such as snow, fog, and low light, and that images generated at different GSD levels show clear differences in resolution.
3. **Cross-Modal Generation Experiments**:
- The authors further conducted cross-modal generation experiments using ControlNet to verify the effectiveness and rationality of multi-modal data. The experimental results show that the model can effectively perform cross-modal generation between RGB and SAR, and RGB and NIR.
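
For reference, the two metrics used in the quantitative comparison have the following standard definitions. These are the general formulas, not details taken from the paper, whose exact evaluation setup this summary does not state. The symbols are introduced here for illustration: $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features of real and generated images, $p_g$ is the distribution of generated images, $p(y \mid x)$ is the Inception classifier's label distribution for image $x$, and $p(y)$ is its marginal over generated images.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

$$
\mathrm{IS} = \exp\left( \mathbb{E}_{x \sim p_g} \left[ D_{\mathrm{KL}}\left( p(y \mid x) \,\Vert\, p(y) \right) \right] \right)
$$

FID measures the distance between the real and generated feature distributions, while IS rewards images that are individually confidently classified and collectively varied.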
In summary, this paper significantly enhances the ability of generation models to produce high-quality, multi-modal, multi-GSD, and multi-scene remote sensing images by constructing a comprehensive and diverse remote sensing dataset, providing important resources and support for research in the field of remote sensing.