Abstract: Generative foundation models have advanced large-scale text-driven natural image generation, becoming a prominent research trend across various vertical domains. However, in the remote sensing field, there is still a lack of research on large-scale text-to-image (text2image) generation technology. Existing remote sensing image-text datasets are small in scale and confined to specific geographic areas and scene types. Besides, existing text2image methods have struggled to achieve global-scale, multi-resolution controllable, and unbounded image generation. To address these challenges, this paper presents two key contributions: the Git-10M dataset and the Text2Earth foundation model. Git-10M is a global-scale image-text dataset comprising 10 million image-text pairs, 5 times larger than the previous largest one. The dataset covers a wide range of geographic scenes and contains resolution information, significantly surpassing existing datasets in both size and diversity. Building on Git-10M, we propose Text2Earth, a 1.3 billion parameter generative foundation model based on the diffusion framework to model global-scale remote sensing scenes. Text2Earth integrates a resolution guidance mechanism, enabling users to specify image resolutions. A dynamic condition adaptation strategy is proposed for training and inference to improve image quality. Text2Earth excels in zero-shot text2image generation and demonstrates robust generalization and flexibility across multiple tasks, including unbounded scene construction, image editing, and cross-modal image generation. This robust capability surpasses previous models restricted to the basic fixed size and limited scene types.
On the previous benchmark dataset, Text2Earth outperforms previous models with an improvement of +26.23 FID and +20.95% Zero-shot Cls-OA metric. Our project page is https://chen-yang-liu.github.io/Text2Earth

What problem does this paper attempt to address?

The paper addresses the lack of research on large-scale text-driven image generation (text-to-image generation) in remote sensing. Existing remote sensing image-text datasets are small and limited to specific geographic areas and scene types, so they cannot support global-scale, multi-resolution controllable, and unbounded image generation. Specifically, the paper points out two main challenges:

1. **Dataset Limitations**:
   - Existing remote sensing image-text datasets are relatively small and lack diversity. For example, datasets such as UCM and RSICD are usually limited to specific geographic areas and scene types.
   - These datasets usually consist of simple image-text pairs and lack crucial resolution information, which limits flexibility in real-world scenarios where a specific resolution is required.
2. **Model Limitations**:
   - Previous models have used techniques such as Generative Adversarial Networks (GANs) and Transformers to improve generation quality, but they struggle to fully capture the complex, structured geographic features of remote sensing scenes at a global scale.
   - These models ignore the resolution characteristics of remote sensing images, so the resolution of a generated image is uncertain rather than specified by the user.
   - These models are limited to text-driven image generation at a basic fixed size and lack the generalization ability expected of a foundation model across multiple text-driven generation tasks (such as unbounded scene construction and image editing).

To solve these problems, the paper makes two main contributions:

1. **Git-10M Dataset**:
   - Git-10M is a global-scale image-text dataset containing 10 million image-text pairs, 5 times larger than the previous largest dataset.
   - The dataset covers a wide range of geographic scenes and contains rich metadata, such as image resolution and geographic location, significantly surpassing existing datasets in scale and diversity.
2. **Text2Earth Foundation Model**:
   - Text2Earth is a generative foundation model based on the diffusion framework, with 1.3 billion parameters, used for modeling remote sensing scenes at a global scale.
   - The model introduces a resolution guidance mechanism, enabling users to specify the image resolution, and proposes a dynamic condition adaptation strategy to improve image generation quality.

Through these contributions, the paper aims to advance text-driven remote sensing image generation towards global-scale scene generation, multi-resolution controllability, and unbounded large-scale image synthesis.
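To make the role of the resolution metadata concrete, the following is a minimal sketch of how an image-text pair carrying resolution and location fields might be represented and filtered. It is not the Git-10M schema or the Text2Earth code: the class `ImageTextPair`, its field names, the metres-per-pixel unit, the helpers `filter_by_resolution` and `to_condition`, and the sample records are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass(frozen=True)
class ImageTextPair:
    """Hypothetical record layout; NOT the actual Git-10M schema."""
    image_path: str
    caption: str
    resolution_m: float          # assumed unit: metres per pixel
    lat: Optional[float] = None  # geographic location, when available
    lon: Optional[float] = None


def filter_by_resolution(
    pairs: List[ImageTextPair], min_m: float, max_m: float
) -> List[ImageTextPair]:
    """Keep only the pairs whose resolution lies in [min_m, max_m]."""
    return [p for p in pairs if min_m <= p.resolution_m <= max_m]


def to_condition(pair: ImageTextPair) -> Dict[str, Union[str, float]]:
    """Bundle text and resolution as two separate conditioning inputs.

    Assumption: a resolution-aware generator receives the resolution
    alongside the caption rather than inferring it from the text.
    """
    return {"text": pair.caption, "resolution_m": pair.resolution_m}


# Made-up sample records, for illustration only.
SAMPLE_PAIRS = [
    ImageTextPair("tiles/a.jpg", "An airport with two runways.", 0.5, 39.5, 116.4),
    ImageTextPair("tiles/b.jpg", "Farmland crossed by a river.", 4.0),
    ImageTextPair("tiles/c.jpg", "A dense residential area.", 1.0),
]

if __name__ == "__main__":
    for pair in filter_by_resolution(SAMPLE_PAIRS, 0.5, 2.0):
        print(to_condition(pair))
```

Keeping resolution as an explicit numeric field, rather than folding it into the caption, is what would let a user request a specific resolution at generation time; how Text2Earth actually encodes that value is not described in this summary.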

Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model
