SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation

Aysim Toker,Marvin Eisenberger,Daniel Cremers,Laura Leal-Taixé
2024-03-25
Abstract: In recent years, semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery. Yet, a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts. In this work, we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. The main idea is to learn the joint data manifold of images and labels, leveraging recent advancements in denoising diffusion probabilistic models. To the best of our knowledge, we are the first to generate both images and corresponding masks for satellite segmentation. We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity. Both aspects are crucial for earth observation data, where semantic classes can vary severely in scale and occurrence frequency. We employ the novel data instances for downstream segmentation, as a form of data augmentation. In our experiments, we provide comparisons to prior works based on discriminative diffusion models or GANs. We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation -- both compared to baselines and when training only on the original data.
Computer Science
What problem does this paper attempt to address?
### Problems the paper attempts to solve

The paper addresses the scarcity and cost of annotated data for semantic segmentation of satellite imagery. Supervised learning on Earth observation data usually requires large amounts of data labeled manually by experts. Obtaining such high-quality labels is costly and time-consuming, so existing datasets often fail to cover all relevant scenarios, especially because semantic classes vary greatly in scale and occurrence frequency.

To address this, the authors propose **SatSynth**, a method based on a generative diffusion model that synthesizes new image-mask pairs to augment the training dataset. According to the paper, the generated pairs show high quality in fine-scale features as well as wide sampling diversity, and adding them to the training data significantly improves downstream semantic segmentation.

### Main contributions

1. **Joint distribution modeling**: For a given Earth observation dataset, the authors propose to use a diffusion model \( G \) to learn the joint data distribution \( p(x, y) \) of the image \( x \) and the label \( y \), with the labels represented in bit space.
2. **Data augmentation**: The generative model \( G \) is used to generate new training instances, which serve as a form of data augmentation for downstream semantic segmentation.
3. **Experimental verification**: Experiments on three satellite benchmark datasets [60, 61, 64] show that integrating the generated samples significantly improves quantitative performance.

### Method overview

- **Problem definition**: The authors consider a dataset \( D \) containing \( N \) satellite images \( x_i \) and their corresponding semantic maps \( y_i \), and assume that this dataset is sampled from an underlying data manifold \( M \). The goal is to perform semantic segmentation by learning \( p(y|x) \).
- **Motivation**: To reduce the need for large amounts of labeled data, the authors propose to approximate the joint distribution \( p(x, y) \) with a generative diffusion model, sample new training pairs \( (x'_i, y'_i) \) from it, and train the semantic segmentation model on these samples together with the original data.
- **Discrete label encoding**: To handle discrete labels, the authors convert the label of each pixel into a binary encoding \( \mathrm{bin}(y_i) \) and concatenate it with the normalized RGB image values to form the input of the generative model.
- **Generation process**: The generative model \( G \) is trained and then used to generate new sample pairs \( (x'_i, y'_i) \) for downstream tasks. In addition, a super-resolution module is introduced to generate higher-resolution images.

### Experimental results

- **Visual quality assessment**: On the iSAID, LoveDA and OpenEarthMap datasets, the generated images outperform the baseline methods in terms of the FID, sFID and IS metrics.
- **Segmentation performance improvement**: On multiple benchmark datasets, augmenting the original training set with the generated data significantly improves the mIoU and F1 scores of semantic segmentation.

In conclusion, the paper addresses the scarcity of labeled Earth observation data with a generative diffusion model and shows that this approach improves semantic segmentation performance.
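
The discrete label encoding described in the method overview can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`num_label_bits`, `pack_pair`, `unpack_pair`), the scaling of both RGB values and bits to [0, 1], and the 0.5 decoding threshold are assumptions made for illustration, and the paper may normalize differently. The sketch only shows how a class-index mask could be turned into per-pixel bit channels, concatenated with the image, and recovered from a generated sample.

```python
import numpy as np


def num_label_bits(num_classes: int) -> int:
    """Number of bits needed to represent class ids 0..num_classes-1."""
    return max(1, (num_classes - 1).bit_length())


def pack_pair(image: np.ndarray, mask: np.ndarray, num_classes: int) -> np.ndarray:
    """Concatenate a normalized RGB image with the binary code of its mask.

    image: (H, W, 3) uint8 array.
    mask:  (H, W) integer array of class ids in [0, num_classes).
    Returns a float32 array of shape (H, W, 3 + n_bits) with values in [0, 1].
    """
    n_bits = num_label_bits(num_classes)
    shifts = np.arange(n_bits, dtype=np.int64)
    # Extract bit k of every pixel label (least significant bit first).
    bits = (mask.astype(np.int64)[..., None] >> shifts) & 1
    rgb = image.astype(np.float32) / 255.0  # assumed normalization
    return np.concatenate([rgb, bits.astype(np.float32)], axis=-1)


def unpack_pair(sample: np.ndarray, num_classes: int):
    """Split a (possibly noisy) sample back into an image and a mask.

    Bit channels are thresholded at 0.5 (an assumed decoding rule).
    Decoded ids are not clamped, so noisy bits may yield ids >= num_classes.
    """
    n_bits = num_label_bits(num_classes)
    shifts = np.arange(n_bits, dtype=np.int64)
    rgb = np.clip(sample[..., :3], 0.0, 1.0)
    image = np.rint(rgb * 255.0).astype(np.uint8)
    bits = (sample[..., 3:3 + n_bits] > 0.5).astype(np.int64)
    mask = (bits << shifts).sum(axis=-1)
    return image, mask
```

Under these assumptions, a generated sample with `3 + n_bits` channels can be split into an image-mask pair with `unpack_pair` and added to the original training set as augmentation data.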