Enhanced Generative Data Augmentation for Semantic Segmentation via Stronger Guidance

Quang-Huy Che, Duc-Tri Le, Vinh-Tiep Nguyen
2024-09-12
Abstract: Data augmentation is a widely used technique for creating training data for tasks that require labeled data, such as semantic segmentation. This method benefits pixel-wise annotation tasks requiring much effort and intensive labor. Traditional data augmentation methods involve simple transformations like rotations and flips to create new images from existing ones. However, these new images may lack diversity along the main semantic axes in the data and not change high-level semantic properties. To address this issue, generative models have emerged as an effective solution for augmenting data by generating synthetic images. Controllable generative models offer a way to augment data for semantic segmentation tasks using a prompt and visual reference from the original image. However, using these models directly presents challenges, such as creating an effective prompt and visual reference to generate a synthetic image that accurately reflects the content and structure of the original. In this work, we introduce an effective data augmentation method for semantic segmentation using the Controllable Diffusion Model. Our proposed method includes efficient prompt generation using Class-Prompt Appending and Visual Prior Combination to enhance attention to labeled classes in real images. These techniques allow us to generate images that accurately depict segmented classes in the real image. In addition, we employ the class balancing algorithm to ensure efficiency when merging the synthetic and original images to generate balanced data for the training dataset. We evaluated our method on the PASCAL VOC datasets and found it highly effective for synthesizing images in semantic segmentation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to improve semantic segmentation performance through data augmentation while avoiding the time and cost of annotating new data. Traditional augmentation methods (simple transformations such as rotation and flipping) cannot produce images that are diverse or that change high-level semantic attributes. Generative models can synthesize new images, but applying them directly is difficult because the output must accurately reflect the content and structure of the original image.

To solve these problems, the authors propose a data augmentation method based on the Controllable Diffusion Model that generates high-quality synthetic images to supplement the original dataset. The method includes the following key steps:

1. **Text Prompt Construction**: Build a more accurate text prompt by combining the description generated by an image captioning model (such as BLIP-2) with the class labels present in the image.

   \[ P^*_i = "P^g_i; P^c_i" \]

2. **Visual Prior Combination**: Combine the visual prior of the image (such as the result of line art edge detection) with the segmentation map, so that the layout of the generated image is clear and the label information is retained.

   \[ V^*_i = \omega_1 V I_i + \omega_2 V S_i \]

3. **Class Balancing Algorithm**: Distribute the generated synthetic images evenly across classes to prevent over-representation of certain classes.

4. **No Post-filtering**: Use the generated images for training directly, which demonstrates the effectiveness of the method on its own. Filters can still be integrated when necessary to verify compatibility.

With these improvements, the method can significantly improve the performance of semantic segmentation models without increasing the cost of manual annotation, and it performs especially well on small-sample datasets.
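The prompt construction and visual prior steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_prompt` and `combine_priors` are hypothetical, joining class names with a comma is an assumption (the formula only shows a `;` between the caption and the class part), the weights of 0.5 are placeholder values, and clipping the result to [0, 1] is an added safeguard. Plain nested lists stand in for image arrays so the sketch has no dependencies.

```python
from typing import List, Sequence

# Grayscale map with values in [0, 1]; a stand-in for a real image array.
Image = List[List[float]]


def build_prompt(caption: str, class_names: Sequence[str]) -> str:
    """Append the class labels to the generated caption, separated by ';'.

    Joining the class names with ', ' is an assumption of this sketch.
    """
    class_part = ", ".join(class_names)
    return f"{caption}; {class_part}"


def combine_priors(line_art: Image, seg_map: Image, w1: float, w2: float) -> Image:
    """Pixel-wise weighted sum of two visual priors: w1 * line_art + w2 * seg_map.

    Clipping to [0, 1] is an assumption added to keep the output a valid image.
    """
    same_rows = len(line_art) == len(seg_map)
    if not same_rows or any(len(a) != len(b) for a, b in zip(line_art, seg_map)):
        raise ValueError("line_art and seg_map must have the same shape")
    return [
        [min(1.0, max(0.0, w1 * a + w2 * b)) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(line_art, seg_map)
    ]


# Hypothetical inputs: a caption from a captioning model and two 2x2 prior maps.
prompt = build_prompt("a dog sitting on a sofa", ["dog", "sofa"])
print(prompt)  # a dog sitting on a sofa; dog, sofa

combined = combine_priors(
    line_art=[[0.0, 1.0], [0.5, 0.0]],
    seg_map=[[1.0, 1.0], [0.0, 0.5]],
    w1=0.5,
    w2=0.5,
)
print(combined)  # [[0.5, 1.0], [0.25, 0.25]]
```

In a real pipeline the combined map would be passed as the conditioning image to the controllable diffusion model together with the prompt; how the weights are chosen is not stated in this summary.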
Experimental results show that after combining the augmented data, the performance of multiple semantic segmentation models (such as DeepLabV3+, PSPNet, and Mask2Former) is significantly improved.

### Key Formula Summary

- Text Prompt Construction Formula: \[ P^*_i = "P^g_i; P^c_i" \]
- Visual Prior Combination Formula: \[ V^*_i = \omega_1 V I_i + \omega_2 V S_i \]
- Class Balancing Algorithm Output Formula: \[ D_{\text{final}} = D_{\text{gen}} \cup D_{\text{origin}} \]

These techniques work together so that the generated synthetic images are not only visually close to real images but also more reasonable and accurate in terms of class distribution and semantic information.
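The class balancing step and the final union of generated and original data can be illustrated with a small sketch. This summary does not spell out the balancing procedure, so the code below is not the authors' algorithm; it is one plausible greedy heuristic, assumed for illustration, that repeatedly picks a source image containing the currently rarest class. The function name `pick_images_to_augment`, the `budget` parameter, and the file names are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def pick_images_to_augment(labels: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily choose `budget` source images to generate synthetic variants from.

    Assumed heuristic (not taken from the paper): at each step, find the class
    with the lowest image-level count and pick an image that contains it,
    preferring the image whose classes have the lowest total count.
    An image may be chosen more than once.
    """
    counts: Counter = Counter()
    for classes in labels.values():
        counts.update(classes)
    if not counts:
        return []

    chosen: List[str] = []
    for _ in range(budget):
        # Sorting first makes ties deterministic (alphabetical order).
        rarest = min(sorted(counts), key=lambda c: counts[c])
        candidates = sorted(name for name, cls in labels.items() if rarest in cls)
        best = min(candidates, key=lambda n: sum(counts[c] for c in labels[n]))
        chosen.append(best)
        # The synthetic copy inherits the labels of its source image.
        counts.update(labels[best])
    return chosen


# Hypothetical image-level labels for a tiny original dataset.
origin = {
    "a.jpg": {"person"},
    "b.jpg": {"person", "dog"},
    "c.jpg": {"person"},
    "d.jpg": {"cat"},
}
sources = pick_images_to_augment(origin, budget=2)
print(sources)  # ['d.jpg', 'b.jpg']

# D_final = D_gen ∪ D_origin: the training set is the union of both parts.
generated = [f"syn_{i}_{name}" for i, name in enumerate(sources)]
final_dataset = list(origin) + generated
print(len(final_dataset))  # 6
```

Here "person" appears in three of the four original images, so the heuristic spends its budget on the images containing "cat" and "dog" instead.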