Abstract: We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion
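
The mask-extraction step described in the abstract (aggregate cross-attention maps across layers and time steps, then threshold) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `aggregate_attention` and `threshold_mask`, the nearest-neighbour upsampling, the min-max normalisation, and the 0.5 threshold are all assumptions made for illustration, and the edge-aware refinement step is omitted.

```python
import numpy as np

def aggregate_attention(attn_maps, out_size):
    """Average the cross-attention maps of one object token.

    attn_maps: list of square 2D arrays, one per (layer, time step) pair,
               possibly at different resolutions (assumed layout).
    out_size:  side length of the output map; assumed to be a multiple
               of every input map's side length.
    """
    acc = np.zeros((out_size, out_size), dtype=np.float64)
    for m in attn_maps:
        m = np.asarray(m, dtype=np.float64)
        scale = out_size // m.shape[0]
        # Nearest-neighbour upsampling: repeat each cell in a scale x scale block.
        acc += np.kron(m, np.ones((scale, scale)))
    acc /= len(attn_maps)
    # Min-max normalise to [0, 1] so one threshold works across objects.
    lo, hi = acc.min(), acc.max()
    return (acc - lo) / (hi - lo) if hi > lo else np.zeros_like(acc)

def threshold_mask(attn, threshold=0.5):
    """Binarise the aggregated map (the threshold value is an assumption)."""
    return attn > threshold

# Toy example: two maps that both attend to the right half of the canvas.
coarse = np.array([[0.0, 1.0], [0.0, 1.0]])
fine = np.zeros((4, 4))
fine[:, 2:] = 0.8
attn = aggregate_attention([coarse, fine], out_size=4)
mask = threshold_mask(attn)
```

In a real pipeline the input maps would be read from the diffusion model's cross-attention layers for the token that names the object; how those maps are collected depends on the model implementation and is not shown here.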

What problem does this paper attempt to address?

This paper addresses the performance degradation of large-vocabulary instance segmentation caused by data scarcity, especially for rare and novel classes. The authors point out that manually annotating large-scale instance segmentation datasets is extremely time-consuming and costly, particularly for complex visual scenes containing many different classes. The scarcity is more severe in natural data distributions, which typically contain rare classes with few samples and out-of-distribution novel classes. As a result, existing instance segmentation models perform poorly in long-tailed and open-vocabulary scenarios. To address this, the authors propose MosaicFusion, a diffusion-based data augmentation technique that generates a large amount of synthetic annotated data to improve detection and segmentation of rare and novel classes.

The core contributions of MosaicFusion are as follows:

1. **No Additional Training Required**: MosaicFusion is a data augmentation pipeline that needs no additional training. It generates images and their instance masks simultaneously, without relying on off-the-shelf object detection or segmentation models to annotate the data afterwards.
2. **Customized Object Generation**: The method can generate multiple customized objects at specific positions within a single image. The authors study both single-object and multi-object generation and find that multi-object images are more beneficial than single-object images.
3. **Significantly Improved Performance**: Extensive experiments show that MosaicFusion significantly improves the performance of existing object detectors and instance segmenters, especially on rare and unseen classes.

Through these contributions, MosaicFusion provides an effective way to tackle the data scarcity problem in large-vocabulary instance segmentation.
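
The idea of placing several objects at specific positions on one canvas can be sketched as a simple layout step that assigns one text prompt to each region. The sketch below is only an illustration under stated assumptions: the `Region` class, the `split_canvas` helper, the regular grid layout, and the prompt template are hypothetical and are not taken from the paper, which may use a different region layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    """One canvas region and the text prompt that conditions it (hypothetical)."""
    prompt: str
    top: int
    left: int
    height: int
    width: int

def split_canvas(height, width, prompts, rows, cols):
    """Split a height x width canvas into a rows x cols grid, one prompt per cell."""
    if len(prompts) != rows * cols:
        raise ValueError("need exactly rows * cols prompts")
    regions = []
    for idx, prompt in enumerate(prompts):
        r, c = divmod(idx, cols)
        # Integer boundaries that tile the canvas exactly, even when the
        # canvas size is not divisible by the grid size.
        top, bottom = r * height // rows, (r + 1) * height // rows
        left, right = c * width // cols, (c + 1) * width // cols
        regions.append(Region(prompt, top, left, bottom - top, right - left))
    return regions

# Four object categories on a 512 x 512 canvas (the prompt template is an assumption).
categories = ["cat", "teapot", "kayak", "armchair"]
prompts = [f"a photo of a single {name}" for name in categories]
regions = split_canvas(512, 512, prompts, rows=2, cols=2)
```

Each `Region` would then tell the generator which prompt conditions which part of the canvas during the single diffusion pass; that generation step is model-specific and is left out of the sketch.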

MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

Open-vocabulary Object Segmentation with Diffusion Models

Diff-Mosaic: Augmenting Realistic Representations in Infrared Small Target Detection via Diffusion Prior

A Mamba-Diffusion Framework for Multimodal Remote Sensing Image Semantic Segmentation

Diffusion Models for Open-Vocabulary Segmentation

MaskDiffusion: Exploiting Pre-Trained Diffusion Models for Semantic Segmentation

Select-Mosaic: Data Augmentation Method for Dense Small Object Scenes

Open-Vocabulary 3D Semantic Segmentation with Text-to-Image Diffusion Models

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

A Simple Background Augmentation Method for Object Detection with Diffusion Model

MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

Saliency information and mosaic based data augmentation method for densely occluded object recognition

Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation

P-MSDiff: Parallel Multi-Scale Diffusion for Remote Sensing Image Segmentation