MosaicFusion: Diffusion Models as Data Augmenters for Large Vocabulary Instance Segmentation

Jiahao Xie, Wei Li, Xiangtai Li, Ziwei Liu, Yew Soon Ong, Chen Change Loy
2024-10-04
Abstract: We present MosaicFusion, a simple yet effective diffusion-based data augmentation approach for large vocabulary instance segmentation. Our method is training-free and does not rely on any label supervision. Two key designs enable us to employ an off-the-shelf text-to-image diffusion model as a useful dataset generator for object instances and mask annotations. First, we divide an image canvas into several regions and perform a single round of diffusion process to generate multiple instances simultaneously, conditioning on different text prompts. Second, we obtain corresponding instance masks by aggregating cross-attention maps associated with object prompts across layers and diffusion time steps, followed by simple thresholding and edge-aware refinement processing. Without bells and whistles, our MosaicFusion can produce a significant amount of synthetic labeled data for both rare and novel categories. Experimental results on the challenging LVIS long-tailed and open-vocabulary benchmarks demonstrate that MosaicFusion can significantly improve the performance of existing instance segmentation models, especially for rare and novel categories. Code: https://github.com/Jiahao000/MosaicFusion
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the performance degradation of large vocabulary instance segmentation caused by data scarcity, especially for rare and novel classes. The authors point out that manually annotating large-scale instance segmentation datasets is extremely time-consuming and costly, particularly for complex visual scenes containing many different classes. The scarcity is more severe under natural data distributions, which typically contain rare classes with few samples and out-of-distribution novel classes; as a result, existing instance segmentation models perform poorly in long-tailed and open-vocabulary scenarios. To address this, the authors propose MosaicFusion, a diffusion-based data augmentation technique that generates a large amount of synthetic annotated data to improve detection and segmentation of rare and novel classes. The core contributions of MosaicFusion are as follows:

1. **No Additional Training Required**: MosaicFusion is a data augmentation pipeline that requires no additional training. It generates images and the corresponding instance masks simultaneously, without relying on off-the-shelf object detection or segmentation models to annotate the data afterwards.
2. **Customized Object Generation**: The method can generate multiple customized objects at specified positions in a single image. The authors study both single-object and multi-object generation and find that multi-object images are more beneficial than single-object ones.
3. **Significantly Improved Performance**: Extensive experiments show that MosaicFusion significantly improves the performance of existing object detectors and instance segmenters, especially on rare and unseen classes.
Through these contributions, MosaicFusion provides an effective way to address data scarcity in large vocabulary instance segmentation.
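
The two designs named in the abstract (splitting the canvas into regions, then turning aggregated cross-attention maps into masks) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `split_canvas` and `mask_from_attention` are hypothetical, the equal-sized grid is an assumed way to divide the canvas, plain averaging is an assumed form of aggregation, and the 0.5 threshold is an assumed value. The attention maps are taken as given arrays rather than extracted from a real diffusion model, and the edge-aware refinement step is omitted.

```python
import numpy as np


def split_canvas(height, width, rows, cols):
    """Divide an H x W canvas into a rows x cols grid of boxes.

    Assumption: an equal-sized grid. Each box is (top, left, bottom, right).
    """
    ys = np.linspace(0, height, rows + 1).astype(int)
    xs = np.linspace(0, width, cols + 1).astype(int)
    return [
        (int(ys[r]), int(xs[c]), int(ys[r + 1]), int(xs[c + 1]))
        for r in range(rows)
        for c in range(cols)
    ]


def mask_from_attention(attn_maps, box, canvas_shape, threshold=0.5):
    """Turn a stack of attention maps for one object prompt into a binary mask.

    attn_maps: array of shape (num_maps, H, W), one map per (layer, time step)
               pair, assumed to be already resized to the canvas resolution.
    box:       (top, left, bottom, right) region assigned to this prompt.
    threshold: assumed cut-off applied after min-max normalization.
    """
    top, left, bottom, right = box
    # Aggregate across layers and time steps (assumption: a plain mean).
    mean_map = np.asarray(attn_maps, dtype=np.float64).mean(axis=0)
    region = mean_map[top:bottom, left:right]
    # Min-max normalize within the region so the threshold is scale-free.
    span = region.max() - region.min()
    if span > 0:
        normalized = (region - region.min()) / span
    else:
        normalized = np.zeros_like(region)
    # Pixels outside the region never belong to this instance.
    mask = np.zeros(canvas_shape, dtype=bool)
    mask[top:bottom, left:right] = normalized > threshold
    return mask
```

In this sketch each box returned by `split_canvas` would be paired with one text prompt, and `mask_from_attention` would be called once per prompt with the attention maps recorded for that prompt's object token.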