Abstract: In this paper, we present a novel paradigm to enhance the ability of an object detector, e.g., expanding categories or improving detection performance, by training on a synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained, generative diffusion model, to augment it with the ability of localising instances in the generated images. The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model, using supervision from an off-the-shelf object detector, and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that this enhanced version of the diffusion model, termed InstaGen, can serve as a data synthesizer to enhance object detectors by training on its generated samples, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios. Project page with code:
What problem does this paper attempt to address?
This paper addresses how to enhance object detectors, for example by expanding the set of detectable categories or improving detection performance, using synthetic datasets generated by diffusion models. Specifically, it proposes training detectors on synthetic images that are produced by a diffusion model and come with instance-level bounding box annotations. The motivation is that real detection datasets are difficult and time-consuming to collect and hard to extend to new categories. The synthetic data is used to improve detectors in three scenarios: open-vocabulary detection, data-sparse detection, and cross-dataset transfer.
### Main Contributions
1. **Image Synthesizer**: An existing diffusion model is fine-tuned into an image synthesizer that generates images containing multiple objects and complex backgrounds, which more closely resemble real-world detection scenes.
2. **Data Synthesis Framework**: A new data synthesis framework, InstaGen, is introduced. Its novel instance-level localization module generates bounding box annotations for the objects in synthetic images.
3. **Detector Training**: Standard object detectors are trained on a combination of real and synthetic datasets and show improved performance on multiple benchmarks, notably in open-vocabulary detection (+4.5 AP), data-sparse detection (+1.2 to +5.2 AP), and cross-dataset transfer (+0.5 to +1.1 AP).
### Method Overview
- **Problem Definition**: Given a manually annotated detection dataset of real images \( D_{\text{real}} \), the goal is to use it to turn a generative diffusion model into a data synthesizer that produces a synthetic dataset \( D_{\text{syn}} \), expanding the existing detection dataset to \( D_{\text{final}}=D_{\text{real}} + D_{\text{syn}} \).
- **Image Synthesizer**: Starting from the pre-trained Stable Diffusion model, the synthesizer is fine-tuned on the detection dataset so that it generates images containing multiple objects and complex backgrounds.
- **Instance-level Localization Module**: Training follows a two-step strategy. First, the localization module is trained on synthetic images with supervision from an off-the-shelf object detector. Second, the trained localization head generates pseudo-labels that are used for self-training on categories the detector does not cover. Together, these steps allow objects of arbitrary categories to be localized.
- **Detector Training**: Object detectors are trained on the combination of real and synthetic datasets and evaluated on multiple benchmarks.
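The pseudo-labelling and dataset-merging steps above can be illustrated with a minimal sketch. It is not the paper's implementation: the learned grounding head is replaced here by plain cosine similarity between region features and category text embeddings, and the names `Box`, `label_regions`, `build_final_dataset`, and the `threshold=0.8` cutoff are all hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Box:
    category: str
    xyxy: tuple   # (x1, y1, x2, y2) in pixels
    score: float  # confidence of the category assignment


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def label_regions(region_feats, region_boxes, text_embeds, threshold=0.8):
    """Assign each region its best-matching category name and keep only
    confident matches as pseudo-labels.

    Assumption: a fixed similarity threshold stands in for whatever
    filtering the actual self-training scheme uses.
    """
    kept = []
    for feat, xyxy in zip(region_feats, region_boxes):
        name, score = max(
            ((n, cosine(feat, e)) for n, e in text_embeds.items()),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            kept.append(Box(name, xyxy, score))
    return kept


def build_final_dataset(d_real, d_syn):
    """D_final = D_real + D_syn, as a simple concatenation of samples."""
    return list(d_real) + list(d_syn)


# Toy 2-D embeddings (hypothetical values, for illustration only).
text_embeds = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
region_feats = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.95]]
region_boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (30, 30, 50, 60)]

pseudo_labels = label_regions(region_feats, region_boxes, text_embeds)

d_real = [("real_0.jpg", [Box("cat", (1, 1, 5, 5), 1.0)])]
d_syn = [("syn_0.png", pseudo_labels)]
d_final = build_final_dataset(d_real, d_syn)
```

In this toy run the ambiguous middle region is discarded because it is equally similar to both categories, which mirrors the intent of keeping only confident pseudo-labels for self-training.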
### Experimental Results
- **Open-vocabulary Detection**: On the COCO benchmark, a detector trained with the synthetic dataset generated by InstaGen improves AP50 on novel categories by about +5 over existing CLIP-based methods.
- **Data-sparse Detection**: When the amount of real data is limited, adding synthetic data improves detector performance (+1.2 to +5.2 AP).
- **Cross-dataset Transfer**: In transfer from COCO to the Objects365 and LVIS datasets, the synthetic dataset generated by InstaGen also yields gains (+0.5 to +1.1 AP).
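For readers unfamiliar with the AP50 metric quoted above: it is average precision computed with an intersection-over-union (IoU) threshold of 0.5, so a predicted box counts as a match only if it overlaps a ground-truth box of the same category with IoU of at least 0.5. The sketch below shows only that matching test; the helper names `iou` and `is_match_at_50` are hypothetical, and the ranking and precision-recall integration needed for a full AP computation are omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap rectangle (zero if boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_match_at_50(pred_box, gt_box):
    """True if the prediction would count as a match under the AP50 criterion."""
    return iou(pred_box, gt_box) >= 0.5


gt = (0, 0, 10, 10)
close_pred = (0, 0, 10, 8)    # covers most of the ground-truth box
far_pred = (5, 5, 15, 15)     # overlaps only one corner
```

Here `close_pred` passes the 0.5 threshold and `far_pred` does not, even though both overlap the ground-truth box.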
In conclusion, the proposed method reduces reliance on real datasets that are difficult to collect and, by training on synthetic data, improves object detector performance in open-vocabulary, data-sparse, and cross-dataset transfer scenarios.