Abstract: In this paper, we present a novel paradigm to enhance the ability of an object detector, e.g., expanding categories or improving detection performance, by training on a synthetic dataset generated from diffusion models. Specifically, we integrate an instance-level grounding head into a pre-trained, generative diffusion model, to augment it with the ability of localising instances in the generated images. The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model, using supervision from an off-the-shelf object detector, and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that this enhanced version of the diffusion model, termed InstaGen, can serve as a data synthesizer to enhance object detectors by training on its generated samples, demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 to 5.2 AP) scenarios. Project page with code:
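
The alignment between category-name text embeddings and regional visual features described in the abstract can be illustrated with a small sketch. This is not the InstaGen grounding head, which is a trained module. It only assumes that each candidate region and each category name is already represented by a vector, and it uses cosine similarity with a hypothetical `score_threshold` to decide which category a region matches. The function name and the toy vectors are illustrative.

```python
import numpy as np


def align_regions_to_categories(region_feats, text_embeds, score_threshold=0.5):
    """Score each region against each category name by cosine similarity.

    region_feats: (R, D) array, one visual feature per candidate region.
    text_embeds:  (C, D) array, one text embedding per category name.
    Returns (region_index, category_index, score) for confident matches.
    """
    # L2-normalise both sides so the dot product equals cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sim = r @ t.T  # shape (R, C)

    best = sim.argmax(axis=1)  # best-matching category per region
    matches = []
    for i, c in enumerate(best):
        score = float(sim[i, c])
        # The threshold is an assumption of this sketch, not a value from the paper.
        if score >= score_threshold:
            matches.append((i, int(c), score))
    return matches


# Toy stand-ins for real embeddings (assumed values, 3-dimensional for readability).
text_embeds = np.array([[1.0, 0.0, 0.0],    # category 0
                        [0.0, 1.0, 0.0]])   # category 1
region_feats = np.array([[2.0, 0.1, 0.0],   # points towards category 0
                         [0.0, 3.0, 0.2],   # points towards category 1
                         [0.0, 0.0, 1.0]])  # orthogonal to both categories

matches = align_regions_to_categories(region_feats, text_embeds)
```

Because categories enter only through their text embeddings, a new category name can be scored without changing the function, which is the property that makes this kind of alignment attractive for open-vocabulary settings.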

What problem does this paper attempt to address?

This paper addresses how to enhance object detectors, for example by expanding the detectable categories or improving detection performance, using synthetic datasets generated by diffusion models. Specifically, it proposes training detectors on synthetic images that are produced by a diffusion model and carry instance-level bounding box annotations. The aim is to work around the limitations of real datasets, which are difficult and time-consuming to collect and hard to extend, and to improve detector performance in open-vocabulary detection, data-sparse detection, and cross-dataset transfer.

### Main Contributions

1. **Image Synthesizer**: An existing diffusion model is fine-tuned into an image synthesizer that generates images containing multiple objects and complex backgrounds, giving a closer simulation of real-world detection scenarios.
2. **Data Synthesis Framework**: A new data synthesis framework, InstaGen, is introduced. Its novel instance-level localization module generates bounding box annotations for the objects in synthetic images.
3. **Detector Training**: Standard object detectors are trained on a combination of real and synthetic datasets and show superior performance on multiple benchmarks, especially in open-vocabulary detection (+4.5 AP), data-sparse detection (+1.2 to +5.2 AP), and cross-dataset transfer (+0.5 to +1.1 AP).
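
The idea of training on a combination of real and synthetic data can be sketched as a simple dataset merge. This is a minimal illustration under stated assumptions, not the paper's data pipeline: the `Sample` record, the `(x_min, y_min, x_max, y_max)` box convention, and the rule of skipping synthetic images with no localised instance are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box convention for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class Sample:
    image_id: str
    boxes: List[Box]
    labels: List[str]
    source: str  # "real" or "syn"


def combine(real: List[Sample], syn: List[Sample]) -> List[Sample]:
    """Concatenate real and synthetic samples into one training set.

    Synthetic samples without any box are skipped; this filter is an
    assumption of the sketch, not a rule taken from the paper.
    """
    usable_syn = [s for s in syn if s.boxes]
    return list(real) + usable_syn


real = [Sample("real_0", [(10, 20, 110, 220)], ["person"], "real")]
syn = [
    Sample("syn_0", [(5, 5, 50, 60)], ["umbrella"], "syn"),
    Sample("syn_1", [], [], "syn"),  # nothing was localised, so it is skipped
]

final = combine(real, syn)
categories = sorted({label for s in final for label in s.labels})
```

In this toy case the real data covers only `person`, while the merged set also covers `umbrella`, which mirrors how synthetic samples can expand the categories a detector is trained on.
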
### Method Overview

- **Problem Definition**: Given a real-image detection dataset \( D_{\text{real}} \) with manual annotations, the goal is to use it to guide a generative diffusion model into acting as a data synthesizer that expands the existing detection dataset: \( D_{\text{final}} = D_{\text{real}} + D_{\text{syn}} \).
- **Image Synthesizer**: Starting from the pre-trained Stable Diffusion model, fine-tuning on the detection dataset yields images that contain multiple objects and complex backgrounds.
- **Instance-level Localization Module**: Training follows a two-step strategy. First, the localization module is trained with supervision on synthetic images. Then the trained localization head generates pseudo-labels for self-training on unseen categories, so that objects of any category can be localized.
- **Detector Training**: Object detectors are trained on a combination of real and synthetic datasets and show superior performance on multiple benchmarks.

### Experimental Results

- **Open-vocabulary Detection**: On the COCO benchmark, a detector trained with the synthetic dataset generated by InstaGen improves AP50 on novel categories by about +5 over existing CLIP-based methods.
- **Data-sparse Detection**: When the amount of real data is limited, adding synthetic data significantly improves detector performance.
- **Cross-dataset Transfer**: In transfer from COCO to Objects365 and LVIS, the synthetic dataset generated by InstaGen also performs well.

In conclusion, the proposed method eases the difficulty of collecting real datasets and significantly improves object detector performance in multiple scenarios through synthetic data.
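
The second training step, in which a trained localization head produces pseudo-labels for categories it was never supervised on, can be sketched as a confidence filter. The summary does not state how InstaGen selects pseudo-labels, so the `min_score` threshold, the dictionary layout of a prediction, and the function name below are assumptions made purely for illustration.

```python
from typing import Dict, List, Set


def select_pseudo_labels(
    predictions: List[Dict],
    seen_categories: Set[str],
    min_score: float = 0.8,
) -> List[Dict]:
    """Keep confident predictions on categories that lack detector supervision.

    Each prediction is assumed to be a dict with "category", "box" and "score".
    Seen categories are skipped because they already have supervised labels.
    """
    kept = []
    for pred in predictions:
        if pred["category"] in seen_categories:
            continue
        # The score threshold is a hypothetical choice for this sketch.
        if pred["score"] >= min_score:
            kept.append(pred)
    return kept


seen = {"person", "car"}
predictions = [
    {"category": "person", "box": (0, 0, 10, 10), "score": 0.95},    # seen category
    {"category": "umbrella", "box": (5, 5, 40, 60), "score": 0.91},  # unseen, confident
    {"category": "kite", "box": (1, 2, 30, 20), "score": 0.40},      # unseen, too uncertain
]

pseudo_labels = select_pseudo_labels(predictions, seen)
```

The retained predictions would then serve as training targets for the next self-training round. A stricter threshold trades fewer pseudo-labels for cleaner ones.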

InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models

A Simple Background Augmentation Method for Object Detection with Diffusion Model

Gen2Det: Generate to Detect

AeroGen: Enhancing Remote Sensing Object Detection with Diffusion-Driven Data Generation

DIAGen: Diverse Image Augmentation with Generative Models

The Big Data Myth: Using Diffusion Models for Dataset Generation to Train Deep Detection Models

Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

Diverse Generation while Maintaining Semantic Coordination: A Diffusion-Based Data Augmentation Method for Object Detection

Towards Multi-domain Face Landmark Detection with Synthetic Data from Diffusion model

Synthetic Data from Diffusion Models Improves ImageNet Classification

DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

Improving Synthetic Image Detection Towards Generalization: An Image Transformation Perspective

FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion

Stable Diffusion Dataset Generation for Downstream Classification Tasks

DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery

Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion

DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection