ODGEN: Domain-specific Object Detection Data Generation with Diffusion Models

Jingyuan Zhu, Shiyu Li, Yuxuan Liu, Ping Huang, Jiulong Shan, Huimin Ma, Jian Yuan
2024-05-24
Abstract: Modern diffusion-based image generative models have made significant progress and become promising to enrich training data for the object detection task. However, the generation quality and the controllability for complex scenes containing multi-class objects and dense objects with occlusions remain limited. This paper presents ODGEN, a novel method to generate high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection. Given a domain-specific object detection dataset, we first fine-tune a pre-trained diffusion model on both cropped foreground objects and entire images to fit target distributions. Then we propose to control the diffusion model using synthesized visual prompts with spatial constraints and object-wise textual descriptions. ODGEN exhibits robustness in handling complex scenes and specific domains. Further, we design a dataset synthesis pipeline to evaluate ODGEN on 7 domain-specific benchmarks to demonstrate its effectiveness. Adding training data generated by ODGEN improves up to 25.3% mAP@.50:.95 with object detectors like YOLOv5 and YOLOv7, outperforming prior controllable generative methods. In addition, we design an evaluation protocol based on COCO-2014 to validate ODGEN in general domains and observe an advantage up to 5.6% in mAP@.50:.95 against existing methods.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of generating high-quality images of complex scenes to enrich the training data for object detection. It points out that current diffusion-based image generation methods still offer limited generation quality and controllability for complex scenes containing multi-class objects, dense objects, and occlusions. To address these challenges, the paper proposes ODGEN (Domain-specific Object Detection Data Generation with Diffusion Models), a new method that generates high-quality images conditioned on bounding boxes, thereby facilitating data synthesis for object detection.

### Main Problems

1. **Generation Quality and Controllability**: Existing diffusion models offer insufficient generation quality and controllability for complex scenes containing multi-class objects, dense objects, and occlusions.
2. **Domain Differences**: Large-scale pre-trained diffusion models are usually trained on web-crawled datasets such as LAION, whose distributions may differ significantly from those of domain-specific datasets, reducing the fidelity of generated images.
3. **Concept Confusion**: Text prompts that name objects of multiple classes may cause "concept confusion", in which different visual elements are unintentionally merged or overlapped in the image.
4. **Object Merging and Ignoring**: Existing methods may merge overlapping objects into a single object or, in some cases, ignore objects and fail to generate the foreground.

### Solutions

1. **Domain-Specific Fine-Tuning**: Fine-tune the pre-trained diffusion model on both entire images and cropped foreground objects to improve the synthesis quality of background scenes and foreground objects.
2. **Object-Level Conditional Control**:
   - **Text List Encoding**: Encode the class name of each object separately to avoid interference between different concepts.
   - **Image List Encoding**: Resize the generated foreground object images and paste them onto an empty canvas according to the bounding box annotations, providing both conceptual and spatial information.
3. **Foreground-Background Discriminator**: Train a foreground-background discriminator to check whether each pseudo-label region contains a synthesized object, and filter out objects that were not generated successfully.

### Experimental Results

- **Domain-Specific**: On seven representative datasets from Roboflow-100, images generated by ODGEN outperform other methods in both FID and the mAP@.50:.95 of YOLOv5/YOLOv7.
- **General-Domain**: On the COCO-2014 dataset, ODGEN also significantly outperforms other methods in FID and mAP@.50:.95.

### Main Contributions

1. Proposed a new method to generate high-quality domain-specific objects and background scenes by fine-tuning the diffusion model.
2. Designed an object-level conditional control strategy to improve the generation and control of complex scenes.
3. Verified the effectiveness of the synthetic data through extensive experiments and demonstrated superior performance in both specific and general domains.

### Formulas

- **Reconstruction Loss**:
  \[ L_{\text{rec}} = \mathbb{E}_{x_o, t, \epsilon_o \sim \mathcal{N}(0,1)} \left[ \lVert \epsilon_o - \epsilon_\theta(x_t^o, t, \tau(c_o)) \rVert^2 \right] + \lambda \, \mathbb{E}_{x_s, t, \epsilon_s \sim \mathcal{N}(0,1)} \left[ \lVert \epsilon_s - \epsilon_\theta(x_t^s, t, \tau(c_s)) \rVert^2 \right] \]
  where \(\lambda\) controls the relative weight of the scene image reconstruction loss, and \(\tau\) is the frozen CLIP text encoder.
- **Control Loss**:
  \[ L_{\text{control}} = L_{\text{recon}} + \gamma L_{\text{recon}} \odot M \]
  where \(M\) is a binary mask with 1 for foreground pixels and 0 for background pixels, and \(\odot\)
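
To make the foreground re-weighting in the control loss concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: \(L_{\text{recon}}\) is treated as a per-pixel squared error between the true and predicted noise, \(\odot\) is read as element-wise multiplication (the definition is cut off in the text above), the mask \(M\) is built directly from integer pixel-coordinate bounding boxes at the same resolution as the error map, and the result is averaged over all elements. The function names `boxes_to_mask` and `control_loss` are hypothetical.

```python
import numpy as np


def boxes_to_mask(boxes, height, width):
    """Build a binary mask M: 1 inside any bounding box, 0 elsewhere.

    Assumption: boxes are (x_min, y_min, x_max, y_max) in integer pixel
    coordinates, with the max edges exclusive.
    """
    mask = np.zeros((height, width), dtype=np.float32)
    for x_min, y_min, x_max, y_max in boxes:
        mask[y_min:y_max, x_min:x_max] = 1.0
    return mask


def control_loss(noise, noise_pred, mask, gamma):
    """Average of L_recon + gamma * (L_recon * M) over all elements.

    Assumption: L_recon is the per-pixel squared error and the
    (H, W) mask broadcasts over the channel axis of (C, H, W) inputs.
    """
    recon = (noise - noise_pred) ** 2
    weighted = recon + gamma * recon * mask
    return float(weighted.mean())


# Toy example: one 2x2 foreground box on a 4x4 single-channel map.
mask = boxes_to_mask([(1, 1, 3, 3)], height=4, width=4)
noise = np.zeros((1, 4, 4), dtype=np.float32)
noise_pred = np.ones((1, 4, 4), dtype=np.float32)
loss = control_loss(noise, noise_pred, mask, gamma=2.0)
```

In this toy case every pixel has a squared error of 1, so the 4 foreground pixels contribute 1 + \(\gamma\) each and the 12 background pixels contribute 1 each. A larger \(\gamma\) therefore shifts the training signal toward the regions covered by the bounding boxes.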