Abstract: Conventional object detection models are usually limited by the data on which they were trained and by the category logic they define. With the recent rise of Language-Visual Models, new methods have emerged that are not restricted to these fixed categories. Despite their flexibility, such Open Vocabulary detection models still fall short in accuracy compared to traditional models with fixed classes. At the same time, more accurate data-specific models face challenges when there is a need to extend classes or merge different datasets for training. The latter often cannot be combined due to different logics or conflicting class definitions, making it difficult to improve a model without compromising its performance. In this paper, we introduce CerberusDet, a framework with a multi-headed model designed for handling multiple object detection tasks. The proposed model is built on the YOLO architecture and efficiently shares visual features from both backbone and neck components, while maintaining separate task heads. This approach allows CerberusDet to perform very efficiently while still delivering optimal results. We evaluated the model on the PASCAL VOC dataset and Objects365 dataset to demonstrate its abilities. CerberusDet achieved state-of-the-art results with 36% less inference time. The more tasks are trained together, the more efficient the proposed model becomes compared to running individual models sequentially. The training and inference code, as well as the model, are available as open source (<a class="link-external link-https" href="https://github.com/ai-forever/CerberusDet" rel="external noopener nofollow">this https URL</a>).

What problem does this paper attempt to address?

This paper addresses several key challenges in multi-dataset object detection:

1. **Difficulty in class expansion**: When new object classes must be added to an existing real-time application, the classes labeled in different datasets may be inconsistent. Objects that are present in an image may be unlabeled in a given dataset, which makes class expansion difficult.
2. **Problems with dataset merging**: Because datasets differ in labeling logic and class definitions, it is usually not feasible to merge them directly for training. This makes it difficult to improve a model without sacrificing performance.
3. **Trade-off between efficiency and accuracy**: Traditional dataset-specific models are accurate, but they are hard to extend with new classes or to train on merged datasets. Open-vocabulary detection models are highly flexible, but their accuracy is usually lower than that of dataset-specific models, and they are prone to overfitting to the base classes.

To solve these problems, the authors propose the **CerberusDet** framework, a multi-task object detection model based on the YOLO architecture. By sharing visual features and keeping independent task heads, the model can be trained and run efficiently on multiple datasets while maintaining high accuracy.

### Main contributions

1. **Research on multi-dataset and multi-task detection methods**: Different parameter-sharing strategies and training methods were explored to optimize multi-task learning.
2. **Experimental results**: Multiple experiments were carried out on public datasets, demonstrating the effectiveness of the proposed method.
3. **Introduction of a novel framework**: A multi-branch object detection model, CerberusDet, was proposed that can be customized for different computational resource requirements.
4. **Open-source code and model**: The training and inference code, as well as the pre-trained model, were released to encourage further research and development.

### Core technologies of the solution

- **Hard Parameter Sharing**: Sharing the parameters of the backbone network across tasks reduces the consumption of computational resources, while each task retains its own head parameters.
- **Representation Similarity Analysis (RSA)**: RSA is used to estimate the similarity between tasks, which determines which modules can be shared and which should be task-specific.
- **Gradient Averaging Method**: During training, the gradients of the shared parameters are averaged across tasks to balance conflicts between them.

These technologies enable CerberusDet to be trained efficiently on multiple datasets and to significantly reduce computation time during inference while maintaining or even improving detection accuracy.
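The combination of hard parameter sharing and gradient averaging described above can be illustrated with a minimal NumPy sketch. It is a toy illustration under stated assumptions, not the CerberusDet implementation. The shared "backbone" is a single weight matrix, each task head is a linear regressor, and the loss is mean squared error. The names `task_gradients` and `train_step`, the task names, the shapes, and the learning rate are all hypothetical.

```python
import numpy as np


def task_gradients(w_shared, w_head, x, y):
    """Gradients of the MSE loss for y_hat = (x @ w_shared) @ w_head.

    Toy stand-in for one detection task: w_shared plays the role of the
    shared backbone, w_head the role of the task-specific head.
    """
    n = x.shape[0]
    feats = x @ w_shared                  # shared features, shape (n, d_feat)
    err = feats @ w_head - y              # residuals, shape (n, 1)
    g_head = (2.0 / n) * feats.T @ err    # gradient w.r.t. the task head
    g_shared = (2.0 / n) * x.T @ (err @ w_head.T)  # gradient w.r.t. shared weights
    return g_shared, g_head


def train_step(w_shared, heads, batches, lr=0.01):
    """One SGD step: heads use their own gradient, shared weights use the task average."""
    shared_grads = []
    new_heads = {}
    for name, (x, y) in batches.items():
        g_shared, g_head = task_gradients(w_shared, heads[name], x, y)
        shared_grads.append(g_shared)
        # Each head is updated only from its own task's data.
        new_heads[name] = heads[name] - lr * g_head
    # Shared parameters receive the average of the per-task gradients.
    avg_shared_grad = np.mean(shared_grads, axis=0)
    return w_shared - lr * avg_shared_grad, new_heads


# Hypothetical two-task setup with random data (illustration only).
rng = np.random.default_rng(0)
w_shared = rng.normal(size=(4, 3))
heads = {"task_a": rng.normal(size=(3, 1)), "task_b": rng.normal(size=(3, 1))}
batches = {
    "task_a": (rng.normal(size=(8, 4)), rng.normal(size=(8, 1))),
    "task_b": (rng.normal(size=(8, 4)), rng.normal(size=(8, 1))),
}
new_shared, new_heads = train_step(w_shared, heads, batches)
print(new_shared.shape, sorted(new_heads))
```

In this sketch each head sees only its own dataset, so conflicting label definitions never meet in a single loss. Only the shared weights have to reconcile the tasks, and averaging is the simplest way to do so.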

CerberusDet: Unified Multi-Dataset Object Detection

Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics

Universal Object Detection with Large Vision Model

A MultiPath Network for Object Detection

Multiclass objects detection algorithm using DarkNet-53 and DenseNet for intelligent vehicles

ScaleDet: A Scalable Multi-Dataset Object Detector

Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing

OmDet: Large‐scale vision‐language multi‐dataset pre‐training with multimodal detection network

Multi-Modal Classifiers for Open-Vocabulary Object Detection

Simple Multi-dataset Detection

Plain-Det: A Plain Multi-Dataset Object Detector

Towards Multi-class Object Detection in Unconstrained Remote Sensing Imagery

Remote intelligent perception system for multi-object detection

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

CCDS-YOLO: Multi-Category Synthetic Aperture Radar Image Object Detection Model Based on YOLOv5s

UniDetector: Towards Universal Object Detection with Heterogeneous Supervision

UniHead: Unifying Multi-Perception for Detection Heads

OV-Uni3DETR: Towards Unified Open-Vocabulary 3D Object Detection via Cycle-Modality Propagation

Uni$^2$Det: Unified and Universal Framework for Prompt-Guided Multi-dataset 3D Detection

UniDet3D: Multi-dataset Indoor 3D Object Detection

MultIOD: Rehearsal-free Multihead Incremental Object Detector