Abstract: Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through the exploration of more effective network structures. In this paper, we propose a novel and flexible backbone framework, namely CBNetV2, to construct high-performance detectors using existing open-source pre-trained backbones under the pre-training fine-tuning paradigm. In particular, the CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. Specifically, it integrates the high- and low-level features of multiple backbone networks and gradually expands the receptive field to perform object detection more efficiently. We also propose a better training strategy with assistant supervision for CBNet-based detectors. Without additional pre-training of the composite backbone, CBNetV2 can be adapted to various backbones (CNN-based vs. Transformer-based) and to the head designs of most mainstream detectors (one-stage vs. two-stage, anchor-based vs. anchor-free). Experiments provide strong evidence that, compared with simply increasing the depth and width of the network, CBNetV2 offers a more efficient, effective, and resource-friendly way to build high-performance backbone networks. In particular, our Dual-Swin-L achieves 59.4% box AP and 51.6% mask AP on COCO test-dev under the single-model, single-scale testing protocol, which is significantly better than the state-of-the-art result (57.7% box AP and 50.2% mask AP) achieved by Swin-L, while the training schedule is reduced by 6$\times$. With multi-scale testing, we push the current best single-model result to a new record of 60.1% box AP and 52.3% mask AP without using extra training data. Code is available at https://github.com/VDIGPKU/CBNetV2.
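
The idea of grouping identical backbones through composite connections can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the connection pattern, so the code assumes that each stage of a later backbone receives, added to its input, the output of the same-index stage of the preceding backbone. The names `make_stage`, `add`, `composite_forward`, and `connect` are hypothetical, and features are plain lists of floats rather than image tensors.

```python
from typing import Callable, List

Feature = List[float]
Stage = Callable[[Feature], Feature]


def make_stage(scale: float) -> Stage:
    """Toy stand-in for a backbone stage: scales every feature value."""
    return lambda x: [scale * v for v in x]


def add(a: Feature, b: Feature) -> Feature:
    """Element-wise sum of two equally sized feature vectors."""
    return [u + v for u, v in zip(a, b)]


def composite_forward(
    backbones: List[List[Stage]],
    x: Feature,
    connect: Callable[[Feature], Feature] = lambda f: f,
) -> List[List[Feature]]:
    """Run the backbones one after another on the same input x.

    Assumption (not taken from the abstract): before stage i of a later
    backbone runs, connect(output of stage i of the previous backbone)
    is added to its input. `connect` is a placeholder for whatever
    transform a real composite connection would apply.
    Returns the per-stage outputs of every backbone.
    """
    all_outputs: List[List[Feature]] = []
    prev_outputs = None
    for stages in backbones:
        h = x
        outputs = []
        for i, stage in enumerate(stages):
            if prev_outputs is not None:
                h = add(h, connect(prev_outputs[i]))
            h = stage(h)
            outputs.append(h)
        all_outputs.append(outputs)
        prev_outputs = outputs
    return all_outputs


# Two identical two-stage backbones applied to a one-element feature.
stages = [make_stage(2.0), make_stage(3.0)]
result = composite_forward([stages, stages], [1.0])
lead_features = result[-1]  # outputs of the last backbone
```

In this sketch the per-stage outputs of the last backbone (`lead_features`) are what a detection head would consume; how CBNetV2 actually selects and supervises features is described in the paper itself, not here.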

CrossNet: Detecting Objects As Crosses

SSF: Sparse Point Cloud Object Detection Based on Self-Adaptive Voxel Encoding and Focal-Sparse Convolution

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

DE-CrossDet: Divisible and Extensible Crossline Representation for Object Detection

Hierarchical Multi-Scale Network for Cross-Scale Visual Defect Detection

C2BG-Net: Cross-modality and cross-scale balance network with global semantics for multi-modal 3D object detection

CFANet: A Cross-layer Feature Aggregation Network for Camouflaged Object Detection

Multi-Scale Interactive Network for Salient Object Detection

CenterNet: Keypoint Triplets for Object Detection

Improved CenterNet for Accurate and Fast Fitting Object Detection

Objects as Points

CenterNet3D: An Anchor Free Object Detector for Point Cloud

CBi-GNN: Cross-Scale Bilateral Graph Neural Network for 3D Object Detection

CBNet: A Composite Backbone Network Architecture for Object Detection

Cross-Layer Attention Network for Small Object Detection in Remote Sensing Imagery

Cross-dataset Training for Class Increasing Object Detection

HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection

X-CDNet: A real-time crosswalk detector based on YOLOX

Semantic-aware 3D-voxel CenterNet for point cloud object detection

Spatial-Transformer and Cross-Scale Fusion Network (STCS-Net) for Small Object Detection in Remote Sensing Images