Abstract: Video object detection has attracted increasing attention in recent years. Off-the-shelf video object detection methods have achieved great success by carefully designing various types of feature aggregation, but they overlook class-aware supervision and thus still suffer from classification incapability: classification between objects with deteriorated or similar appearances is error-prone. In this paper, we propose a novel class-aware dual-supervised aggregation network (CDANet) for video object detection, which includes three substantial improvements that alleviate the classification incapability of previous methods. First, we develop a class-aware cross-modality distillation supervision that transfers the semantic knowledge of label data to the features of video data, enhancing the semantic representations of the features. Second, we design a graph-guided feature aggregation module that models the structural relations between features with a dynamic residual graph convolutional network, enabling CDANet to aggregate features more effectively in the temporal domain. Third, we present a class-aware proposal contrastive supervision that maximizes intra-class agreement and inter-class disagreement, which improves the semantic discriminability of the features. The class-aware dual supervision and the feature aggregation are tightly integrated into a unified end-to-end framework, so that CDANet fully exploits class-specific semantic knowledge and inter-frame temporal dependencies to enhance object appearance representations, which in turn facilitates the classification of detected objects. We conduct experiments on the challenging ImageNet VID dataset, and the results demonstrate the superiority of CDANet over state-of-the-art methods. Notably, CDANet achieves 85.4% mAP with ResNet-101 and 86.5% mAP with ResNeXt-101.
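
The abstract states the goal of the class-aware proposal contrastive supervision (maximizing intra-class agreement and inter-class disagreement) but not its exact form. As a rough illustration only, the sketch below assumes it behaves like a standard supervised contrastive loss over proposal features; the function name `class_aware_contrastive_loss`, the `temperature` parameter, and the NumPy formulation are hypothetical choices and are not taken from the paper.

```python
import numpy as np

def class_aware_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over proposal features.

    Illustrative assumption: proposals with the same class label are
    positives, all other proposals are negatives. This is a generic
    formulation, not the paper's confirmed loss.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]
    if n < 2:
        return 0.0  # no pairs to compare

    # L2-normalize so that dot products are cosine similarities.
    norms = np.maximum(np.linalg.norm(features, axis=1, keepdims=True), 1e-12)
    z = features / norms

    # Pairwise similarities, with self-pairs excluded from the softmax.
    self_mask = np.eye(n, dtype=bool)
    sim = np.where(self_mask, -np.inf, (z @ z.T) / temperature)

    # Numerically stable log-softmax over the other proposals in each row.
    row_max = sim.max(axis=1, keepdims=True)
    shifted = sim - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: same label, different proposal.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # anchors that have at least one positive
    if not valid.any():
        return 0.0

    # Average log-probability of positives per anchor, then over anchors.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())
```

Under this assumption, the loss is small when proposals of the same class lie close together in feature space and large when same-class proposals are scattered among proposals of other classes, which matches the intra-class agreement and inter-class disagreement objective described above.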

Dual Selection Network for Video Object Detection

Dual-Branch Feature Fusion Network for Salient Object Detection

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Dual Semantic Fusion Network for Video Object Detection

Bidirectional Cross-Selective Attention Network for Video Salient Object Detection

Proposal Distillation of Multi-Modal Feature Aggregation Network for Video Object Detection

Class-Aware Dual-Supervised Aggregation Network for Video Object Detection

Feature Selective Networks for Object Detection

Dual Refinement Network for Single-Shot Object Detection

EBiDA-FPN: Enhanced Bi-Directional Attention Feature Pyramid Network for Object Detection

DSFD: Dual Shot Face Detector

DGRNet: A Dual-Level Graph Relation Network for Video Object Detection

Dual Attention Based Image Pyramid Network for Object Detection

DPSSD: Dual-Path Single-Shot Detector

Dynamic Selection Network for RGB-D Salient Object Detection

Confidence-guided Adaptive Gate and Dual Differential Enhancement for Video Salient Object Detection

DualHead for One-stage Object Detection Networks with Receptive Field Enhancement

M2RNet: Multi-modal and Multi-scale Refined Network for RGB-D Salient Object Detection

DF-Net: Diversity-Focused Network for Video Object Detection

HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection