Abstract: The effective use of multi-scale features remains an open problem in object detection. Recently proposed object detectors usually use Feature Pyramid Networks (FPN) to fuse multi-scale features. Because FPN uses a relatively simple feature map fusion approach, semantic information can be lost or misaligned during fusion. Several works have demonstrated that adding a bottom-up structure to a Feature Pyramid Network shortens the information path between the lower layers and the topmost feature, allowing an adequate exchange of semantic information between layers. We further enhance the bottom-up path by proposing a multi-scale residual aggregation Feature Pyramid Network (MSRA-FPN), which uses a unidirectional cross-layer residual module to aggregate features from multiple layers bottom-up, in a triangular structure, to the topmost layer. In addition, we introduce a Residual Squeeze and Excitation Module to mitigate the aliasing effects that occur when features from different layers are aggregated. MSRA-FPN enhances the semantic information of the high-level feature maps, mitigates information decay during feature fusion, and improves the model's ability to detect large objects. Experiments show that MSRA-FPN improves the performance of three baseline models by 0.5–1.9% on the PASCAL VOC dataset and is competitive with other state-of-the-art FPN methods. On the MS COCO dataset, it improves the baseline model's performance by 0.8% and its large-object detection performance by 1.8%. To further validate the effectiveness of MSRA-FPN for large object detection, we constructed the Thangka Figure Dataset and conducted comparative experiments. On this dataset, our method improves the performance of the baseline model by 2.9–4.7%, reaching up to 71.2%.
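
The abstract names a Residual Squeeze and Excitation Module but does not specify its layout. The sketch below shows one plausible reading, assuming a standard squeeze-and-excitation gate (global average pooling, two fully connected layers, sigmoid) whose rescaled output is added back to the input. The function name `residual_se`, the weight matrices `w1` and `w2`, the omission of biases, and the form of the residual connection are all assumptions made for illustration, not the authors' implementation.

```python
import math


def residual_se(feature, w1, w2):
    """Hypothetical residual squeeze-and-excitation on one feature map.

    feature: list of C channels, each an H x W list of lists of floats.
    w1:      hidden x C weight matrix (channel reduction; biases omitted).
    w2:      C x hidden weight matrix (channel expansion; biases omitted).
    Returns a feature map with the same shape as `feature`.
    """
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature
    ]

    # Excitation, step 1: fully connected layer followed by ReLU.
    hidden = [
        max(0.0, sum(w * s for w, s in zip(weights, squeezed)))
        for weights in w1
    ]

    # Excitation, step 2: fully connected layer followed by a sigmoid,
    # giving one gate in (0, 1) per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(weights, hidden))))
        for weights in w2
    ]

    # Assumed residual form: output = input + gate * input, so the gate
    # can only add to the original response, never suppress it entirely.
    return [
        [[value + gate * value for value in row] for row in channel]
        for channel, gate in zip(feature, gates)
    ]
```

With all-zero weights every gate equals 0.5, so each value is scaled by 1.5. In a trained network the gates would differ per channel, which is how such a module could reweight aggregated features from different pyramid levels.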

Relation-Guided Multi-stage Feature Aggregation Network for Video Object Detection

Multilevel Spatial-Temporal Feature Aggregation for Video Object Detection

Spatial-Temporal Feature Aggregation Network for Video Object Detection

DFA: Dynamic Feature Aggregation for Efficient Video Object Detection

Temporal-adaptive sparse feature aggregation for video object detection

Adaptive Feature Aggregation for Video Object Detection

DGRNet: A Dual-Level Graph Relation Network for Video Object Detection

Multi-view Aggregation for Real-Time Accurate Object Detection of a Moving Camera

Video Visual Relation Detection Via Multi-modal Feature Fusion

Fianet: Video Object Detection Via Joint Feature-Level and Instance-Level Aggregation

Multi-Scale Residual Aggregation Feature Pyramid Network for Object Detection

Video object detection via space–time feature aggregation and result reuse

Local Attention Sequence Model for Video Object Detection

Adaptive Scale and Spatial Aggregation for Real-Time Object Detection

Practical Video Object Detection via Feature Selection and Aggregation

Real-Time and Accurate Object Detection in Compressed Video by Long Short-term Feature Aggregation

MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection

Frame Aggregation and Multi-Modal Fusion Framework for Video-Based Person Recognition

Multi-scale spatio-temporal feature adaptive aggregation for video-based Person Re-identification

Memory Enhanced Global-Local Aggregation for Video Object Detection

Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection