Abstract: The effective use of multi-scale features remains an open problem for object detection. Recently proposed object detectors usually use Feature Pyramid Networks (FPN) to fuse multi-scale features. Because FPN uses a relatively simple feature map fusion approach, semantic information can be lost or misaligned during fusion. Several works have demonstrated that adding a bottom-up structure to a Feature Pyramid Network shortens the information path between the lower layers and the topmost feature, allowing an adequate exchange of semantic information between layers. We further enhance the bottom-up path by proposing a multi-scale residual aggregation Feature Pyramid Network (MSRA-FPN), which uses a unidirectional cross-layer residual module to aggregate features from multiple layers bottom-up, in a triangular structure, into the topmost layer. In addition, we introduce a Residual Squeeze and Excitation Module to mitigate the aliasing effects that occur when features from different layers are aggregated. MSRA-FPN enhances the semantic information of the high-level feature maps, mitigates information decay during feature fusion, and improves the model's ability to detect large objects. Experiments show that MSRA-FPN improves the performance of three baseline models by 0.5–1.9% on the PASCAL VOC dataset and is competitive with other state-of-the-art FPN methods. On the MS COCO dataset, it improves the performance of the baseline model by 0.8% and the baseline model's performance on large object detection by 1.8%. To further validate the effectiveness of MSRA-FPN for large object detection, we constructed the Thangka Figure Dataset and conducted comparative experiments. On this dataset, the proposed method improves the performance of the baseline model by 2.9–4.7% and reaches up to 71.2%.
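The abstract does not spell out the internals of the Residual Squeeze and Excitation Module. As a rough illustration only, the sketch below assumes the standard squeeze-and-excitation pattern (global average pooling, a two-layer bottleneck, sigmoid gates) with an identity skip connection added around the gated features. All function names, weight shapes, and the exact placement of the skip connection are hypothetical and not taken from the paper.

```python
import math


def squeeze(feature_map):
    """Global average pooling: reduce each channel (a 2D list) to one scalar."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def dense(vector, weights, biases):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def excite(descriptor, w1, b1, w2, b2):
    """Bottleneck MLP: ReLU on the hidden layer, sigmoid gate per channel."""
    hidden = [max(0.0, h) for h in dense(descriptor, w1, b1)]
    return [1.0 / (1.0 + math.exp(-g)) for g in dense(hidden, w2, b2)]


def residual_se(feature_map, w1, b1, w2, b2):
    """Assumed residual form: output = x + gate * x, computed per channel."""
    gates = excite(squeeze(feature_map), w1, b1, w2, b2)
    return [
        [[x + gate * x for x in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
```

Here `feature_map` is a list of channels, `w1` maps the channel descriptor to the bottleneck, and `w2` maps it back to one gate per channel. The skip connection means a channel is never suppressed below its input value, which is one plausible reading of how a residual variant could limit information loss when aggregated features are re-weighted.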

Exploring Multi-scale Deep Feature Fusion for Object Detection

Attention-based Fusion Factor in FPN for Object Detection

NLFFTNet: A non-local feature fusion transformer network for multi-scale object detection

MFIL-FCOS: A Multi-Scale Fusion and Interactive Learning Method for 2D Object Detection and Remote Sensing Image Detection

ℱ3-Net: Feature Fusion and Filtration Network for Object Detection in Optical Remote Sensing Images

Pyramid attention object detection network with multi-scale feature fusion

MDFN: Multi-scale deep feature learning network for object detection

Progressive structure network-based multiscale feature fusion for object detection in real-time application

RCNet: Reverse Feature Pyramid and Cross-scale Shift Network for Object Detection

Improving Object Detection in YOLOv8n with the C2f-f Module and Multi-Scale Fusion Reconstruction

An Adaptive Attention Fusion Mechanism Convolutional Network for Object Detection in Remote Sensing Images

End-to-End Fusion Network of Deep and Hand-Crafted Features for Small Object Detection

MM-FPN: Multi-path and Multi-scale Feature Pyramid Network for Object Detection

AFPN: Asymptotic Feature Pyramid Network for Object Detection

Multi-Scale Residual Aggregation Feature Pyramid Network for Object Detection

Joint-attention feature fusion network and dual-adaptive NMS for object detection

A Saliency Enhanced Feature Fusion based multiscale RGB-D Salient Object Detection Network

Cascaded Multi-Channel Feature Fusion for Object Detection

Feature Rescaling and Fusion for Tiny Object Detection

Enhanced semantic feature pyramid network for small object detection

NLFA: A Non Local Fusion Alignment Module for Multi-Scale Feature in Object Detection