MuTrans: Multiple Transformers for Fusing Feature Pyramid on 2D and 3D Object Detection

Bangquan Xie, Liang Yang, Ailin Wei, Xiaoxiong Weng, Bing Li
DOI: https://doi.org/10.1109/TIP.2023.3299190
Abstract: As one of the major components of a neural network, the feature pyramid plays a vital part in perception tasks such as object detection in autonomous driving. However, fusing multi-level and multi-sensor feature pyramids for object detection remains a challenge. This paper proposes a simple yet effective framework named MuTrans (Multiple Transformers) to fuse feature pyramids in a single-stream 2D detector or a two-stream 3D detector. Built on an encoder-decoder design, MuTrans focuses on the significant features via multiple Transformers. The MuTrans encoder uses three innovative self-attention mechanisms: Spatial-wise BoxAlign attention (SB) for low-level spatial locations, Context-wise Affinity attention (CA) for high-level context information, and high-level attention for multi-level features. The MuTrans decoder then processes these significant proposals, including the RoI and context affinity. In addition, the Low and High-level Fusion (LHF) in the encoder reduces the number of computational parameters, and Pre-LN is used to accelerate training convergence. LHF and Pre-LN are shown to mitigate, respectively, the computational complexity of self-attention and slow training convergence. The results demonstrate that MuTrans achieves higher detection accuracy than the baseline method, particularly in small object detection: 2.1 higher on the AP_S index for small objects on MS-COCO 2017 with a ResNeXt-101 backbone, 2.18 higher 3D detection accuracy (moderate difficulty) for small objects (pedestrians) on KITTI, and 6.85 higher on the RC index (Town05 Long) on the CARLA urban driving simulator platform.
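The abstract credits Pre-LN with faster training convergence but does not spell out what it changes. In the general Transformer literature, Pre-LN applies layer normalization before each sublayer, inside the residual branch, whereas Post-LN normalizes after the residual sum. The sketch below illustrates only that ordering. It is not the authors' implementation: the function names are hypothetical, the sublayer is an arbitrary callable standing in for attention or a feed-forward layer, and the learnable scale and shift of a real layer norm are omitted.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and (approximately) unit variance.

    Simplified: no learnable gain or bias, which a real layer would have.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_ln_block(x, sublayer):
    """Pre-LN residual block: x + sublayer(layer_norm(x)).

    The skip connection carries x through unnormalized.
    """
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]


def post_ln_block(x, sublayer):
    """Post-LN residual block: layer_norm(x + sublayer(x)).

    The normalization sits on the main path, after the residual sum.
    """
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])
```

With a sublayer that outputs zeros, `pre_ln_block` returns its input unchanged, while `post_ln_block` still rescales it. That untouched identity path is the property usually cited when Pre-LN is described as easier to train.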