Abstract: During feature-level data fusion in 3-D object detection, misalignment destroys the correlation between data from different modalities, which leads to inaccurate localization of small, distant targets. To address this problem, a transformer fusion information enhancement network (TFIENet) is proposed. First, the raw point cloud and color images are taken as input and passed through standard feature-extraction backbone networks to obtain LiDAR point cloud features and image features, respectively. Second, a region proposal network with transformer dual-fusion features is designed, which uses a deformable transformer decoder to fuse the extracted LiDAR point cloud features and image features twice, based on a deformable attention mechanism. The dual-domain feature information from the LiDAR and camera is then aggregated to generate the initial candidate boxes. Then, a feature information enhancement module further refines the boxes: it predicts dense depth features using a depth completion mechanism, and the corresponding dense depth information and semantic feature information are extracted to complete the box refinement. Finally, to align and fuse feature information from different modalities effectively, a multimodal feature cross-attention module (MFCAM) is designed, in which a dynamic cross-attention mechanism captures the correlation between modalities. Experimental results on the KITTI, nuScenes, and Waymo datasets demonstrate the generality and effectiveness of the proposed TFIENet method. Extensive ablation experiments demonstrate the efficiency of each individual module. Experimental results on a real road dataset show that TFIENet is robust in complex real-world road environments.
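
The cross-modal attention mentioned in the abstract can be illustrated with generic scaled dot-product cross-attention, in which one modality supplies the queries and the other supplies the keys and values. The sketch below is a minimal illustration in plain Python under stated assumptions. It is not the MFCAM or the deformable attention used by the authors, and the names `cross_attention`, `lidar_feats`, and `image_feats`, along with the toy feature values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (illustrative only).

    queries: list of d-dimensional vectors from one modality
             (assumed here to be LiDAR features).
    keys, values: lists of vectors from the other modality
             (assumed here to be image features).
    Returns one fused vector per query, plus the attention weights.
    """
    d = len(queries[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        w = softmax(scores)
        # Weighted sum of the value vectors, one channel at a time.
        out = [
            sum(wj * v[c] for wj, v in zip(w, values))
            for c in range(len(values[0]))
        ]
        fused.append(out)
        weights.append(w)
    return fused, weights


# Hypothetical toy features: 2 LiDAR tokens and 3 image tokens, 2 channels each.
lidar_feats = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused, attn = cross_attention(lidar_feats, image_feats, image_feats)
```

Each row of `attn` sums to one and indicates how strongly a LiDAR token attends to each image token. A learned module would additionally apply trainable projections to the queries, keys, and values, which this sketch omits.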

Image attention transformer network for indoor 3D object detection

TBFNT3D: Two-Branch Fusion Network with Transformer for Multimodal Indoor 3D Object Detection

Investigating Attention Mechanism in 3D Point Cloud Object Detection

Hierarchical Point Attention for Indoor 3D Object Detection

TCANet: three-stream coordinate attention network for RGB-D indoor semantic segmentation

TFIENet: Transformer Fusion Information Enhancement Network for Multi-Model 3D Object Detection

Transformer-based Network for RGB-D Saliency Detection

Multiattention Mechanism 3D Object Detection Algorithm Based on RGB and LiDAR Fusion for Intelligent Driving

ETFormer: an Efficient Transformer Based on Multimodal Hybrid Fusion and Representation Learning for RGB-D-T Salient Object Detection

PI-RCNN: An Efficient Multi-sensor 3D Object Detector with Point-based Attentive Cont-conv Fusion Module

MMAF-Net: Multi-view multi-stage adaptive fusion for multi-sensor 3D object detection

Long-short Range Adaptive Transformer with Dynamic Sampling for 3D Object Detection

AFMCT: adaptive fusion module based on cross-modal transformer block for 3D object detection

Improving 3D Object Detection with Context-Aware and Dimensional Interaction Attention

CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning

Cascaded Cross-Modality Fusion Network for 3D Object Detection

3D target detection using dual domain attention and SIFT operator in indoor scenes

Interactive Context-Aware Network for RGB-T Salient Object Detection

SCANet: Spatial-Channel Attention Network for 3D Object Detection

DMSC-Net: A deep Multi-Scale context network for 3D object detection of indoor point clouds