Abstract: In feature-level data fusion for 3-D object detection, misalignment destroys the correlation between data from different modalities, which leads to inaccurate localization of small, distant targets. To address this problem, a transformer fusion information enhancement network (TFIENet) is proposed. First, the raw point cloud and color images are taken as input and passed through standard feature-extraction backbone networks to obtain LiDAR point cloud features and image features, respectively. Second, a region proposal network with transformer dual-fusion features is designed, which uses a deformable transformer decoder to fuse the extracted LiDAR point cloud features and image features twice, based on a deformable attention mechanism. The dual-domain LiDAR and camera feature information is then aggregated to generate the initial candidate boxes. Next, a feature information enhancement module further refines the boxes: it predicts dense depth features using a depth completion mechanism, and the corresponding dense depth information and semantic feature information are extracted to complete the box refinement. Finally, to align and fuse feature information from different modalities effectively, a multimodal feature cross-attention module (MFCAM) is designed, and a dynamic cross-attention mechanism is applied to capture the correlation between modalities. Experimental results on the KITTI, nuScenes, and Waymo datasets demonstrate the generality and effectiveness of the proposed TFIENet method. Extensive ablation experiments demonstrate the efficiency of each individual module. Experimental results on a real road dataset show that TFIENet is robust in complex real-road environments.
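
The cross-modal attention idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' MFCAM or their deformable decoder: it is plain single-head scaled dot-product cross-attention, in which LiDAR tokens act as queries and image tokens as keys and values. The function name `cross_attention_fuse`, the projection matrices, the tensor sizes, and the choice to concatenate the attended image features onto the LiDAR features are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(lidar_feats, image_feats, w_q, w_k, w_v):
    """Hypothetical single-head cross-attention fusion (illustrative only).

    lidar_feats: (N, C) array, one row per LiDAR token (queries).
    image_feats: (M, C) array, one row per image token (keys and values).
    w_q, w_k, w_v: (C, D) projection matrices (assumed, random in this demo).
    Returns fused features of shape (N, C + D) and attention weights (N, M).
    """
    q = lidar_feats @ w_q                      # (N, D) queries from LiDAR
    k = image_feats @ w_k                      # (M, D) keys from the image
    v = image_feats @ w_v                      # (M, D) values from the image
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (N, M) scaled similarities
    attn = softmax(scores, axis=-1)            # each row sums to 1
    attended = attn @ v                        # (N, D) image context per LiDAR token
    fused = np.concatenate([lidar_feats, attended], axis=-1)
    return fused, attn

# Toy sizes chosen for the demo, not taken from the paper.
rng = np.random.default_rng(0)
N, M, C, D = 5, 8, 16, 16
lidar = rng.normal(size=(N, C))
image = rng.normal(size=(M, C))
w_q, w_k, w_v = (rng.normal(size=(C, D)) * 0.1 for _ in range(3))

fused, attn = cross_attention_fuse(lidar, image, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

Each row of `attn` indicates how strongly one LiDAR token attends to every image token, which is one simple way to express a learned correspondence between modalities without relying on a fixed geometric projection.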

CTAFFNet: CNN–Transformer Adaptive Feature Fusion Object Detection Algorithm for Complex Traffic Scenarios

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Traffic Sign Detection using Feature Fusion and Contextual Information

CTIF-Net: A CNN-Transformer Iterative Fusion Network for Salient Object Detection

CMNet: A Connect-and-Merge Convolutional Neural Network for Fast Vehicle Detection in Urban Traffic Surveillance

An Adaptive Hybrid Attention Based Convolutional Neural Net for Intelligent Transportation Object Recognition

Multifeature Fusion-Based Object Detection for Intelligent Transportation Systems

Multiclass objects detection algorithm using DarkNet-53 and DenseNet for intelligent vehicles

AMFF-Net: An Effective 3D Object Detector Based on Attention and Multi-Scale Feature Fusion

DetectFormer: Category-Assisted Transformer for Traffic Scene Object Detection

MFR-CNN: Incorporating Multi-Scale Features and Global Information for Traffic Object Detection

CAFFNet: Channel Attention and Feature Fusion Network for Multi-target Traffic Sign Detection

An Efficient Hierarchical Convolutional Neural Network for Traffic Object Detection

TFIENet: Transformer Fusion Information Enhancement Network for Multimodel 3-D Object Detection

Transformer-enabled Adaptive Spatial Feature Fusion for Small Traffic Sign Detection

Enhanced Object Detection with Deep Convolutional Neural Networks for Advanced Driving Assistance

NLFFTNet: A non-local feature fusion transformer network for multi-scale object detection

GHAFNet: Global-context hierarchical attention fusion method for traffic object detection

A Small-Scale Object Detection Algorithm in Intelligent Transportation Scenarios

Combining transformer global and local feature extraction for object detection