Abstract: In feature-level data fusion for 3-D object detection, misalignment between modalities destroys the correlation between their data, which leads to inaccurate localization of small, distant targets. To address this problem, a transformer fusion information enhancement network (TFIENet) is proposed. First, the raw point cloud and color images are taken as input and passed through standard feature-extraction backbone networks to obtain LiDAR point cloud features and image features, respectively. Second, a region proposal network based on transformer dual-fusion features is designed; it uses a deformable transformer decoder, built on a deformable attention mechanism, to perform dual fusion of the extracted LiDAR point cloud features and image features. The dual-domain LiDAR-camera feature information is then aggregated to generate the initial candidate boxes. Next, a feature information enhancement module further refines the boxes: it predicts dense depth features using a depth completion mechanism, and the corresponding dense depth information and semantic feature information are extracted to complete the box refinement. Finally, to align and fuse feature information from different modalities effectively, a multimodal feature cross-attention module (MFCAM) is designed, in which a dynamic cross-attention mechanism captures the correlation between modalities. Experimental results on the KITTI, nuScenes, and Waymo datasets demonstrate the generality and effectiveness of the proposed TFIENet method. Extensive ablation experiments demonstrate the effectiveness of each individual module. Experimental results on a real road dataset show that TFIENet is robust in complex real-world road environments.
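To make the idea of cross-attention between modalities concrete, the following is a minimal sketch of plain scaled dot-product cross-attention, in which LiDAR feature vectors act as queries and image feature vectors act as keys and values. It is not the paper's MFCAM or its deformable decoder: there are no learned projections, no deformable sampling, and no dynamic weighting, and the names `cross_attention`, `lidar_feats`, and `image_feats`, as well as the toy feature sizes, are assumptions made only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (illustrative, no learned weights).

    queries: list of d-dim vectors from one modality (e.g. LiDAR features).
    keys, values: lists of vectors from the other modality (e.g. image features).
    Returns one fused vector per query and the attention weights per query.
    """
    d = len(queries[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, channel by channel.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        fused.append(out)
        weights.append(w)
    return fused, weights


# Hypothetical toy features: 3 LiDAR tokens and 5 image tokens, 4 channels each.
rng = random.Random(0)
lidar_feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
image_feats = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]

# Each LiDAR token gathers image information weighted by feature similarity.
fused, weights = cross_attention(lidar_feats, image_feats, image_feats)
```

Swapping the arguments (image features as queries, LiDAR features as keys and values) gives the opposite fusion direction; a two-way scheme of this kind is one possible reading of "dual fusion", though the abstract does not specify the details.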

MMFENet: Multi-Modal Feature Enhancement Network with Transformer for Human-Object Interaction Detection

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Multi-Scale Human-Object Interaction Detector

Human–object interaction detection based on disentangled axial attention transformer

Adaptive multimodal prompt for human-object interaction with local feature enhanced transformer

Focus and Adjust: Progressive Refinement Network for Human Object Interaction Detection

Parallel disentangling network for human–object interaction detection

Human-object interaction detection based on cascade multi-scale transformer

Pairwise CNN-Transformer Features for Human–Object Interaction Detection

Geometric Features Enhanced Human-Object Interaction Detection

Human-Object Interaction detection via Global Context and Pairwise-level Fusion Features Integration

A Novel Part Refinement Tandem Transformer for Human-Object Interaction Detection

M2FNet: Multi-modal Fusion Network for Object Detection from Visible and Thermal Infrared Images

Detecting Human-Object Interaction with Multi-Level Pairwise Feature Network

TFIENet: Transformer Fusion Information Enhancement Network for Multi-Model 3D Object Detection

Transformer fusion and histogram layer multispectral pedestrian detection network

Human-Object Interaction Detection via Disentangled Transformer

TFIENet: Transformer Fusion Information Enhancement Network for Multimodel 3-D Object Detection

Category-Aware Transformer Network for Better Human-Object Interaction Detection

Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection