Abstract: 3D object detection in complex weather conditions and road environments remains a significant challenge, and existing algorithms based on single-frame point cloud data struggle to achieve satisfactory results. These methods typically model spatial relationships within a single frame and overlook the semantic correlations and spatiotemporal continuity between consecutive frames, which leads to discontinuities and abrupt changes in the detection results. To address this issue, this paper proposes a multi-frame 3D object detection algorithm based on a deformable spatiotemporal Transformer. First, a deformable cross-scale Transformer module is designed. It incorporates a multi-scale offset mechanism that samples features non-uniformly at different scales, which strengthens the spatial information aggregation of the output features. Second, to address feature misalignment during multi-frame feature fusion, a deformable cross-frame Transformer module is proposed. This module learns independent offset parameters for the features of each frame, so the model can adaptively associate dynamic features across frames and make better use of temporal information. Third, a proposal-aware sampling algorithm is introduced to significantly increase foreground point recall and improve the efficiency of feature extraction. Finally, the resulting multi-scale and multi-frame voxel features are passed to an adaptive fusion weight extraction module, termed the mixed voxel set extraction module, which lets the model adaptively obtain mixed features containing both spatial and temporal information. The effectiveness of the proposed algorithm is validated on the KITTI and nuScenes datasets and on a self-collected urban dataset. The proposed algorithm improves average precision by 2.1% over the latest multi-frame algorithms.
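The cross-frame idea described in the abstract (a separate offset per frame, followed by adaptive fusion weights) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `bilinear_sample` and `fuse_across_frames` are hypothetical, the feature maps are 2D rather than voxel grids, and the offsets and fusion logits are passed in as plain arguments, whereas in a trained model they would be predicted by learned layers.

```python
import numpy as np


def bilinear_sample(feat, y, x):
    """Bilinearly sample a (H, W, C) feature map at a fractional location (y, x)."""
    h, w, _ = feat.shape
    # Clamp so the 2x2 neighbourhood stays inside the map.
    y = float(np.clip(y, 0, h - 1))
    x = float(np.clip(x, 0, w - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0, x0] + wx * feat[y0, x1]
    bottom = (1 - wx) * feat[y1, x0] + wx * feat[y1, x1]
    return (1 - wy) * top + wy * bottom


def softmax(z):
    """Numerically stable softmax over a 1D array."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def fuse_across_frames(frames, ref, offsets, logits):
    """Sample each frame at ref + its own offset, then fuse with softmax weights.

    frames:  list of T feature maps, each of shape (H, W, C)
    ref:     (y, x) reference location in the current frame
    offsets: T pairs (dy, dx), one per frame (assumed given here, learned in practice)
    logits:  T unnormalised fusion scores (assumed given here, learned in practice)
    """
    weights = softmax(logits)
    sampled = [
        bilinear_sample(f, ref[0] + dy, ref[1] + dx)
        for f, (dy, dx) in zip(frames, offsets)
    ]
    fused = sum(w * s for w, s in zip(weights, sampled))
    return fused, weights


if __name__ == "__main__":
    # Toy example: an object response moves one cell to the right between frames.
    f0 = np.zeros((4, 4, 1)); f0[1, 1, 0] = 1.0
    f1 = np.zeros((4, 4, 1)); f1[1, 2, 0] = 1.0
    aligned, _ = fuse_across_frames([f0, f1], (1.0, 1.0), [(0.0, 0.0), (0.0, 1.0)], [0.0, 0.0])
    naive, _ = fuse_across_frames([f0, f1], (1.0, 1.0), [(0.0, 0.0), (0.0, 0.0)], [0.0, 0.0])
    print(aligned, naive)
```

In the toy example, the per-frame offset `(0, 1)` follows the moving response, so the fused feature keeps its full strength of 1.0, while fusing at the same fixed location in both frames dilutes it to 0.5. This is the misalignment problem that per-frame offsets are meant to reduce.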

MuTrans: Multiple Transformers for Fusing Feature Pyramid on 2D and 3D Object Detection

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

DETR++: Taming Your Multi-Scale Detection Transformer

Transformer-Based Optimized Multimodal Fusion for 3D Object Detection in Autonomous Driving

Multi-scale Cross-Modal Transformer Network for RGB-D Object Detection

TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers

Multimodal Transformer for Automatic 3D Annotation and Object Detection

MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

A Novel Pyramid Network with Feature Fusion and Disentanglement for Object Detection

CNN-transformer mixed model for object detection

Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection

Transformed Dynamic Feature Pyramid for Small Object Detection

Multi-Source Features Fusion Single Stage 3D Object Detection with Transformer

Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

MM-FPN: Multi-path and Multi-scale Feature Pyramid Network for Object Detection

MCANet: Hierarchical cross-fusion lightweight transformer based on multi-ConvHead attention for object detection

Multi-scale Feature Fusion with Point Pyramid for 3D Object Detection

Multiple-in-Single-out Object Detector Leveraging Spiking Neural Membrane Systems and Multiple Transformers

3D Object Detection Based on Attention and Multi-Scale Feature Fusion