Abstract: 3D object detection under complex weather conditions and road environments remains a significant challenge, and existing algorithms based on single-frame point cloud data struggle to achieve satisfactory results. These methods typically focus on spatial relationships within a single frame and overlook the semantic correlations and spatiotemporal continuity between consecutive frames, which leads to discontinuities and abrupt changes in the detection results. To address this issue, this paper proposes a multi-frame 3D object detection algorithm based on a deformable spatiotemporal Transformer. First, a deformable cross-scale Transformer module is designed; its multi-scale offset mechanism samples features non-uniformly at different scales, strengthening the spatial information aggregation of the output features. Second, to address feature misalignment during multi-frame feature fusion, a deformable cross-frame Transformer module is proposed. This module learns independent offset parameters for the features of each frame, enabling the model to adaptively correlate dynamic features across frames and make better use of temporal information. Third, a proposal-aware sampling algorithm is introduced to significantly increase foreground point recall and thereby improve the efficiency of feature extraction. Finally, the resulting multi-scale and multi-frame voxel features are passed to the proposed mixed voxel set extraction module, which extracts adaptive fusion weights so that the model obtains mixed features containing both spatial and temporal information. The effectiveness of the proposed algorithm is validated on KITTI, nuScenes, and self-collected urban datasets. The proposed algorithm improves average precision by 2.1% over the latest multi-frame algorithms.
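The cross-frame idea in the abstract (a separate sampling offset per frame, followed by adaptively weighted fusion) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes dense 2D grids of scalar features instead of sparse 3D voxel features, uses fixed offsets and scores where a real model would learn them, and every function name below is hypothetical.

```python
import math


def bilinear_sample(grid, x, y):
    """Sample a 2D grid (list of rows) at fractional (x, y), clamped to the border."""
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
    bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def softmax(scores):
    """Turn raw per-frame scores into fusion weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_sample(frames, ref_xy, frame_offsets, frame_scores):
    """Fuse the feature at one reference location across several frames.

    frames:        list of 2D grids, one per frame (scalar features for brevity)
    ref_xy:        (x, y) reference location in the current frame
    frame_offsets: one (dx, dy) per frame; learned in a real model, fixed here
    frame_scores:  one raw score per frame; learned in a real model, fixed here
    """
    weights = softmax(frame_scores)
    rx, ry = ref_xy
    fused = 0.0
    for grid, (dx, dy), weight in zip(frames, frame_offsets, weights):
        fused += weight * bilinear_sample(grid, rx + dx, ry + dy)
    return fused


# Toy scene: an object response moves one cell to the right between frames.
prev_frame = [[0.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
curr_frame = [[0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
frames = [prev_frame, curr_frame]

# Without offsets, the previous frame is sampled at the wrong place.
naive = cross_frame_sample(frames, (2, 1), [(0, 0), (0, 0)], [0.0, 0.0])
# A per-frame offset of (-1, 0) for the previous frame realigns the object.
aligned = cross_frame_sample(frames, (2, 1), [(-1, 0), (0, 0)], [0.0, 0.0])
```

In this toy case the misaligned fusion dilutes the object response, while the per-frame offset recovers it, which is the misalignment problem the deformable cross-frame module is described as addressing.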

Volumetric Spatial Transformer Network for Object Recognition

OVPT: Optimal Viewset Pooling Transformer for 3D Object Recognition

Spatial Transformer for 3D Point Clouds

Multi-view 3D Reconstruction with Transformer

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

SparseVoxNet: 3-D Object Recognition With Sparsely Aggregation of 3-D Dense Blocks

VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation

TMVNet: Using Transformers for Multi-view Voxel-based 3D Reconstruction

Hybrid CNN-Transformer Features for Visual Place Recognition

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

Complementary spatial transformer network for real-time 3D object recognition

CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance

3D Former: Monocular Scene Reconstruction with 3D SDF Transformers

M&M3D: Multi-Dataset Training and Efficient Network for Multi-view 3D Object Detection

MsSVT++: Mixed-scale Sparse Voxel Transformer with Center Voting for 3D Object Detection

CenterFormer: Center-based Transformer for 3D Object Detection

VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion

3D Morphable Models as Spatial Transformer Networks

LVNet: A lightweight volumetric convolutional neural network for real-time and high-performance recognition of 3D objects

PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection