Abstract: 3D object detection in complex weather and road environments remains a significant challenge, and existing algorithms based on single-frame point cloud data struggle to achieve satisfactory results. These methods typically model spatial relationships within a single frame and overlook the semantic correlations and spatiotemporal continuity between consecutive frames, which leads to discontinuities and abrupt changes in the detection results. To address this issue, this paper proposes a multi-frame 3D object detection algorithm based on a deformable spatiotemporal Transformer. First, a deformable cross-scale Transformer module is designed. It incorporates a multi-scale offset mechanism that samples features non-uniformly at different scales, strengthening the spatial information aggregation of the output features. Second, to address feature misalignment during multi-frame feature fusion, a deformable cross-frame Transformer module is proposed. This module learns independent offset parameters for the features of each frame, enabling the model to adaptively correlate dynamic features across multiple frames and make better use of temporal information. Third, a proposal-aware sampling algorithm is introduced to significantly increase foreground point recall, further improving the efficiency of feature extraction. Finally, the resulting multi-scale and multi-frame voxel features are passed to the proposed mixed voxel set extraction module, which extracts adaptive fusion weights so that the model obtains mixed features containing both spatial and temporal information. The effectiveness of the proposed algorithm is validated on the KITTI, nuScenes, and self-collected urban datasets. The proposed algorithm improves average precision by 2.1% over the latest multi-frame-based algorithms.
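The cross-frame idea described in the abstract (each frame sampled at its own offset, then fused with adaptive weights) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the 2D scalar feature grids, and the hand-supplied `offsets` and `scores` are all hypothetical stand-ins for quantities that the described model would learn over voxel features.

```python
import math

def bilinear_sample(grid, x, y):
    """Bilinearly interpolate a 2D grid (list of rows) at fractional (x, y).

    Coordinates are clamped to the grid so out-of-range offsets stay valid.
    """
    h, w = len(grid), len(grid[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_cross_frame_fuse(frames, ref, offsets, scores):
    """Sample every frame at ref + its own offset, then fuse the samples.

    frames  : list of 2D feature grids, one per frame (assumed layout)
    ref     : (x, y) reference location shared by all frames
    offsets : one (dx, dy) per frame; learned in the paper, given here
    scores  : one fusion logit per frame; learned in the paper, given here
    """
    samples = [
        bilinear_sample(frame, ref[0] + dx, ref[1] + dy)
        for frame, (dx, dy) in zip(frames, offsets)
    ]
    weights = softmax(scores)
    fused = sum(w * s for w, s in zip(weights, samples))
    return fused, samples, weights

# Hypothetical toy input: two frames, each with its own sampling offset.
frames = [[[0.0, 1.0], [2.0, 3.0]], [[10.0, 10.0], [10.0, 10.0]]]
fused, samples, weights = deformable_cross_frame_fuse(
    frames, ref=(0.0, 0.0), offsets=[(0.5, 0.5), (1.0, 0.0)], scores=[0.0, 0.0]
)
```

Because each frame has its own offset, a moving object can be sampled at a different location in each frame before fusion, which is the misalignment problem the abstract describes. With equal scores the fusion reduces to a plain average of the per-frame samples.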

STFormer3D: Spatio-Temporal Transformer Based 3D Object Detection for Intelligent Driving

SEFormer: Structure Embedding Transformer for 3D Object Detection

ObjectFusion: an Object Detection and Segmentation Framework with RGB-D SLAM and Convolutional Neural Networks

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

Anchor-Based Transformer for Temporal LiDAR 3D Object Detection

Multi-Scale Spatial Transformer Network for LiDAR-Camera 3D Object Detection

Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving

Improving 3D Object Detection with Channel-wise Transformer

Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds

SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds

F-Transformer: Point Cloud Fusion Transformer for Cooperative 3D Object Detection

Transformer-Based Optimized Multimodal Fusion for 3D Object Detection in Autonomous Driving

LiDAR-based 3D Video Object Detection with Foreground Context Modeling and Spatiotemporal Graph Reasoning

Region-proposal Convolutional Network-driven Point Cloud Voxelization and Over-segmentation for 3D Object Detection

F-PVNet: Frustum-Level 3-D Object Detection on Point–Voxel Feature Representation for Autonomous Driving

TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention

CenterFormer: Center-based Transformer for 3D Object Detection

FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion

DETR4D: Direct Multi-View 3D Object Detection with Sparse Attention