Abstract: 3D object detection in complex weather and road environments remains challenging, and existing algorithms based on single-frame point cloud data struggle to achieve satisfactory results. These methods typically model spatial relationships within a single frame and overlook the semantic correlations and spatiotemporal continuity between consecutive frames, which leads to discontinuities and abrupt changes in the detection results. To address this issue, this paper proposes a multi-frame 3D object detection algorithm based on a deformable spatiotemporal Transformer. First, a deformable cross-scale Transformer module is designed; its multi-scale offset mechanism samples features non-uniformly at different scales, enhancing the spatial information aggregation capability of the output features. Second, to address feature misalignment during multi-frame feature fusion, a deformable cross-frame Transformer module is proposed. It uses independently learnable offset parameters for the features of each frame, enabling the model to adaptively correlate dynamic features across frames and make better use of temporal information. Third, a proposal-aware sampling algorithm is introduced to significantly increase foreground point recall and improve the efficiency of feature extraction. Finally, the resulting multi-scale and multi-frame voxel features are passed to the proposed mixed voxel set extraction module, which extracts adaptive fusion weights so that the model obtains mixed features containing both spatial and temporal information. The effectiveness of the proposed algorithm is validated on the KITTI and nuScenes datasets and on self-collected urban datasets. The proposed algorithm improves average precision by 2.1% over the latest multi-frame algorithms.
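
The idea of per-frame offsets combined with adaptive fusion weights can be illustrated with a minimal one-dimensional sketch. This is not the paper's implementation: the function names (`sample_linear`, `deformable_aggregate`), the use of linear interpolation, and the softmax weighting are all assumptions chosen for illustration, and the offsets and scores are fixed numbers here, whereas in the proposed model they would be predicted by learned layers.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_linear(features, pos):
    """Linearly interpolate a 1D feature sequence at a fractional position.

    The position is clamped to the valid index range.
    """
    pos = min(max(pos, 0.0), float(len(features) - 1))
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(features) - 1)
    frac = pos - lo
    return (1.0 - frac) * features[lo] + frac * features[hi]


def deformable_aggregate(frames, ref, offsets, scores):
    """Sample each frame at ref + its own offset, then fuse the samples.

    frames  : list of 1D feature sequences, one per frame (hypothetical layout)
    ref     : reference position shared by all frames
    offsets : one offset per frame (assumed to be learned in a real model)
    scores  : one fusion score per frame (assumed to be learned in a real model)
    """
    weights = softmax(scores)
    samples = [sample_linear(f, ref + o) for f, o in zip(frames, offsets)]
    return sum(w * s for w, s in zip(weights, samples))


# Two frames with identical features; the second is sampled half a step later.
frames = [[0.0, 10.0, 20.0, 30.0], [0.0, 10.0, 20.0, 30.0]]
fused = deformable_aggregate(frames, ref=1.0, offsets=[0.0, 0.5], scores=[0.0, 0.0])
print(fused)
```

With equal scores the two samples (10.0 and 15.0) are averaged. Giving each frame its own offset is what lets the sampling location follow an object that has moved between frames, which is the misalignment problem the cross-frame module is described as addressing.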

Deformable Template Network (DTN) for Object Detection

DSN: A New Deformable Subnetwork for Object Detection

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

TSO-DETR: A Network for Small Object Detection of Cervical Cells in TCT Smear

DTM: Deformable Template Matching

Learning a hierarchical deformable template for rapid deformable object parsing

Deformable DETR: Deformable Transformers for End-to-End Object Detection

DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field

Feature Transform Correlation Network for Object Detection

DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection

An improved object detection algorithm based on multi-scaled and deformable convolutional neural networks

DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection

Cross stage partial connections based weighted Bi-directional feature pyramid and enhanced spatial transformation network for robust object detection

DVST: Deformable Voxel Set Transformer for 3D Object Detection from Point Clouds

Unsupervised Template-assisted Point Cloud Shape Correspondence Network

Deep Deformation Network for Object Landmark Localization

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

Object Tracking Algorithm Based on Integrated Multi-Scale Templates Guided by Judgment Mechanism

DPT: Deformable Patch-based Transformer for Visual Recognition

Deformable ConvNet with Aspect Ratio Constrained NMS for Object Detection in Remote Sensing Imagery