Abstract: Existing LiDAR-based 3D object detectors usually focus on the single-frame detection, while ignoring the spatiotemporal information in consecutive point cloud frames. In this paper, we propose an end-to-end online 3D video object detector that operates on point cloud sequences. The proposed model comprises a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel Pillar Message Passing Network (PMPNet) is proposed to encode each discrete point cloud frame. It adaptively collects information for a pillar node from its neighbors by iterative message passing, which effectively enlarges the receptive field of the pillar feature. In the latter component, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) to aggregate the spatiotemporal information, which enhances the conventional ConvGRU with an attentive memory gating mechanism. AST-GRU contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module, which can emphasize the foreground objects and align the dynamic objects, respectively. Experimental results demonstrate that the proposed 3D video object detector achieves state-of-the-art performance on the large-scale nuScenes benchmark.
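The claim that iterative message passing enlarges a pillar's receptive field can be illustrated with a toy sketch. This is not PMPNet itself: the 4-connected grid, the scalar features, the plain neighbor averaging, and the fixed 0.5/0.5 mixing are all assumptions made for illustration, and every function name here is hypothetical. The sketch only shows that after k rounds, information from one pillar has reached pillars up to k hops away.

```python
def grid_neighbors(r, c, rows, cols):
    """Return the 4-connected neighbors of cell (r, c) that lie inside the grid."""
    out = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            out.append((nr, nc))
    return out


def message_passing_step(features):
    """One round: each pillar averages its neighbors' features (the message)
    and mixes that message with its own state (assumed 0.5/0.5 mix)."""
    rows, cols = len(features), len(features[0])
    updated = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nbrs = grid_neighbors(r, c, rows, cols)
            message = sum(features[nr][nc] for nr, nc in nbrs) / len(nbrs) if nbrs else 0.0
            updated[r][c] = 0.5 * features[r][c] + 0.5 * message
    return updated


def run_message_passing(features, num_iterations):
    """Apply the message passing step num_iterations times."""
    for _ in range(num_iterations):
        features = message_passing_step(features)
    return features


def influenced_cells(features):
    """Set of grid cells whose feature is nonzero."""
    return {(r, c) for r, row in enumerate(features) for c, v in enumerate(row) if v != 0.0}


# A 5x5 pillar grid with a single nonzero pillar at the center.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 1.0

after_one = run_message_passing(grid, 1)
after_two = run_message_passing(grid, 2)
print(len(influenced_cells(after_one)), len(influenced_cells(after_two)))  # prints: 5 13
```

One round reaches the center and its 4 direct neighbors; two rounds reach all 13 cells within two hops, while the corners of the grid are still untouched.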

What problem does this paper attempt to address?

This paper addresses several key problems in LiDAR-based 3D video object detection:

1. **Limitations of single-frame detection**: Most existing LiDAR-based 3D object detection methods focus on single-frame detection and ignore the spatiotemporal information in consecutive point cloud frames. This leads to poor performance under occlusion, at long distances, and with non-uniform sampling.
2. **Utilization of spatiotemporal information**: Consecutive point cloud frames contain rich spatiotemporal information that can be used to improve detection performance. How to effectively model the spatial and temporal feature representations of these frames is therefore a key problem in building 3D video object detectors.
3. **Limitations of local features**: Traditional single-frame 3D object detection methods usually rely on locally aggregated features, which gives them a small receptive field and prevents them from capturing geometric relationships over a larger range.
4. **Background noise and dynamic object alignment**: In the bird's-eye view, foreground objects (such as cars and pedestrians) occupy only a small area, so background noise accumulates and affects the computation of new memories. Aligning dynamic objects across frames is a further challenge.

To solve these problems, the authors propose an end-to-end online 3D video object detection framework with two main components:

- **Spatial feature encoding component**: A new graph message passing network, the Pillar Message Passing Network (PMPNet), is introduced. Iterative message passing enlarges the receptive field of each pillar node so that it better captures the geometric relationships between different spatial regions.
- **Spatiotemporal feature aggregation component**: An Attentive Spatiotemporal Transformer GRU (AST-GRU) is proposed. Its Spatial Transformer Attention (STA) module emphasizes foreground objects and its Temporal Transformer Attention (TTA) module aligns dynamic objects, so that the spatiotemporal information in consecutive point cloud frames is used more effectively.

Experimental results show that the method achieves state-of-the-art performance on the large-scale nuScenes benchmark.
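The background-noise problem and the idea of attending to the input before the memory update can be sketched in a few lines. This is a deliberately simplified stand-in, not AST-GRU: it uses the standard GRU gate equations on scalar per-cell features, replaces the transformer attention with a plain softmax over cell magnitudes, omits temporal alignment entirely, and sets all weights to 1.0. All names and values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame):
    """Reweight each cell by a softmax over cell magnitudes (assumed attention),
    so weak background responses are suppressed relative to strong ones."""
    weights = softmax([abs(x) for x in frame])
    return [w * x for w, x in zip(weights, frame)]


def gru_update(x, h):
    """Standard GRU update for scalar input x and memory h (all weights 1.0)."""
    z = sigmoid(x + h)                   # update gate
    r = sigmoid(x + h)                   # reset gate
    h_candidate = math.tanh(x + r * h)   # candidate memory
    return (1.0 - z) * h + z * h_candidate


def run(frames, use_attention):
    """Fold a sequence of frames into a per-cell memory."""
    memory = [0.0] * len(frames[0])
    for frame in frames:
        inputs = attend(frame) if use_attention else frame
        memory = [gru_update(x, h) for x, h in zip(inputs, memory)]
    return memory


# Two frames: cell 2 is a strong "foreground" response, the rest is weak noise.
frames = [[0.1, 0.1, 2.0, 0.1], [0.1, 0.1, 2.0, 0.1]]
attentive_memory = run(frames, use_attention=True)
plain_memory = run(frames, use_attention=False)
print(attentive_memory, plain_memory)
```

In this toy setting the background cells accumulate less memory when the input is attended first, while the foreground cell still dominates, which is the qualitative effect the summary attributes to emphasizing foreground objects.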

LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention

Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds

LiDAR-based 3D Video Object Detection with Foreground Context Modeling and Spatiotemporal Graph Reasoning

Anchor-Based Transformer for Temporal LiDAR 3D Object Detection

SPV-SSD: An Anchor-Free 3D Single-Stage Detector with Supervised-PointRendering and Visibility Representation

Dynamic graph transformer for 3D object detection

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

Multi-Scale Spatial Transformer Network for LiDAR-Camera 3D Object Detection

AGO-Net: Association-Guided 3D Point Cloud Object Detection Network

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

MSIT-Det: Multi-Scale Feature Aggregation with Iterative Transformer Networks for 3D Object Detection

Future Does Matter: Boosting 3D Object Detection with Temporal Motion Estimation in Point Cloud Sequences

An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds

STFormer3D: Spatio-Temporal Transformer Based 3D Object Detection for Intelligent Driving

HCT-Det: a Hybrid CNN-transformer Architecture for 3D Object Detection from Point Clouds

SP-Net: A Sparse Convolution and Point-Encoding Enhanced Network for 3D Object Detection in LiDAR Point Clouds

Long-short Range Adaptive Transformer with Dynamic Sampling for 3D Object Detection

Temporal-Channel Transformer for 3D Lidar-Based Video Object Detection in Autonomous Driving

SparseDet: A Simple and Effective Framework for Fully Sparse LiDAR-based 3D Object Detection

Two-stage 3D Object Detection Guided by Position Encoding

LiDAR-Based 3D Temporal Object Detection via Motion-Aware LiDAR Feature Fusion