LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention

Junbo Yin, Jianbing Shen, Chenye Guan, Dingfu Zhou, Ruigang Yang
DOI: https://doi.org/10.48550/arXiv.2004.01389
2020-04-03
Abstract: Existing LiDAR-based 3D object detectors usually focus on single-frame detection, while ignoring the spatiotemporal information in consecutive point cloud frames. In this paper, we propose an end-to-end online 3D video object detector that operates on point cloud sequences. The proposed model comprises a spatial feature encoding component and a spatiotemporal feature aggregation component. In the former component, a novel Pillar Message Passing Network (PMPNet) is proposed to encode each discrete point cloud frame. It adaptively collects information for a pillar node from its neighbors by iterative message passing, which effectively enlarges the receptive field of the pillar feature. In the latter component, we propose an Attentive Spatiotemporal Transformer GRU (AST-GRU) to aggregate the spatiotemporal information, which enhances the conventional ConvGRU with an attentive memory gating mechanism. AST-GRU contains a Spatial Transformer Attention (STA) module and a Temporal Transformer Attention (TTA) module, which can emphasize the foreground objects and align the dynamic objects, respectively. Experimental results demonstrate that the proposed 3D video object detector achieves state-of-the-art performance on the large-scale nuScenes benchmark.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses several key problems in LiDAR-based 3D video object detection:

1. **Limitations of single-frame detection**: Most existing LiDAR-based 3D object detectors operate on a single frame and ignore the spatiotemporal information in consecutive point cloud frames. This hurts performance under occlusion, at long distances, and with non-uniform sampling.
2. **Utilization of spatiotemporal information**: Consecutive point cloud frames contain rich spatiotemporal information that can improve detection performance. How to model the spatial and temporal feature representations of these frames effectively is therefore a key problem in building a 3D video object detector.
3. **Limitations of local features**: Traditional single-frame 3D object detectors usually rely on locally aggregated features. The resulting receptive field is small, so geometric relationships over a larger range are not captured effectively.
4. **Background noise and dynamic object alignment**: In the bird's-eye view, foreground objects (such as cars and pedestrians) occupy only a small area, so background noise accumulates and affects the computation of new memory. Aligning dynamic objects across frames is a further challenge.

To solve these problems, the authors propose an end-to-end online 3D video object detection framework with two components:

- **Spatial feature encoding component**: A new graph-based message passing network, the Pillar Message Passing Network (PMPNet), is introduced. Iterative message passing enlarges the receptive field of each pillar node, so that the rich geometric relationships between different spatial regions are captured better.
- **Spatiotemporal feature aggregation component**: An Attentive Spatiotemporal Transformer GRU (AST-GRU) with an attention mechanism is proposed. Its Spatial Transformer Attention (STA) module emphasizes foreground objects and its Temporal Transformer Attention (TTA) module aligns dynamic objects, so that the spatiotemporal information in consecutive point cloud frames is used more effectively.

Experimental results show that the method achieves state-of-the-art performance on the large-scale nuScenes benchmark.
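
The claim that iterative message passing enlarges the receptive field of a pillar node can be illustrated with a small toy example. The sketch below is not the PMPNet implementation: the k-nearest-neighbor graph, the mean aggregation, the fixed 0.5 mixing weight, and the names `knn_graph`, `message_passing_step`, and `receptive_field` are all assumptions made for illustration, standing in for the learned message and update functions of the paper.

```python
import math


def knn_graph(centers, k):
    """For each node, return the indices of its k nearest other nodes."""
    neighbors = []
    for i, ci in enumerate(centers):
        dists = sorted(
            (math.dist(ci, cj), j) for j, cj in enumerate(centers) if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def message_passing_step(features, neighbors):
    """One round: mix each node's feature with the mean of its neighbors.

    Assumption: a fixed 0.5/0.5 mix replaces any learned update function.
    """
    updated = []
    for i, feat in enumerate(features):
        nbrs = neighbors[i]
        if not nbrs:
            updated.append(list(feat))
            continue
        mean_msg = [
            sum(features[j][d] for j in nbrs) / len(nbrs)
            for d in range(len(feat))
        ]
        updated.append([0.5 * f + 0.5 * m for f, m in zip(feat, mean_msg)])
    return updated


def receptive_field(neighbors, node, steps):
    """Nodes whose input features can reach `node` after `steps` rounds."""
    reached = {node}
    for _ in range(steps):
        reached |= {j for i in reached for j in neighbors[i]}
    return reached


# Toy scene: five pillar centers on a line; only the last pillar carries signal.
centers = [(float(x), 0.0) for x in range(5)]
neighbors = knn_graph(centers, k=2)
features = [[0.0], [0.0], [0.0], [0.0], [1.0]]

for step in range(1, 4):
    features = message_passing_step(features, neighbors)
    field = sorted(receptive_field(neighbors, node=0, steps=step))
    print(step, field, features[0][0])
```

In this toy scene, pillar 0 is connected only to pillars 1 and 2, so its feature stays at zero for the first two rounds and becomes nonzero only in the third, once the signal from pillar 4 has propagated through the graph. Each extra round extends the set of pillars that can influence a node by one hop.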