OcTr: Octree-based Transformer for 3D Object Detection

Chao Zhou, Yanan Zhang, Jiaxin Chen, Di Huang

2023-03-22

Abstract: A key challenge for LiDAR-based 3D object detection is to capture sufficient features from large scale 3D scenes especially for distant or/and occluded objects. Albeit recent efforts made by Transformers with the long sequence modeling capability, they fail to properly balance the accuracy and efficiency, suffering from inadequate receptive fields or coarse-grained holistic correlations. In this paper, we propose an Octree-based Transformer, named OcTr, to address this issue. It first constructs a dynamic octree on the hierarchical feature pyramid through conducting self-attention on the top level and then recursively propagates to the level below restricted by the octants, which captures rich global context in a coarse-to-fine manner while maintaining the computational complexity under control. Furthermore, for enhanced foreground perception, we propose a hybrid positional embedding, composed of the semantic-aware positional embedding and attention mask, to fully exploit semantic and geometry clues. Extensive experiments are conducted on the Waymo Open Dataset and KITTI Dataset, and OcTr reaches newly state-of-the-art results.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper asks how a LiDAR-based 3D object detector can capture sufficient features from large-scale 3D scenes, especially for distant or occluded objects. Recent work has used Transformers for their long-sequence modeling capability, but these methods have not struck a good balance between accuracy and efficiency, and they suffer from either insufficient receptive fields or coarse-grained global correlations. Specifically, the paper points out:

- Current methods struggle to deliver both accuracy and efficiency on large-scale 3D scenes.
- Common Transformer methods can model long-range dependencies, but their computational complexity is high on large-scale 3D scenes, which degrades performance.
- Existing sparse attention mechanisms reduce the computational load, but they usually only enlarge the local receptive field and cannot effectively capture global context.

To address these issues, the authors propose an Octree-based Transformer (OcTr), which improves 3D object detection through two designs:

1. **Dynamic Octree Construction**: Self-attention is performed on the top level of a hierarchical feature pyramid and then recursively propagated to the level below, restricted by the octants. This captures rich global context in a coarse-to-fine manner while keeping computational complexity under control.
2. **Hybrid Positional Embedding**: A semantic-aware positional embedding is combined with an attention mask to exploit both geometric and semantic cues, enhancing foreground perception.

With these designs, OcTr can efficiently capture global context in large-scale 3D scenes, improving both detection accuracy and efficiency. Experimental results show that OcTr achieves new state-of-the-art results on the Waymo Open Dataset and the KITTI Dataset, and performs particularly well on distant objects.
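The coarse-to-fine idea described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes only two pyramid levels stored as dense grids (a coarse grid of side `g` and a fine grid of side `2g`), uses raw features as queries and keys without learned projections, and selects octants with a plain top-k over attention scores. The names `children_of`, `coarse_to_fine_attention`, and `top_k` are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def children_of(coarse_idx, g):
    """Flat fine-grid indices of the 8 children of each flat coarse index.

    The coarse grid has shape (g, g, g); the fine grid has shape (2g, 2g, 2g).
    Returns an array of shape coarse_idx.shape + (8,).
    """
    z, y, x = np.unravel_index(coarse_idx, (g, g, g))
    offs = np.array([(dz, dy, dx) for dz in (0, 1) for dy in (0, 1) for dx in (0, 1)])
    fz = 2 * z[..., None] + offs[:, 0]
    fy = 2 * y[..., None] + offs[:, 1]
    fx = 2 * x[..., None] + offs[:, 2]
    return np.ravel_multi_index((fz, fy, fx), (2 * g, 2 * g, 2 * g))


def coarse_to_fine_attention(queries, coarse_feats, fine_feats, g, top_k):
    """Two-level octree-restricted attention (illustrative assumption, not OcTr itself).

    queries:      (Q, C) query features
    coarse_feats: (g**3, C) top-level features
    fine_feats:   ((2g)**3, C) features one level below
    Returns the attended features (Q, C) and the fine indices each query used.
    """
    c = queries.shape[1]
    # Step 1: score every query against all cells of the coarse (top) level.
    coarse_scores = queries @ coarse_feats.T / np.sqrt(c)
    # Step 2: keep the top_k highest-scoring octants for each query.
    top = np.argpartition(-coarse_scores, top_k - 1, axis=1)[:, :top_k]
    # Step 3: attend only to the children of the selected octants.
    child = children_of(top, g).reshape(len(queries), -1)  # (Q, top_k * 8)
    keys = fine_feats[child]                                # (Q, top_k * 8, C)
    scores = np.einsum("qc,qkc->qk", queries, keys) / np.sqrt(c)
    weights = softmax(scores, axis=1)
    return np.einsum("qk,qkc->qc", weights, keys), child
```

Under these assumptions each query scores `g**3` coarse cells plus `8 * top_k` fine cells, instead of all `(2g)**3` fine cells. Setting `top_k = g**3` keeps every octant, in which case the result coincides with dense attention over the fine level.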

SEFormer: Structure Embedding Transformer for 3D Object Detection

A Transformer-Based Object Detector with Coarse-Fine Crossing Representations

Anchor-Based Transformer for Temporal LiDAR 3D Object Detection

Object Detection of Occlusion Point Cloud based on Transformer

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

TSSTDet: Transformation-Based 3-D Object Detection via a Spatial Shape Transformer

DS-Trans: A 3D Object Detection Method Based on a Deformable Spatiotemporal Transformer for Autonomous Vehicles

Li3DeTr: A LiDAR based 3D Detection Transformer

OctFormer: Octree-based Transformers for 3D Point Clouds

Embracing Single Stride 3D Object Detector with Sparse Transformer

MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

CenterFormer: Center-based Transformer for 3D Object Detection

DVST: Deformable Voxel Set Transformer for 3D Object Detection from Point Clouds

HCT-Det: a Hybrid CNN-transformer Architecture for 3D Object Detection from Point Clouds

OCTraN: 3D Occupancy Convolutional Transformer Network in Unstructured Traffic Scenarios

Introducing Depth into Transformer-based 3D Object Detection

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection