Abstract: Few modern 3D object detectors achieve fast inference and high accuracy at the same time. To achieve high performance, they usually either operate directly on raw point clouds or convert point clouds to a 3D representation and then apply 3D convolutions. However, these methods incur sizable computational overhead and rely on complex operations. High-speed 3D detectors based on 2D representations, by contrast, remain limited in performance. In this paper, we investigate how to leverage context knowledge to strengthen the 2D representation of point clouds for computation- and memory-efficient 3D object detection with state-of-the-art performance. The proposed encoder has two parts: a context-sensitive point sampling network and a point set learning network. Specifically, the point sampling network samples points with dense localization information. With high-quality sampled points, we can use a deeper point set learning network to aggregate semantic details at low cost. The proposed encoder is lightweight and well suited to hardware acceleration frameworks such as TensorRT and TVM. Extensive experiments on the KITTI benchmark show that the proposed encoder, called PointCSE, outperforms prior real-time encoders by a large margin with a 1.5× memory reduction; it also achieves state-of-the-art performance at an inference speed of 49 FPS (a 4× speedup on average compared to previous best methods).

PointCSE: Context-sensitive encoders for efficient 3D object detection from point cloud