Triangulation Learning Network: from Monocular to Stereo 3D Object Detection

Zengyi Qin, Jinglu Wang, Yan Lu
DOI: https://doi.org/10.48550/arXiv.1906.01193
2019-06-04
Abstract: In this paper, we study the problem of 3D object detection from stereo images, in which the key challenge is how to effectively utilize stereo information. Different from previous methods using pixel-level depth maps, we propose employing 3D anchors to explicitly construct object-level correspondences between the regions of interest in stereo images, from which the deep neural network learns to detect and triangulate the targeted object in 3D space. We also introduce a cost-efficient channel reweighting strategy that enhances representational features and weakens noisy signals to facilitate the learning process. All of these are flexibly integrated into a solid baseline detector that uses monocular images. We demonstrate that both the monocular baseline and the stereo triangulation learning network outperform the prior state of the art in 3D object detection and localization on the challenging KITTI dataset.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a key challenge in 3D object detection from stereo images: how to use stereo information effectively. It proposes the Triangulation Learning Network (TLNet), which uses 3D anchors to explicitly construct object-level correspondences between regions of interest in the left and right images, so that the network learns to detect and triangulate the target object in 3D space. Unlike previous methods that rely on pixel-level depth maps, this approach avoids computing computationally intensive pixel-level disparity maps. The paper also introduces a cost-efficient channel reweighting strategy that enhances representational features and weakens noisy signals to facilitate learning. The main contributions are:
1. A 3D detection baseline that uses only monocular images, with performance comparable to that of state-of-the-art stereo methods.
2. TLNet, which exploits the geometric correlation between stereo images to localize 3D objects accurately and significantly improves on the baseline's 3D detection and localization performance on the challenging KITTI dataset.
3. A feature reweighting strategy that strengthens informative feature channels by measuring left-right consistency, so that the network focuses on the key parts of an object, which benefits triangulation learning.
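The two ideas summarized above, projecting a 3D anchor into both stereo views and reweighting feature channels by left-right consistency, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an ideal rectified pinhole stereo pair with made-up intrinsics, an axis-aligned anchor, and per-channel cosine similarity as the consistency measure (the summary does not say which measure is used). All function names and constants are hypothetical.

```python
import math
from itertools import product

def project_point(point, focal, cx, cy, camera_x=0.0):
    """Pinhole projection of a 3D point given in left-camera coordinates.

    camera_x is the horizontal position of the camera: 0 for the left
    camera, the stereo baseline for the right one (assumed rectified pair).
    """
    x, y, z = point
    u = focal * (x - camera_x) / z + cx
    v = focal * y / z + cy
    return u, v

def project_anchor(center, size, focal, cx, cy, camera_x=0.0):
    """2D box (u_min, v_min, u_max, v_max) enclosing a projected 3D anchor."""
    x0, y0, z0 = center
    w, h, l = size
    # Eight corners of an axis-aligned box (anchor rotation ignored here).
    corners = [
        (x0 + sx * w / 2, y0 + sy * h / 2, z0 + sz * l / 2)
        for sx, sy, sz in product((-1, 1), repeat=3)
    ]
    pts = [project_point(c, focal, cx, cy, camera_x) for c in corners]
    us = [p[0] for p in pts]
    vs = [p[1] for p in pts]
    return min(us), min(vs), max(us), max(vs)

def channel_coherence(left, right, eps=1e-8):
    """Per-channel cosine similarity between left and right RoI features.

    Each argument is a list of channels; each channel is a flat list of
    floats. Negative similarities are clamped to 0 (an assumption).
    """
    scores = []
    for a, b in zip(left, right):
        dot = sum(p * q for p, q in zip(a, b))
        norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
        scores.append(max(0.0, dot / (norm + eps)))
    return scores

def reweight(features, scores):
    """Scale every channel by its coherence score."""
    return [[s * v for v in channel] for s, channel in zip(scores, features)]

# Illustrative camera parameters and anchor, not taken from the paper.
FOCAL, CX, CY, BASELINE = 700.0, 600.0, 180.0, 0.5
anchor_center, anchor_size = (0.0, 1.0, 10.0), (2.0, 2.0, 4.0)
left_box = project_anchor(anchor_center, anchor_size, FOCAL, CX, CY)
right_box = project_anchor(anchor_center, anchor_size, FOCAL, CX, CY, camera_x=BASELINE)

# Toy RoI features: channel 0 agrees across views, channel 1 does not.
left_feat = [[1.0, 0.0, 2.0], [1.0, 0.0, 0.0]]
right_feat = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
scores = channel_coherence(left_feat, right_feat)
left_weighted = reweight(left_feat, scores)
```

In this sketch the same anchor yields a pair of 2D regions, one per view, whose horizontal offset depends on depth. That pairing is what gives the network object-level correspondence without a dense disparity map. Channels that disagree between the two views receive a weight near zero and are suppressed.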