Abstract: Monocular 3D detection has drawn much attention from the community due to its low cost and simple setup. It takes an RGB image as input and predicts 3D boxes in 3D space. The most challenging sub-task is instance depth estimation. Previous works usually estimate the instance depth directly. In this paper, however, we point out that the instance depth on the RGB image is non-intuitive. It couples visual depth clues with instance attribute clues, which makes it hard for a network to learn directly. Therefore, we propose to reformulate the instance depth as the combination of the instance visual surface depth (visual depth) and the instance attribute depth (attribute depth). The visual depth is related to objects' appearances and positions on the image. By contrast, the attribute depth relies on objects' inherent attributes, which are invariant to affine transformations of the object on the image. Correspondingly, we decouple the 3D location uncertainty into visual depth uncertainty and attribute depth uncertainty. By combining the different types of depths and their associated uncertainties, we obtain the final instance depth. Furthermore, data augmentation in monocular 3D detection is usually limited by the physical nature of the task, which hinders performance gains. The proposed instance depth disentanglement strategy alleviates this problem. Evaluated on KITTI, our method achieves new state-of-the-art results, and extensive ablation studies validate the effectiveness of each component of our method. The code is released at https://github.com/SPengLiang/DID-M3D.
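
The decomposition described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the depths and uncertainties are fused, so everything below is an assumption made for illustration: the function name `combine_instance_depth` is hypothetical, the per-cell uncertainties are assumed to add, confidence is assumed to be `exp(-uncertainty)`, and the final depth is assumed to be a confidence-weighted average over the cells of an object region. The released code may do this differently.

```python
import numpy as np


def combine_instance_depth(visual_depth, attribute_depth, visual_unc, attribute_unc):
    """Fuse per-cell depth estimates into a single instance depth.

    All inputs are arrays of the same shape, e.g. a small grid laid over
    one object's image region. This is an illustrative sketch, not the
    authors' implementation.
    """
    visual_depth = np.asarray(visual_depth, dtype=float)
    attribute_depth = np.asarray(attribute_depth, dtype=float)
    visual_unc = np.asarray(visual_unc, dtype=float)
    attribute_unc = np.asarray(attribute_unc, dtype=float)

    # From the abstract: instance depth = visual depth + attribute depth.
    cell_depth = visual_depth + attribute_depth

    # Assumption: the two uncertainties add to give the instance-depth uncertainty.
    cell_unc = visual_unc + attribute_unc

    # Assumption: map uncertainty to a confidence in (0, 1].
    cell_conf = np.exp(-cell_unc)

    # Assumption: confidence-weighted average over all cells.
    weights = cell_conf / cell_conf.sum()
    depth = float((weights * cell_depth).sum())
    confidence = float((weights * cell_conf).sum())
    return depth, confidence


# Two cells: the first is confident, the second is very uncertain,
# so the fused depth stays close to the first cell's 10 + 1 = 11.
depth, confidence = combine_instance_depth(
    visual_depth=[10.0, 20.0],
    attribute_depth=[1.0, 1.0],
    visual_unc=[0.0, 25.0],
    attribute_unc=[0.0, 25.0],
)
```

Weighting by confidence lets cells with unreliable estimates (for example, occluded parts of an object) contribute less to the final depth, which is one plausible reason to predict the two uncertainties separately.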

Attention-Based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection

ADU-Depth: Attention-based Distillation with Uncertainty Modeling for Depth Estimation

MonoTAKD: Teaching Assistant Knowledge Distillation for Monocular 3D Object Detection

FD3D: Exploiting Foreground Depth Map for Feature-Supervised Monocular 3D Object Detection

MonoSKD: General Distillation Framework for Monocular 3D Object Detection via Spearman Correlation Coefficient

Towards Efficient 3D Object Detection with Knowledge Distillation

Cross-Modality Knowledge Distillation Network for Monocular 3D Object Detection

Depth-Enhancement Network for Monocular 3D object detection

UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View

Depth Is All You Need for Monocular 3D Detection

Representation Disparity-aware Distillation for 3D Object Detection

Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision

Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection

Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth

DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection

Dynamic Depth Fusion and Transformation for Monocular 3D Object Detection

Densely Constrained Depth Estimator for Monocular 3D Object Detection

Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection

itKD: Interchange Transfer-based Knowledge Distillation for 3D Object Detection

BEVDistill: Cross-Modal BEV Distillation for Multi-View 3D Object Detection

MonoCD: Monocular 3D Object Detection with Complementary Depths