Abstract: In this paper we propose an approach for monocular 3D object detection from a single RGB image, which leverages a novel disentangling transformation for 2D and 3D detection losses and a novel, self-supervised confidence score for 3D bounding boxes. Our proposed loss disentanglement has the twofold advantage of simplifying the training dynamics in the presence of losses with complex interactions of parameters, and sidestepping the issue of balancing independent regression terms. Our solution overcomes these issues by isolating the contribution made by groups of parameters to a given loss, without changing its nature. We further apply loss disentanglement to another novel, signed Intersection-over-Union criterion-driven loss for improving 2D detection results. Besides our methodological innovations, we critically review the AP metric used in KITTI3D, which emerged as the most important dataset for comparing 3D detection results. We identify and resolve a flaw in the 11-point interpolated AP metric, which affects all previously published detection results and particularly biases the results of monocular 3D detection. We provide extensive experimental evaluations and ablation studies on the KITTI3D and nuScenes datasets, setting new state-of-the-art results on object category car by large margins.

What problem does this paper attempt to address?

This paper addresses several challenges in monocular 3D object detection. Specifically, the authors introduce a novel disentangling transformation for the 2D and 3D detection losses and propose a self-supervised confidence score for 3D bounding boxes. They also re-examine the widely used average precision (AP) metric of the KITTI3D dataset, identify a flaw in it, and propose a correction.

### Main problems solved

1. **Interaction among complex parameters**:
   - In monocular 3D object detection, the parameters of the 2D and 3D detection losses interact in complex ways, which makes optimization during training difficult. The authors introduce the disentangling transformation to isolate the contribution of each parameter group to the loss, which simplifies the training dynamics and sidesteps the problem of balancing independent regression terms.
   - Formula representation:
     \[ L_{\text{dis}}(y, \hat{y})=\sum_{j = 1}^{k}L(\psi(\theta_j,\hat{\theta}_{-j}),\hat{y}), \]
     where \(L\) is the original loss function, \(\psi\) is a function that maps the parameter groups to the target space, \(\theta_j\) is the \(j\)-th group of predicted parameters, and \(\hat{\theta}_{-j}\) denotes all remaining groups, taken from the ground truth \(\hat{y}\).

2. **Confidence scoring of 3D bounding boxes**:
   - To score the confidence of 3D bounding boxes, the authors introduce a self-supervised method that converts the 3D bounding box loss into a confidence value in the probability range.
   - Formula representation:
     \[ \hat{p}_{3D|2D}=e^{-\frac{1}{T}L_{bb}^{3D}(B,\hat{B})}, \]
     where \(T>0\) is the temperature parameter and \(L_{bb}^{3D}(B,\hat{B})\) is the 3D bounding box regression loss.

3. **Flaws in the KITTI3D AP metric**:
   - The authors find that the 11-point interpolated average precision (AP) metric used in the KITTI3D dataset has a major flaw: a single high-confidence detection can already earn a disproportionately high AP score, which leads to an overestimation of model performance.
   - They therefore propose a corrected AP calculation that evaluates model performance more accurately.

### Summary

This paper improves monocular 3D object detection by introducing the disentangling transformation and the self-supervised confidence score. Through a critical review of the AP metric used in KITTI3D, it also exposes a shortcoming of the existing evaluation protocol, which provides an important reference for future research.
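The disentangled loss formula can be illustrated with a small sketch. It assumes a toy 2D box whose parameters are split into two groups, `center` and `size`, and it reads the hatted quantities as ground truth. The helper names (`psi`, `l1_loss`, `disentangled_loss`) and the choice of an L1 loss on box corners are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def psi(center, size):
    """Map the parameter groups (center, size) to corners [x1, y1, x2, y2]."""
    center = np.asarray(center, dtype=float)
    size = np.asarray(size, dtype=float)
    return np.concatenate([center - size / 2, center + size / 2])

def l1_loss(y, y_hat):
    """Mean absolute error between two corner vectors (stand-in for L)."""
    return float(np.abs(y - y_hat).mean())

def disentangled_loss(pred_groups, gt_groups, psi, loss):
    """Sum over groups j of L(psi(pred group j, ground-truth other groups), y_hat)."""
    y_hat = psi(**gt_groups)
    total = 0.0
    for name in pred_groups:
        mixed = dict(gt_groups)           # start from ground-truth parameters
        mixed[name] = pred_groups[name]   # swap in only the j-th predicted group
        total += loss(psi(**mixed), y_hat)
    return total

# Hypothetical prediction and ground truth for a single box.
pred = {"center": [10.0, 10.0], "size": [4.0, 6.0]}
gt = {"center": [11.0, 10.0], "size": [4.0, 4.0]}

print(disentangled_loss(pred, gt, psi, l1_loss))
```

Each term of the sum depends on one predicted group only, so an error in `size` cannot mask or amplify the loss signal for `center`, which is the isolation effect described above.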
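The confidence formula maps a non-negative 3D box loss to a value in (0, 1]. The following minimal sketch only evaluates that mapping; the function name `confidence_target` and the default temperature are assumptions for illustration, and how the resulting value is used to train a confidence output is not covered here.

```python
import math

def confidence_target(loss_3d, temperature=1.0):
    """Return exp(-loss / T): 1.0 for a perfect box, approaching 0 as the loss grows."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if loss_3d < 0:
        raise ValueError("loss must be non-negative")
    return math.exp(-loss_3d / temperature)

# A larger temperature makes the score decay more slowly with the loss.
for loss in (0.0, 0.5, 2.0):
    print(loss, confidence_target(loss, temperature=1.0), confidence_target(loss, temperature=4.0))
```

Because the score is derived from the regression loss itself, no extra annotation is needed, which is the sense in which the summary calls it self-supervised.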
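The AP flaw can be reproduced with a few lines of arithmetic. The sketch below implements 11-point interpolated AP in its standard form and evaluates it for a hypothetical detector that returns exactly one correct detection out of 100 ground-truth objects. The summary does not spell out the correction, so the second variant, which simply drops the recall-0 sample point, is an illustrative assumption rather than the authors' exact metric.

```python
import numpy as np

def interpolated_ap(recalls, precisions, recall_points):
    """Average, over the sample points, of the best precision at recall >= point."""
    recalls = np.asarray(recalls, dtype=float)
    precisions = np.asarray(precisions, dtype=float)
    total = 0.0
    for r in recall_points:
        reachable = recalls >= r
        total += precisions[reachable].max() if reachable.any() else 0.0
    return float(total / len(recall_points))

# Hypothetical detector: one true positive, no false positives, 100 objects.
recalls = [0.01]
precisions = [1.0]

ap_11 = interpolated_ap(recalls, precisions, np.linspace(0.0, 1.0, 11))
ap_without_zero = interpolated_ap(recalls, precisions, np.linspace(0.1, 1.0, 10))

print(ap_11)            # 1/11, about 0.0909
print(ap_without_zero)  # 0.0
```

The recall-0 sample point is satisfied by any detector with at least one correct match, so it contributes a full 1/11 to the score regardless of how many objects are missed.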

Disentangling Monocular 3D Object Detection
