NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth Supervision for Indoor Multi-View 3D Detection

Chenxi Huang, Yuenan Hou, Weicai Ye, Di Huang, Xiaoshui Huang, Binbin Lin, Deng Cai, Wanli Ouyang
2024-02-22
Abstract: NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by innovatively utilizing NeRF to enhance representation learning. Despite its notable performance, we uncover three decisive shortcomings in its current design, including semantic ambiguity, inappropriate sampling, and insufficient utilization of depth supervision. To combat the aforementioned problems, we present three corresponding solutions: 1) Semantic Enhancement. We project the freely available 3D segmentation annotations onto the 2D plane and leverage the corresponding 2D semantic maps as the supervision signal, significantly enhancing the semantic awareness of multi-view detectors. 2) Perspective-aware Sampling. Instead of employing the uniform sampling strategy, we put forward the perspective-aware sampling policy that samples densely near the camera while sparsely in the distance, more effectively collecting the valuable geometric clues. 3) Ordinal Residual Depth Supervision. As opposed to directly regressing the depth values that are difficult to optimize, we divide the depth range of each scene into a fixed number of ordinal bins and reformulate the depth prediction as the combination of the classification of depth bins as well as the regression of the residual depth values, thereby benefiting the depth learning process. The resulting algorithm, NeRF-Det++, has exhibited appealing performance in the ScanNetV2 and ARKITScenes datasets. Notably, in ScanNetV2, NeRF-Det++ outperforms the competitive NeRF-Det by +1.9% in mAP@0.25 and +3.5% in mAP@0.50. The code will be publicly available at https://github.com/mrsempress/NeRF-Detplusplus.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper aims to address three key issues in NeRF-Det for indoor multi-view 3D detection:

1. **Semantic Ambiguity**:
   - NeRF-Det can roughly estimate where objects are in space, but without semantic guidance it often assigns them the wrong category.
2. **Inappropriate Sampling**:
   - NeRF-Det adopts a uniform sampling strategy. Distant regions have fewer visual cues and larger errors, so they dominate the depth loss, and this imbalance leads to poor depth learning in nearby regions.
3. **Insufficient Utilization of Depth Supervision**:
   - Directly regressing depth values is difficult to optimize. NeRF-Det's depth supervision strategy fails to fully utilize valuable visual cues from multi-view images, leading to suboptimal depth learning.

To address these issues, the authors propose the following solutions:

1. **Semantic Enhancement**:
   - Introduce a semantic enhancement module that projects freely available 3D segmentation annotations onto the 2D plane and uses the corresponding 2D semantic maps as supervision signals, significantly enhancing the semantic awareness of the multi-view detector.
2. **Perspective-aware Sampling**:
   - Design a perspective-aware sampling strategy that samples densely near the camera and sparsely in the distance, thereby collecting valuable geometric cues more effectively. Each view thus spends more of its samples on the nearby regions where it has the most reliable information.
3. **Ordinal Residual Depth Supervision**:
   - Divide the depth range of each scene into a fixed number of ordinal bins and reformulate depth prediction as a combination of depth-bin classification and regression of the residual depth value. This helps achieve a more stable depth learning process.
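
The perspective-aware sampling idea described above can be illustrated with a small sketch. The summary does not give the paper's exact sampling rule, so the example below assumes one common way to obtain "dense near, sparse far" samples: spacing the depths evenly in inverse depth. The function names and the depth range are hypothetical and chosen only for illustration.

```python
import numpy as np


def uniform_depths(near, far, n):
    """Evenly spaced depths between near and far (the baseline strategy)."""
    return np.linspace(near, far, n)


def perspective_aware_depths(near, far, n):
    """Depths spaced evenly in inverse depth: dense near the camera, sparse far away.

    Assumption: inverse-depth spacing is only one possible rule with this
    behaviour; it is not claimed to be the rule used in NeRF-Det++.
    """
    inv = np.linspace(1.0 / near, 1.0 / far, n)
    return 1.0 / inv


# Hypothetical depth range in metres and sample count.
near, far, n = 0.5, 8.0, 9
u = uniform_depths(near, far, n)
p = perspective_aware_depths(near, far, n)

# Count how many samples fall in the nearer half of the depth range.
mid = 0.5 * (near + far)
print("uniform samples in near half:          ", int((u < mid).sum()))
print("perspective-aware samples in near half:", int((p < mid).sum()))
```

Both functions cover the same depth range with the same number of samples; only the spacing differs, which is what shifts the sampling budget toward the camera.
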
With these improvements, NeRF-Det++ improves on NeRF-Det on the ScanNetV2 and ARKITScenes datasets. On ScanNetV2 in particular, mAP@0.25 increases by 1.9% and mAP@0.50 increases by 3.5%.
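
To make the ordinal residual depth formulation described above concrete, the sketch below encodes a depth value as a bin index plus a residual inside that bin, decodes it back, and combines a classification term with a regression term. It rests on several assumptions that the summary does not confirm: equal-width bins, softmax cross-entropy for the bin term, an L1 penalty for the residual term, and equal weighting of the two. All function names are hypothetical.

```python
import numpy as np


def encode_depth(depth, d_min, d_max, num_bins):
    """Split each depth into (bin index, residual from the bin's lower edge).

    Assumption: equal-width bins over [d_min, d_max].
    """
    bin_size = (d_max - d_min) / num_bins
    idx = np.floor((depth - d_min) / bin_size)
    idx = np.clip(idx, 0, num_bins - 1).astype(np.int64)
    residual = depth - (d_min + idx * bin_size)
    return idx, residual


def decode_depth(idx, residual, d_min, d_max, num_bins):
    """Recover depth from a bin index and its residual."""
    bin_size = (d_max - d_min) / num_bins
    return d_min + idx * bin_size + residual


def depth_loss(logits, pred_residual, gt_depth, d_min, d_max, num_bins):
    """Bin classification (cross-entropy) plus residual regression (L1).

    Assumption: both loss choices and their equal weighting are illustrative.
    logits has shape (N, num_bins); pred_residual and gt_depth have shape (N,).
    """
    idx, residual = encode_depth(gt_depth, d_min, d_max, num_bins)
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    ce = -log_prob[np.arange(len(idx)), idx].mean()
    l1 = np.abs(pred_residual - residual).mean()
    return ce + l1


# Hypothetical depths in metres and bin settings.
gt = np.array([0.5, 1.3, 4.0, 8.0])
d_min, d_max, num_bins = 0.5, 8.0, 5
idx, res = encode_depth(gt, d_min, d_max, num_bins)
recovered = decode_depth(idx, res, d_min, d_max, num_bins)

# A prediction that picks the right bin and the right residual.
good_logits = 50.0 * np.eye(num_bins)[idx]
good_loss = depth_loss(good_logits, res, gt, d_min, d_max, num_bins)
# An uninformative prediction: flat logits and zero residual.
flat_loss = depth_loss(np.zeros((len(gt), num_bins)), np.zeros_like(gt),
                       gt, d_min, d_max, num_bins)
```

The classification term only has to pick the coarse bin, while the regression term handles a residual bounded by the bin width, which is one way to read the claim that this split is easier to optimize than regressing raw depth.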