Abstract: NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by innovatively utilizing NeRF to enhance representation learning. Despite its notable performance, we uncover three decisive shortcomings in its current design: semantic ambiguity, inappropriate sampling, and insufficient utilization of depth supervision. To combat these problems, we present three corresponding solutions: 1) Semantic Enhancement. We project the freely available 3D segmentation annotations onto the 2D plane and leverage the corresponding 2D semantic maps as the supervision signal, significantly enhancing the semantic awareness of multi-view detectors. 2) Perspective-aware Sampling. Instead of employing the uniform sampling strategy, we put forward a perspective-aware sampling policy that samples densely near the camera and sparsely in the distance, collecting valuable geometric clues more effectively. 3) Ordinal Residual Depth Supervision. As opposed to directly regressing depth values, which is difficult to optimize, we divide the depth range of each scene into a fixed number of ordinal bins and reformulate depth prediction as the combination of the classification of depth bins and the regression of the residual depth values, thereby benefiting the depth learning process. The resulting algorithm, NeRF-Det++, exhibits appealing performance on the ScanNetV2 and ARKITScenes datasets. Notably, on ScanNetV2, NeRF-Det++ outperforms the competitive NeRF-Det by +1.9% in mAP@0.25 and +3.5% in mAP@0.50. The code will be publicly available at https://github.com/mrsempress/NeRF-Detplusplus.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to address three key issues in NeRF-Det for indoor multi-view 3D detection:

1. **Semantic Ambiguity**:
   - While NeRF-Det can roughly estimate the spatial position of objects, it often makes mistakes in classification. Without semantic guidance, objects of different categories may be confused with one another.
2. **Inappropriate Sampling**:
   - NeRF-Det adopts a uniform sampling strategy, so the depth loss is dominated by distant areas, which have fewer visual cues and larger errors. This unbalanced depth learning results in poor learning performance in nearby areas.
3. **Insufficient Utilization of Depth Supervision**:
   - Directly regressing depth values is difficult to optimize. NeRF-Det's depth supervision strategy fails to fully utilize valuable visual cues from multi-view images, leading to suboptimal depth learning performance.

To address these issues, the authors propose the following solutions:

1. **Semantic Enhancement**:
   - Introduce a semantic enhancement module that projects freely available 3D segmentation annotations onto the 2D plane and uses the corresponding 2D semantic maps as supervision signals, significantly enhancing the semantic awareness of the multi-view detector.
2. **Perspective-aware Sampling**:
   - Design a perspective-aware sampling strategy that samples densely near the camera and sparsely in the distance, thereby collecting valuable geometric cues more effectively. This lets each view focus on the regions where its visual cues are most reliable.
3. **Ordinal Residual Depth Supervision**:
   - Divide the depth range of each scene into a fixed number of ordinal intervals and reformulate depth prediction as a combination of classification over depth intervals and regression of residual depth values. This helps achieve a more stable depth learning process.
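The ordinal residual depth formulation described above can be illustrated with a small sketch. It is not the authors' implementation: the uniform bin widths, the clamping behaviour, and the function names `encode_depth` and `decode_depth` are all assumptions made for illustration. It only shows how a depth value might be split into a bin index (the classification target) and an in-bin offset (the regression target), and how the two recombine.

```python
# Illustrative sketch only. Uniform bin widths and these function names are
# assumptions, not the NeRF-Det++ implementation.

def encode_depth(depth, d_min, d_max, num_bins):
    """Split a depth value into (bin index, residual inside that bin)."""
    if num_bins < 1 or d_max <= d_min:
        raise ValueError("need num_bins >= 1 and d_max > d_min")
    bin_width = (d_max - d_min) / num_bins
    # Clamp so that values on or outside the range still map to a valid bin.
    clamped = min(max(depth, d_min), d_max)
    index = min(int((clamped - d_min) / bin_width), num_bins - 1)
    # Residual is the offset from the lower edge of the selected bin.
    residual = clamped - (d_min + index * bin_width)
    return index, residual


def decode_depth(index, residual, d_min, d_max, num_bins):
    """Recombine a bin index and residual into a depth value."""
    bin_width = (d_max - d_min) / num_bins
    return d_min + index * bin_width + residual
```

In a setup like this, the bin index would be trained with a classification loss and the residual with a regression loss; the residual is bounded by the bin width, which is one plausible reason such a target is easier to fit than raw depth.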
With these improvements, NeRF-Det++ improves over NeRF-Det on the ScanNetV2 and ARKITScenes datasets. On ScanNetV2 in particular, mAP@0.25 increases by 1.9% and mAP@0.50 increases by 3.5%.
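To make the contrast between uniform and perspective-aware sampling concrete, here is a minimal sketch. The summary does not state the exact sampling rule used in NeRF-Det++, so the geometric spacing below is an assumed stand-in that merely has the described property (dense near the camera, sparse in the distance). The function names and the `near`, `far`, and `n` parameters are hypothetical.

```python
# Illustrative sketch only. Geometric spacing is an assumed stand-in for the
# paper's perspective-aware sampling rule, which is not specified here.

def uniform_depths(near, far, n):
    """Evenly spaced depths between near and far (the baseline strategy)."""
    if n < 2 or far <= near:
        raise ValueError("need n >= 2 and far > near")
    step = (far - near) / (n - 1)
    return [near + i * step for i in range(n)]


def near_dense_depths(near, far, n):
    """Depths whose spacing grows with distance from the camera.

    Consecutive samples differ by a constant ratio, so the gaps are small
    close to the camera and large far away.
    """
    if n < 2 or near <= 0 or far <= near:
        raise ValueError("need n >= 2 and 0 < near < far")
    ratio = (far / near) ** (1.0 / (n - 1))
    return [near * ratio ** i for i in range(n)]
```

With the same sample budget, `near_dense_depths` places more of its samples in the front half of the depth range than `uniform_depths` does, which is the behaviour the perspective-aware strategy is described as having.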

NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth Supervision for Indoor Multi-View 3D Detection
