CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoors Object Detection from Multi-view Images

Guanlin Shen, Jingwei Huang, Zhihua Hu, Bin Wang

2024-04-09

Abstract: This paper introduces CN-RMA, a novel approach for 3D indoor object detection from multi-view images. We identify the key challenge as the ambiguity of image-to-3D correspondence in the absence of explicit geometry that provides occlusion information. To address this issue, CN-RMA leverages the synergy of 3D reconstruction networks and 3D object detection networks: the reconstruction network provides a rough Truncated Signed Distance Function (TSDF) and guides image features to vote into the correct 3D locations in an end-to-end manner. Specifically, we associate weights with the sampled points of each ray through ray marching. These weights represent the contribution of a pixel in an image to the corresponding 3D locations. The weights are determined by the predicted signed distances, so image features vote only to regions near the reconstructed surface. Our method achieves state-of-the-art performance in 3D object detection from multi-view images, as measured by mAP@0.25 and mAP@0.5 on the ScanNet and ARKitScenes datasets. The code and models are released at
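The abstract says that each sample along a ray receives a weight derived from the predicted signed distances, so that features land only near the reconstructed surface. It does not give the exact formula. The sketch below therefore substitutes the NeuS-style discrete opacity, which applies a sigmoid to the signed distance and accumulates transmittance front to back, as a stand-in. The function names and the sharpness parameter `s` are hypothetical. The point it illustrates is occlusion: a second surface behind the first one receives almost no weight.

```python
import numpy as np


def sigmoid(x, s):
    # Logistic function with sharpness s; the argument is clipped to avoid overflow
    return 1.0 / (1.0 + np.exp(-np.clip(s * x, -60.0, 60.0)))


def ray_marching_weights(sdf, s=50.0):
    """NeuS-style weights for samples ordered near-to-far along one camera ray.

    sdf: predicted signed distances at consecutive samples
         (positive = free space, negative = behind/inside a surface).
    s:   hypothetical sharpness hyperparameter of the sigmoid.
    Returns one weight per sample; weights concentrate at the first
    positive-to-negative zero crossing, i.e. the first visible surface.
    """
    sdf = np.asarray(sdf, dtype=float)
    phi = sigmoid(sdf, s)
    # Discrete opacity of each interval [i, i+1]; zero where the SDF increases
    alpha = np.clip((phi[:-1] - phi[1:]) / np.maximum(phi[:-1], 1e-6), 0.0, 1.0)
    # Transmittance: fraction of the ray that reaches interval i unoccluded
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    # The last sample has no following interval, so it receives zero weight
    return np.append(weights, 0.0)


# A ray that hits one surface (near index 2), exits it, then reaches a second surface
sdf = [0.2, 0.1, 0.0, -0.1, -0.2, -0.1, 0.0, 0.1, 0.2, 0.1, 0.0, -0.1, -0.2]
w = ray_marching_weights(sdf)
print(np.round(w, 3))
```

Nearly all of the weight falls on the samples around the first zero crossing. The samples at the second crossing get almost nothing because the transmittance has already dropped to near zero. This is the occlusion behaviour that plain back-projection along the whole ray lacks.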

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses occlusion in 3D indoor object detection from multi-view images. Without explicit scene geometry to supply occlusion information, image features may be projected to the wrong 3D locations, which leads to inaccurate detections. For example, a pixel's feature may be spread along its entire camera ray, including voxels hidden behind the surface actually visible in that pixel. To tackle this challenge, the paper proposes CN-RMA. It combines a 3D reconstruction network with a 3D object detection network and introduces a Ray Marching Aggregation (RMA) module to handle occlusion in complex scenes, which improves 3D detection accuracy as measured by mAP@0.25 and mAP@0.5 on ScanNet and ARKitScenes. The main contributions of the paper are:

1. A seamless connection between the multi-view 3D reconstruction network and the 3D object detection network, which makes better use of image features in 3D space and improves performance.
2. RMA, an occlusion-aware aggregation method that uses the reconstructed scene TSDF to resolve complex occlusions.
3. A pre-training and fine-tuning scheme that achieves state-of-the-art performance in indoor 3D object detection from multi-view images.
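The aggregation step can be pictured as weighted voting. Each ray carries the feature of the pixel that cast it, and every sample adds that feature, scaled by its ray-marching weight, to the voxel it falls in. The sketch below assumes that rays have already been sampled and mapped to flattened voxel indices. The input layout, the function name, and the choice to normalize each voxel to a weighted average of its votes are all hypothetical; the paper may combine votes across views differently.

```python
import numpy as np


def aggregate_ray_features(voxel_ids, weights, pixel_feats, num_voxels):
    """Weighted voting of per-pixel features into a flattened voxel grid.

    voxel_ids:   (R, S) int array, voxel hit by sample j of ray r
    weights:     (R, S) float array, ray-marching weight of each sample
    pixel_feats: (R, C) float array, image feature of the pixel that cast ray r
    Returns a (num_voxels, C) array; voxels that receive no weight stay zero.
    """
    C = pixel_feats.shape[1]
    feat_sum = np.zeros((num_voxels, C))
    weight_sum = np.zeros(num_voxels)
    # Every sample votes its ray's pixel feature, scaled by the sample weight
    contrib = weights[..., None] * pixel_feats[:, None, :]  # (R, S, C)
    # np.add.at accumulates correctly even when several samples hit one voxel
    np.add.at(feat_sum, voxel_ids.reshape(-1), contrib.reshape(-1, C))
    np.add.at(weight_sum, voxel_ids.reshape(-1), weights.reshape(-1))
    # Normalize to a weighted average of all votes a voxel received
    out = np.zeros_like(feat_sum)
    hit = weight_sum > 1e-8
    out[hit] = feat_sum[hit] / weight_sum[hit][:, None]
    return out


# Two rays from different views; both put their weight on voxel 1
voxel_ids = np.array([[0, 1, 2], [3, 1, 2]])
weights = np.array([[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
pixel_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
print(aggregate_ray_features(voxel_ids, weights, pixel_feats, num_voxels=4))
```

Voxels that the weights exclude (here 0, 2, and 3) receive no feature at all. Occluded or empty space therefore stays clean instead of collecting a smeared copy of every pixel along the ray.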

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

ObjectFusion: an Object Detection and Segmentation Framework with RGB-D SLAM and Convolutional Neural Networks

Object-aware Semantic Mapping of Indoor Scenes Using Octomap

2D-to-3D Projection for Monocular and Multi-View 3D Multi-class Object Detection in Indoor Scenes

3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction

NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection

From Multi-View to Hollow-3D: Hallucinated Hollow-3D R-CNN for 3D Object Detection

Hybrid 3D Reconstruction of Indoor Scenes Integrating Object Recognition

Neural 3D Scene Reconstruction with the Manhattan-world Assumption

M3D-RPN: Monocular 3D Region Proposal Network for Object Detection

NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth Supervision for Indoor Multi-View 3D Detection

Neural 3D Scene Reconstruction with Indoor Planar Priors

RayFormer: Improving Query-Based Multi-Camera 3D Object Detection via Ray-Centric Strategies

Refined Voting and Scene Feature Fusion for 3D Object Detection in Point Clouds

DMSC-Net: A deep Multi-Scale context network for 3D object detection of indoor point clouds

CAGroup3D: Class-Aware Grouping for 3D Object Detection on Point Clouds

Three-Dimensional Reconstruction of Indoor Scenes Based on Implicit Neural Representation

AGO-Net: Association-Guided 3D Point Cloud Object Detection Network

MMAF-Net: Multi-view multi-stage adaptive fusion for multi-sensor 3D object detection