Abstract: This paper aims to develop a faster and more accurate solution to the amodal 3D object detection problem in indoor scenes. The solution is a novel neural network structure that takes a pair of RGB-D images as input and outputs oriented 3D bounding boxes. The network, named 3D-SSD, has two components: hierarchical feature fusion and multi-layer prediction. The hierarchical feature fusion combines multi-scale appearance and geometric features learned from RGB-D images, which are then used by the multi-layer prediction for object detection. Exploiting 2.5D representations in a synergistic way can improve both accuracy and efficiency. To specifically address the shape variance of different objects, a set of 3D anchor boxes with varying physical sizes is attached to every location on the prediction layers. At test time, category scores are generated for the 3D anchor boxes along with adjusted positions, sizes and orientations, and the final detections are obtained using non-maximum suppression. Comprehensive experiments have been performed on the publicly available SUN RGB-D and NYUv2 datasets. The results show that the proposed algorithm is the first 3D detector that runs in near real-time on these challenging datasets with performance competitive with state-of-the-art methods. 3D-SSD achieves 37.1% mAP on the SUN RGB-D dataset at around 5.6 fps, outperforming the state-of-the-art Deep Sliding Shape by 10.2% mAP while running around 109× faster. In an efficient model setting running at 9.3 fps, 3D-SSD still achieves 37% mAP. Further experiments suggest that the proposed approach achieves comparable accuracy and is about 477× faster than the state-of-the-art method on the NYUv2 dataset, even with a smaller input image size. (C) 2019 Published by Elsevier B.V.
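
The final step described in the abstract, turning scored 3D boxes into detections with non-maximum suppression, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes axis-aligned boxes (the paper predicts oriented boxes, whose overlap is harder to compute), and the names `iou_3d` and `nms_3d`, the box layout, and the 0.25 overlap threshold are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
# The paper's boxes are oriented; orientation is ignored here for simplicity.
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    """Volume of an axis-aligned box (zero if the box is degenerate)."""
    return max(0.0, b[3] - b[0]) * max(0.0, b[4] - b[1]) * max(0.0, b[5] - b[2])


def iou_3d(a: Box, b: Box) -> float:
    """Volumetric intersection-over-union of two axis-aligned boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        # Overlap along this axis, clamped to zero when the boxes are disjoint.
        inter *= max(0.0, hi - lo)
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def nms_3d(boxes: Sequence[Box], scores: Sequence[float],
           iou_threshold: float = 0.25) -> List[int]:
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it.

    Returns indices of the kept boxes, ordered by descending score.
    The default threshold is an illustrative assumption, not from the paper.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


boxes = [
    (0.0, 0.0, 0.0, 1.0, 1.0, 1.0),   # high-scoring box
    (0.1, 0.0, 0.0, 1.1, 1.0, 1.0),   # near-duplicate of the first
    (2.0, 2.0, 2.0, 3.0, 3.0, 3.0),   # separate object
]
scores = [0.9, 0.8, 0.7]
kept = nms_3d(boxes, scores)
print(kept)  # [0, 2]
```

In a per-category detector this routine would typically be run once per class on that class's scored boxes; whether 3D-SSD does so, and with which threshold, is not stated in the abstract.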

2.5D Convolution for RGB-D Semantic Segmentation

Improving RGB-D Face Recognition via Transfer Learning from a Pretrained 2D Network

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

Malleable 2.5D Convolution: Learning Receptive Fields Along the Depth-Axis for RGB-D Scene Parsing

Depth-aware CNN for RGB-D Segmentation

Spatial Information Guided Convolution for Real-Time RGBD Semantic Segmentation

Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks

RGB×D: Learning Depth-Weighted RGB Patches for RGB-D Indoor Semantic Segmentation

An RGB-D Fusion Based Semantic Segmentation Algorithm Based on Neighborhood Metric Relations

DDNet: Depth Dominant Network for Semantic Segmentation of RGB-D Images

High-Resolution Remote Sensing Image Semantic Segmentation Method Based on Improved Encoder-Decoder Convolutional Neural Network

Spatial-information Guided Adaptive Context-aware Network for Efficient RGB-D Semantic Segmentation

TCANet: three-stream coordinate attention network for RGB-D indoor semantic segmentation

Multi-resolution Cascaded Network with Depth-similar Residual Module for Real-time Semantic Segmentation on RGB-D Images

RGB-D Salient Object Detection via 3D Convolutional Neural Networks

Dilated Nearest-Neighbor Encoding for 3D Semantic Segmentation of Point Clouds

Coupling Two-Stream RGB-D Semantic Segmentation Network by Idempotent Mappings

Depth-Adapted CNNs for RGB-D Semantic Segmentation

Random 2.5D U-net for Fully 3D Segmentation

Attention-based Dual Supervised Decoder for RGBD Semantic Segmentation

Anisotropic Convolutional Neural Networks for RGB-D based Semantic Scene Completion