Abstract: This paper develops a faster and more accurate solution to the amodal 3D object detection problem in indoor scenes. The solution is a novel neural network architecture that takes a pair of RGB-D images as input and outputs oriented 3D bounding boxes. The network, named 3D-SSD, has two components: hierarchical feature fusion and multi-layer prediction. Hierarchical feature fusion combines multi-scale appearance and geometric features learned from the RGB-D images, and multi-layer prediction then uses the fused features for object detection. Exploiting 2.5D representations in this synergistic way improves both accuracy and efficiency. To address the shape variance of different objects, a set of 3D anchor boxes with varying physical sizes is attached to every location on the prediction layers. At test time, the network generates category scores for the 3D anchor boxes together with adjusted positions, sizes and orientations, and the final detections are obtained by non-maximum suppression. Comprehensive experiments were performed on the publicly accessible SUN RGB-D and NYUv2 datasets. The results show that the proposed algorithm is the first 3D detector to run in near real time on these challenging datasets while remaining competitive with state-of-the-art methods. 3D-SSD achieves 37.1% mAP on the SUN RGB-D dataset at around 5.6 fps, outperforming the state-of-the-art Deep Sliding Shapes by 10.2% mAP while running around 109× faster. In a more efficient model setting running at 9.3 fps, 3D-SSD still achieves 37% mAP. Experiments further suggest that the proposed approach achieves comparable accuracy to, and is about 477× faster than, the state-of-the-art method on the NYUv2 dataset, even with a smaller input image size. © 2019 Published by Elsevier B.V.
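
The final step described in the abstract, selecting detections from scored 3D boxes by non-maximum suppression, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes axis-aligned boxes given as (xmin, ymin, zmin, xmax, ymax, zmax), whereas the paper predicts oriented boxes, and the names `iou_3d`, `nms_3d` and the default threshold of 0.5 are hypothetical choices made for illustration only.

```python
from typing import List, Sequence, Tuple

# Assumption: axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax).
# The paper uses oriented boxes; orientation is ignored in this sketch.
Box = Tuple[float, float, float, float, float, float]


def volume(b: Box) -> float:
    """Volume of an axis-aligned 3D box (0 if the box is degenerate)."""
    return (max(0.0, b[3] - b[0])
            * max(0.0, b[4] - b[1])
            * max(0.0, b[5] - b[2]))


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0 else 0.0


def nms_3d(boxes: Sequence[Box], scores: Sequence[float],
           iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already-kept box is at most iou_threshold.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou_3d(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

For example, with boxes `[(0, 0, 0, 1, 1, 1), (0.1, 0, 0, 1.1, 1, 1), (5, 5, 5, 6, 6, 6)]` and scores `[0.9, 0.8, 0.7]`, the second box overlaps the first with an IoU of about 0.82 and is suppressed, so `nms_3d` returns `[0, 2]`. A real detector would apply this per category and would need an IoU computation that accounts for box orientation.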

Modality-specific and hierarchical feature learning for RGB-D hand-held object recognition

MMSS: Multi-modal Sharable and Specific Feature Learning for RGB-D Object Recognition

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Large-Margin Multi-Modal Deep Learning for RGB-D Object Recognition

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

Semi-supervised Learning for RGB-D Object Recognition

Multimodal deep learning for robust RGB-D object recognition

Human Action Recognition with Contextual Constraints Using a RGB-D Sensor

Multi-modal Deep Feature Learning for RGB-D Object Detection

Lightweight Multi-modal Representation Learning for RGB Salient Object Detection

Multi-Task and Multi-Modal Learning for RGB Dynamic Gesture Recognition

Depth Cue Enhancement and Guidance Network for RGB-D Salient Object Detection

Robust Multiview Feature Learning for RGB-D Image Understanding

Multi-Modal Unsupervised Feature Learning for RGB-D Scene Labeling

ASK: Adaptively Selecting Key Local Features for RGB-D Scene Recognition

RGB-D-Based Object Recognition Using Multimodal Convolutional Neural Networks: A Survey

Depth CNNs for RGB-D scene recognition: learning from scratch better than transferring from RGB-CNNs

Depth Images Could Tell Us More: Enhancing Depth Discriminability for RGB-D Scene Recognition

Recognizing Multi-View Objects with Occlusions Using a Deep Architecture

RGB-D Tracking Via Hierarchical Modality Aggregation and Distribution Network

Robust 3D Hand Detection from a Single RGB-D Image in Unconstrained Environments