TBFNT3D: Two-Branch Fusion Network with Transformer for Multimodal Indoor 3D Object Detection

Jun Cheng, Sheng Zhang
DOI: https://doi.org/10.1109/lra.2023.3309133
IF: 5.2
2023-01-01
IEEE Robotics and Automation Letters
Abstract: Indoor 3D object detection based on point clouds is widely applied in robotics, augmented reality, and virtual reality. Point clouds generated by RGB-D cameras are sparse for distant objects, which degrades detection performance. Multimodal 3D object detection can improve performance by fusing features from point clouds and images. RGB images can be converted to dense 3D features that complement detection based on point clouds alone; we refer to the 3D data transformed from RGB images as estimated 3D data. We propose TBFNT3D, a two-branch fusion network with a transformer for multimodal indoor 3D object detection. In TBFNT3D, voxels converted from the point clouds and from the images are added together to obtain a consistent voxel representation. This enriches the features in voxel space, and features from different modalities do not require a complex alignment process. Making better use of estimated 3D data requires handling its noise and removing redundant data. The receptive field of the 3D sparse convolution is expanded into the 2D image space, which weakens the effect of noise. A bin-based sampling strategy is applied to near and distant objects to remove redundant estimated 3D data. In addition, to fuse the multimodal features efficiently, we apply a deformable transformer to obtain the detection results. Finally, TBFNT3D is evaluated on the SUN RGB-D and ScanNet datasets and achieves state-of-the-art results.
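The voxel-addition idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `voxelize` and `fuse_voxels`, the 0.1 m voxel size, the per-voxel mean pooling, and the toy feature vectors are all assumptions made for illustration. The sketch only shows why a shared voxel grid avoids an explicit alignment step: both modalities are indexed by the same integer voxel coordinates, so fusion reduces to an element-wise sum.

```python
from math import floor


def voxelize(points, feats, voxel_size=0.1):
    """Average per-point feature vectors inside each occupied voxel.

    Hypothetical helper: mean pooling and voxel_size are assumptions,
    not details taken from the paper.
    """
    sums, counts = {}, {}
    for (x, y, z), f in zip(points, feats):
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        if key not in sums:
            sums[key] = [0.0] * len(f)
            counts[key] = 0
        sums[key] = [a + b for a, b in zip(sums[key], f)]
        counts[key] += 1
    return {k: [v / counts[k] for v in s] for k, s in sums.items()}


def fuse_voxels(cloud_voxels, estimated_voxels):
    """Element-wise sum over the union of occupied voxels."""
    fused = {k: list(v) for k, v in cloud_voxels.items()}
    for k, v in estimated_voxels.items():
        if k in fused:
            fused[k] = [a + b for a, b in zip(fused[k], v)]
        else:
            fused[k] = list(v)  # voxel seen only in the estimated 3D data
    return fused


# Toy data: two sensor points in one voxel, two estimated points in two voxels.
cloud = voxelize([(0.05, 0.05, 0.05), (0.06, 0.02, 0.01)], [[1.0, 0.0], [3.0, 0.0]])
estimated = voxelize([(0.02, 0.03, 0.04), (0.55, 0.0, 0.0)], [[0.0, 1.0], [0.0, 2.0]])
fused = fuse_voxels(cloud, estimated)
```

In this toy case the distant voxel is occupied only by estimated 3D data, which mirrors the abstract's motivation: image-derived points fill regions where the sensor point cloud is sparse.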
robotics