Abstract: This paper proposes TMSDNet, a novel 3D reconstruction network that obtains voxel features by exploiting the transformer's strong feature extraction and its ability to model the relative order between features. With the multiple bypasses and the residual dense attention block (RDAB) in the multi-scale residual attention module (MSRAM), TMSDNet can use the global shape and local detail information in the deep multi-scale representation of the object in the voxel feature domain to further improve performance. Extensive experiments show that TMSDNet achieves better reconstruction performance with fewer parameters and competitive inference time.

3D reconstruction is a long-standing problem. Recently, a number of studies have used transformers for 3D reconstruction, and these approaches have demonstrated strong performance. However, transformer-based 3D reconstruction methods tend either to establish the transformation between the 2D image and the 3D voxel space directly with transformers, or to rely solely on the transformers' powerful feature extraction. They ignore the crucial role of the deep multi-scale representation of the object in the voxel feature domain, which provides extensive global shape and local detail information about the object at multiple scales. To address this problem, we propose TMSDNet (transformer with multi-scale dense network), a novel transformer-based framework for single-view and multi-view 3D reconstruction. Our combined-transformer block, which follows the canonical encoder–decoder architecture, extracts voxel features with spatial order from the input image. A multi-scale residual attention module then extracts multi-scale global features from these voxel features in parallel. Furthermore, a residual dense attention block is introduced for deep local feature extraction and adaptive fusion. Finally, the voxel reconstruction block produces the reconstructed objects. Experimental results on benchmarks such as the ShapeNet and Pix3D datasets demonstrate that TMSDNet substantially outperforms existing state-of-the-art reconstruction methods.
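
To make the idea of parallel multi-scale feature extraction with a residual connection more concrete, the following is a minimal NumPy sketch. It is not the TMSDNet implementation: the abstract does not specify the layers inside MSRAM or RDAB, so the pooling scales, the parameter-free channel gate, and every function name here (`avg_pool3d`, `upsample3d`, `channel_attention`, `multi_scale_residual_attention`) are assumptions chosen only for illustration. A real network would use learned 3D convolutions and attention weights.

```python
import numpy as np

def avg_pool3d(x, k):
    """Average-pool a (C, D, H, W) voxel feature grid with a cubic window and stride k."""
    c, d, h, w = x.shape
    return x.reshape(c, d // k, k, h // k, k, w // k, k).mean(axis=(2, 4, 6))

def upsample3d(x, k):
    """Nearest-neighbour upsampling by a factor k along each spatial axis."""
    return x.repeat(k, axis=1).repeat(k, axis=2).repeat(k, axis=3)

def channel_attention(x):
    """Gate each channel by the sigmoid of its global mean (hypothetical, parameter-free stand-in)."""
    gate = 1.0 / (1.0 + np.exp(-x.mean(axis=(1, 2, 3), keepdims=True)))
    return x * gate

def multi_scale_residual_attention(voxels, scales=(1, 2, 4)):
    """Run parallel branches at several scales, fuse them, and add the input back."""
    # Each branch sees the grid at a coarser resolution, then returns to full size.
    branches = [upsample3d(avg_pool3d(voxels, k), k) for k in scales]
    fused = np.mean(branches, axis=0)
    # Residual connection: the attended multi-scale features refine the input.
    return voxels + channel_attention(fused)

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 16, 16, 16))  # assumed (channels, depth, height, width)
out = multi_scale_residual_attention(features)
print(out.shape)  # (8, 16, 16, 16)
```

The coarser branches summarise global shape, while the finest branch and the residual path preserve local detail. This is the intuition the abstract attributes to the multi-scale representation in the voxel feature domain.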

Complementary spatial transformer network for real-time 3D object recognition

Spatial Transformer for 3D Point Clouds

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

DVST: Deformable Voxel Set Transformer for 3D Object Detection from Point Clouds

Cross Modal Transformer: Towards Fast and Robust 3D Object Detection

Improving 3D Object Detection with Channel-wise Transformer

DeSTNet: Densely Fused Spatial Transformer Networks

YOLO-DCTI: Small Object Detection in Remote Sensing Base on Contextual Transformer Enhancement

T3DNet: Compressing Point Cloud Models for Lightweight 3D Recognition

TMSDNet: Transformer with multi‐scale dense network for single and multi‐view 3D reconstruction

SRCN3D: Sparse R-CNN 3D for Compact Convolutional Multi-View 3D Object Detection and Tracking

Long-short Range Adaptive Transformer with Dynamic Sampling for 3D Object Detection

TSSTDet: Transformation-Based 3-D Object Detection via a Spatial Shape Transformer

TBFNT3D: Two-Branch Fusion Network with Transformer for Multimodal Indoor 3D Object Detection

An Efficient 3-D Point Cloud Place Recognition Approach Based on Feature Point Extraction and Transformer

CMT: Convolutional Neural Networks Meet Vision Transformers

SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition

HCT-Det: a Hybrid CNN-transformer Architecture for 3D Object Detection from Point Clouds

SCTransNet: Spatial-channel Cross Transformer Network for Infrared Small Target Detection

3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification

Multi-Correlation Siamese Transformer Network with Dense Connection for 3D Single Object Tracking