Abstract: This paper proposes TMSDNet, a novel 3D reconstruction network that uses transformers' strong feature extraction and their ability to model the relative order between features to obtain voxel features. With the multiple bypasses and the residual dense attention block (RDAB) in its multi‐scale residual attention module (MSRAM), TMSDNet exploits the global shape and local detail information in the deep multi‐scale representation of the object in the voxel feature domain to further improve performance. Extensive experiments show that TMSDNet achieves better reconstruction performance with fewer parameters and competitive inference time.

3D reconstruction is a long‐standing problem. Recently, a number of studies have used transformers for 3D reconstruction and demonstrated strong performance. However, transformer‐based 3D reconstruction methods tend either to establish the transformation between the 2D image and the 3D voxel space directly with transformers or to rely solely on the feature extraction capabilities of transformers. They ignore the crucial role of the deep multi‐scale representation of the object in the voxel feature domain, which provides extensive global shape and local detail information about the object at multiple scales. To address this problem, we propose TMSDNet (transformer with multi‐scale dense network), a novel transformer‐based framework for single‐view and multi‐view 3D reconstruction. Our combined‐transformer block, which follows a canonical encoder–decoder architecture, extracts voxel features with spatial order from the input image. A multi‐scale residual attention module then uses these features to extract multi‐scale global features in parallel. Furthermore, a residual dense attention block is introduced for deep local feature extraction and adaptive fusion. Finally, the voxel reconstruction block produces the reconstructed objects. Experimental results on benchmarks such as ShapeNet and Pix3D demonstrate that TMSDNet substantially outperforms existing state‐of‐the‐art reconstruction methods.
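
The idea of extracting features from a voxel grid at several scales in parallel and fusing them through a residual connection can be illustrated with a minimal sketch. This is not the TMSDNet implementation: it has no attention and no learned weights, and it substitutes fixed average pooling and nearest‐neighbour upsampling for the paper's modules. All function names and the choice of scales (1, 2, 4) are assumptions made for illustration only.

```python
"""Toy multi-scale residual fusion on a cubic voxel grid (pure Python).

Assumption: fixed average pooling stands in for the learned, attention-based
branches described in the abstract. Names here are hypothetical.
"""


def make_grid(n, fn):
    """Build an n x n x n grid (nested lists) whose cells hold fn(x, y, z)."""
    return [[[float(fn(x, y, z)) for z in range(n)]
             for y in range(n)]
            for x in range(n)]


def avg_pool(grid, s):
    """Average-pool a cubic grid with window and stride s."""
    n = len(grid)
    if n % s != 0:
        raise ValueError("scale must divide the grid side length")
    out = make_grid(n // s, lambda x, y, z: 0.0)
    for x in range(n):
        for y in range(n):
            for z in range(n):
                out[x // s][y // s][z // s] += grid[x][y][z] / s ** 3
    return out


def upsample(grid, s):
    """Nearest-neighbour upsampling of a cubic grid by factor s."""
    n = len(grid) * s
    return make_grid(n, lambda x, y, z: grid[x // s][y // s][z // s])


def multi_scale_residual(grid, scales=(1, 2, 4)):
    """Return grid + mean over scales of upsample(avg_pool(grid, s), s).

    Each scale is an independent (parallel) branch; coarse scales carry
    global shape, fine scales keep local detail, and the input is added
    back as a residual (bypass) connection.
    """
    n = len(grid)
    branches = [upsample(avg_pool(grid, s), s) for s in scales]
    return make_grid(
        n,
        lambda x, y, z: grid[x][y][z]
        + sum(b[x][y][z] for b in branches) / len(branches),
    )


# Usage: a 4 x 4 x 4 grid whose first half along x is occupied.
voxels = make_grid(4, lambda x, y, z: 1.0 if x < 2 else 0.0)
fused = multi_scale_residual(voxels)
```

In this toy example the coarsest branch (scale 4) contributes the global mean occupancy to every cell, while the finer branches preserve the boundary between occupied and empty halves, so the fused grid carries both kinds of information.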

Multi-scale Latent Feature-Aware Network for Logical Partition Based 3D Voxel Reconstruction

3D Multiple-Contextual ROI-Attention Network for Efficient and Accurate Volumetric Medical Image Segmentation

Voxel-based 3D Detection and Reconstruction of Multiple Objects from a Single Image

3D Reconstruction and Semantic Segmentation Method Combining PointNet and 3D-Lmnet from Single Image

DLGAN: Depth-Preserving Latent Generative Adversarial Network for 3D Reconstruction

Latent Feature-Aware and Local Structure-Preserving Network for 3D Completion from a Single Depth View

AVFP-MVX: Multimodal VoxelNet with Attention Mechanism and Voxel Feature Pyramid

A Spatial Relationship Preserving Adversarial Network for 3D Reconstruction from a Single Depth View

PSVMLP: Point and Shifted Voxel MLP for 3D Deep Learning

MPVNN: Multi-resolution Point-Voxel Non-parametric Network for 3D Point Cloud Processing

2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

Multi-granularity Relationship Reasoning Network for High-Fidelity 3D Shape Reconstruction

Multi-scale Edge-guided Learning for 3D Reconstruction

ADR-MVSNet: A cascade network for 3D point cloud reconstruction with pixel occlusion

TMSDNet: Transformer with multi‐scale dense network for single and multi‐view 3D reconstruction

3D-Mask-GAN: Unsupervised Single-View 3D Object Reconstruction

3D Reconstruction for Multi-view Objects

SparseVoxNet: 3-D Object Recognition With Sparsely Aggregation of 3-D Dense Blocks

Component-Aware High-Resolution 3D Object Reconstruction

Object Reconstruction Based on Attentive Recurrent Network from Single and Multiple Images

3D Voxel Reconstruction from Single-View Image Based on Cross-Domain Feature Fusion