Refine3DNet: Scaling Precision in 3D Object Reconstruction from Multi-View RGB Images using Attention

Ajith Balakrishnan, Sreeja S, Linu Shine

DOI: https://doi.org/10.1145/3702250.3702292

2024-12-01

Abstract: Generating 3D models from multi-view 2D RGB images has gained significant attention, extending the capabilities of technologies like Virtual Reality, Robotic Vision, and human-machine interaction. In this paper, we introduce a hybrid strategy combining CNNs and transformers, featuring a visual auto-encoder with self-attention mechanisms and a 3D refiner network, trained using a novel Joint Train Separate Optimization (JTSO) algorithm. Encoded features from unordered inputs are transformed into an enhanced feature map by the self-attention layer, decoded into an initial 3D volume, and further refined. Our network generates 3D voxels from single or multiple 2D images from arbitrary viewpoints. Performance evaluations using the ShapeNet datasets show that our approach, combined with JTSO, outperforms state-of-the-art techniques in single and multi-view 3D reconstruction, achieving the highest mean intersection over union (IoU) scores, surpassing other models by 4.2% in single-view reconstruction.
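The abstract reports results as mean intersection over union (IoU) between predicted and ground-truth voxels. As a rough illustration of that metric, here is a minimal sketch in plain Python. The function names, the flat-list voxel layout, and the 0.5 occupancy threshold are assumptions made for this example; the paper's own evaluation code and threshold are not given here.

```python
def voxel_iou(pred, target, threshold=0.5):
    """IoU between predicted occupancy probabilities and binary ground truth.

    Both grids are flattened to 1D sequences of equal length.
    The threshold is an assumed value, not one taken from the paper.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")
    intersection = 0
    union = 0
    for p, t in zip(pred, target):
        p_occ = p > threshold   # voxel predicted as occupied
        t_occ = bool(t)         # voxel occupied in the ground truth
        intersection += p_occ and t_occ
        union += p_occ or t_occ
    # Two empty grids agree perfectly; treat that case as IoU = 1.
    return intersection / union if union else 1.0


def mean_iou(pairs, threshold=0.5):
    """Average IoU over an iterable of (pred, target) pairs."""
    scores = [voxel_iou(p, t, threshold) for p, t in pairs]
    return sum(scores) / len(scores)
```

For a real 32x32x32 grid the same logic applies after flattening the volume; an array library would normally replace the explicit loop.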

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the problem of generating high-precision 3D models from multi-view 2D RGB images. The researchers propose a network architecture, Refine3DNet, that aims to improve the accuracy and efficiency of 3D object reconstruction by combining the strengths of convolutional neural networks (CNNs) and transformers. According to the paper, traditional 3D reconstruction methods have limitations with multi-view images: feature fusion is difficult, results depend on the order of the input images, and sparse or noisy data are handled poorly. To address these problems, Refine3DNet uses a self-attention mechanism to aggregate image features from different views and introduces a training algorithm, JTSO (Joint Train Separate Optimization), to improve training efficiency and model performance.

The main contributions of the paper include:
- Proposing an architecture that can generate 3D voxels from one or more 2D images.
- Designing a self-attention mechanism that aggregates features from unordered images.
- Developing a three-stage training algorithm, JTSO, that updates network parameters independently at different training stages.
- Conducting a comprehensive evaluation on the ShapeNetCore dataset, with results showing that the method outperforms existing techniques in single-view and multi-view 3D reconstruction, especially when the number of input images is small.

With these changes, Refine3DNet improves the accuracy and robustness of 3D model reconstruction, particularly for complex shapes and small features.
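To make the idea of order-independent feature aggregation concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It applies scaled dot-product self-attention across per-view feature vectors and then averages the attended vectors. Using the raw features as queries, keys, and values (no learned projections) and fusing by mean pooling are both assumptions made for this example, as are all function names.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(views):
    """Scaled dot-product self-attention over a list of view feature vectors.

    Assumption: queries, keys, and values are the raw features themselves;
    a trained layer would apply learned projections first.
    """
    d = len(views[0])
    scale = math.sqrt(d)
    attended = []
    for q in views:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in views]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, views)) for j in range(d)]
        )
    return attended


def aggregate_views(views):
    """Fuse attended view features into one vector by mean pooling (assumed)."""
    attended = self_attention(views)
    n = len(attended)
    return [sum(row[j] for row in attended) / n for j in range(len(attended[0]))]


views = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = aggregate_views(views)
fused_reversed = aggregate_views(list(reversed(views)))
# The fused vector does not depend on view order (up to float rounding).
print(all(abs(a - b) < 1e-9 for a, b in zip(fused, fused_reversed)))
```

Because attention weights are computed from pairwise similarities and the pooling step is symmetric, reordering the input views only reorders the intermediate rows, which is why the fused vector is unchanged. The same function accepts a single view, in which case it returns that view's features.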

A Coarse-to-Fine Transformer-Based Network for 3D Reconstruction from Non-Overlapping Multi-View Images

3D Reconstruction and Semantic Segmentation Method Combining PointNet and 3D-Lmnet from Single Image

Adaptive fish school search optimized resnet for multi-view 3D objects reconstruction

3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction

FineRecon: Depth-aware Feed-forward Network for Detailed 3D Reconstruction

Enhanced Multi-Scale Attention-Driven 3D Human Reconstruction from Single Image

Enhanced multi view 3D reconstruction with improved MVSNet

ARShape-Net: Single-View Image Oriented 3D Shape Reconstruction with an Adversarial Refiner

GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement

ReFu: Refine and Fuse the Unobserved View for Detail-Preserving Single-Image 3D Human Reconstruction

RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

DV-Net: Dual-view Network for 3D Reconstruction by Fusing Multiple Sets of Gated Control Point Clouds

Single-view 3D reconstruction via dual attention

Reinforced Axial Refinement Network for Monocular 3D Object Detection

2L3: Lifting Imperfect Generated 2D Images into Accurate 3D

Deep Single-View 3D Object Reconstruction with Visual Hull Embedding

VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion

Fast 3D Pose Refinement with RGB Images

3D Reconstruction for Multi-view Objects

Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images