MTD-MVSNet: Multi-view Stereo Network with Multi-scale Transformer and Dual Attention

Yu Liang, Dongxu Duan, Yuhong Yuan, Kai Zhang
DOI: https://doi.org/10.1145/3655497.3655523
2024-01-01
Abstract: In this paper, we introduce MTD-MVSNet, a novel multi-view stereo (MVS) method that addresses two obstacles to accurate depth estimation: low-texture regions and inaccurate feature matching. MVS is crucial in computer vision for reconstructing 3D scenes from multiple images captured from diverse viewpoints. To overcome weak feature extraction and information loss, we propose the Multi-scale Transformer and Large Kernel Attention (MTL) method. MTL combines a multi-scale transformer token block with large kernel convolution, enhancing feature capture from local to global scales. Following the coarse-to-fine MVS pipeline, MTL is integrated into a multi-stage feature extractor. Furthermore, we introduce Dual-Channel Attention Cost Volume Aggregation (DCVA) to improve the efficiency of cost volume construction. DCVA applies dual attention along both the feature and depth channels, thereby improving the consistent fusion of source volumes. Experimental results show that our method outperforms recent approaches, achieving accuracy and completeness scores of 0.316 and 0.293, respectively.