Transformer-guided Feature Pyramid Network for Multi-View Stereo

Lina Wang, Jiangfeng She, Zhao Qiang, Xiang Wen, Yuzheng Guan
DOI: https://doi.org/10.1016/j.neucom.2024.129066
IF: 6
2024-01-01
Neurocomputing
Abstract: Feature Pyramid Networks (FPNs) are widely used in Multi-View Stereo (MVS) to extract multi-scale features, improving both the quality and the efficiency of reconstruction. However, the multi-scale features are simply fused by element-wise addition, which ignores the contextual relationships between them. To solve this problem, a Transformer-guided Feature Pyramid Network (TFPN-MVSNet) is proposed, which uses the self-attention mechanism to aggregate long-range context information across multi-scale features. First, a feature enhancement module based on an internal attention mechanism computes the importance of different parts of the feature sequence, so that the network pays more attention to features relevant to reconstruction, and a feature aggregation module based on a cross-attention mechanism captures the contextual information shared between features at different scales. Second, the two modules are integrated into the multi-scale feature extraction network to enhance the three-stage features from coarse to fine. Finally, the Spatial Weighted Cost Volume Aggregation (SWCA) method uses a spatial attention mechanism to compute weights for pixels at the same position in different views, which suppresses the negative impact of occluded pixels. On the DTU dataset, the overall error is reduced by 0.062 mm compared with CasMVSNet, and by 0.012 mm and 0.019 mm compared with the similar methods TransMVSNet and CostFormer, respectively. Our method also achieves competitive results on the Tanks & Temples and ETH3D datasets.
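The contrast the abstract draws between element-wise addition and cross-attention fusion can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `cross_attention_fuse`, the tensor shapes, the residual connection, and the absence of learned query/key/value projections are all assumptions made to keep the example short. It only shows how each fine-scale position could gather context from every coarse-scale position instead of from a single upsampled neighbour.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(fine, coarse):
    """Hypothetical fusion of two pyramid levels by cross attention.

    fine:   (C, H, W) feature map, used as queries.
    coarse: (C, h, w) feature map, used as keys and values.
    Returns the fused (C, H, W) map and the (H*W, h*w) attention matrix.
    Assumption: no learned projections; a real module would add them.
    """
    C, H, W = fine.shape
    q = fine.reshape(C, -1).T        # (H*W, C) query tokens
    kv = coarse.reshape(C, -1).T     # (h*w, C) key/value tokens
    # Scaled dot-product attention: each fine token scores all coarse tokens.
    attn = softmax(q @ kv.T / np.sqrt(C), axis=-1)
    context = attn @ kv              # (H*W, C) attended coarse context
    fused = q + context              # residual connection (an assumption)
    return fused.T.reshape(C, H, W), attn

rng = np.random.default_rng(0)
fine = rng.standard_normal((8, 16, 16))
coarse = rng.standard_normal((8, 8, 8))
fused, attn = cross_attention_fuse(fine, coarse)
print(fused.shape)  # (8, 16, 16)
```

Unlike element-wise addition, this fusion does not require the coarse map to be upsampled to the fine resolution first, because the attention matrix itself maps between the two grids.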