Feature‐enhanced representation with transformers for multi‐view stereo

Lintao Xiang, Hujun Yin
DOI: https://doi.org/10.1049/ipr2.13046
IF: 2.3
2024-03-05
IET Image Processing
Abstract: (1) We introduce Vision Transformers to enhance both 2D feature representations and global 3D spatial information aggregation. (2) We design a colour‐guided network to refine depth maps. (3) The proposed method achieves competitive performance on both the DTU dataset and the Tanks and Temples benchmark. Most existing multi‐view stereo (MVS) methods fail to consider global context information in the feature extraction and cost aggregation stages. Because transformers have shown remarkable performance on various vision tasks owing to their ability to perceive global contextual information, this paper proposes a transformer‐based feature enhancement network (TF‐MVSNet) that facilitates feature representation learning by combining local features (both 2D and 3D) with long‐range contextual information. To reduce the memory consumption of feature matching, a cross‐attention mechanism is leveraged to efficiently construct 3D cost volumes under the epipolar constraint. Additionally, a colour‐guided network is designed to refine depth maps at the coarse stage, thereby reducing incorrect depth predictions at the fine stage. Extensive experiments were performed on the DTU dataset and the Tanks and Temples (T&T) benchmark, and the results are reported.
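The idea of building a cost volume with cross‐attention under the epipolar constraint can be illustrated with a minimal NumPy sketch. This is not the TF‐MVSNet implementation: the function names, tensor shapes, and the assumption that source features have already been warped to D depth hypotheses (so that each reference pixel sees D candidates along its epipolar line) are all hypothetical choices made for illustration.

```python
import numpy as np

def epipolar_cross_attention(ref_feat, src_feat_warped):
    """Illustrative sketch only (hypothetical names and shapes).

    ref_feat:        (C, H, W) reference-view features, used as queries.
    src_feat_warped: (D, C, H, W) source-view features, assumed to be already
                     warped to D depth hypotheses, used as keys.
    Returns the raw matching scores and the softmax attention weights,
    both of shape (D, H, W).
    """
    channels = ref_feat.shape[0]
    # Scaled dot product between each reference pixel and its D candidates.
    scores = np.einsum("chw,dchw->dhw", ref_feat, src_feat_warped)
    scores = scores / np.sqrt(channels)
    # Numerically stable softmax over the depth axis only, so each pixel
    # attends to D candidates instead of every source pixel.
    shifted = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return scores, weights

def regress_depth(weights, depth_values):
    """Expected depth per pixel: attention-weighted sum of the hypotheses."""
    return np.einsum("dhw,d->hw", weights, depth_values)

# Usage with random data; the depth range is a hypothetical placeholder.
rng = np.random.default_rng(0)
D, C, H, W = 4, 8, 3, 5
ref = rng.normal(size=(C, H, W))
src = rng.normal(size=(D, C, H, W))
_, attn = epipolar_cross_attention(ref, src)
depth = regress_depth(attn, np.linspace(1.0, 2.0, D))
```

Restricting attention to the D samples along each epipolar line keeps the score tensor at D × H × W entries, whereas attending over all source pixels would require (H × W) × (H × W) entries; this is one plausible reading of the memory saving the abstract mentions.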
computer science, artificial intelligence; engineering, electrical & electronic; imaging science & photographic technology