A Transformer-Based Architecture for High-Resolution Stereo Matching

Di Jia, Peng Cai, Qian Wang, Ninghua Yang
DOI: https://doi.org/10.1109/tci.2024.3350884
IF: 5.4
2024-02-03
IEEE Transactions on Computational Imaging
Abstract: The Transformer architecture is now widely used due to its superior parallel computing and global modelling capabilities. In this paper, we build a dense Feature Extraction Transformer (FET) for stereo matching tasks, incorporating Transformer and convolution blocks. In stereo matching tasks, FET has three advantages: 1) for stereo image pairs with high resolution, Transformer blocks combined with spatial pyramidal pooling windows can obtain a wide range of contextual representations while maintaining linear computational complexity; 2) we use convolution and transposed convolution blocks to implement overlapping patch embedding, which allows features to capture enough proximity information to facilitate fine-grained matching; 3) FET creatively utilizes the jump-query strategy to apply the Transformer encoder and decoder structures to feature extraction tasks simultaneously. Furthermore, to obtain an architecture more thoroughly based on Transformer, we use STTR's (Li et al., 2021) attention-based pixel-matching strategy. Our model obtained 0.32 end-point error and 0.89% 3-px error on the Scene Flow benchmark (30.95% point and 29.36% point absolute improvement compared to STTR). On the KITTI 2015 benchmark, our model obtained 1.80 D1-bg in Estimated pixels (1.57 points of error reduction compared to STTR).
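The two Scene Flow figures quoted in the abstract are standard disparity metrics. The sketch below shows how they are commonly computed; it is an illustration only, not the authors' evaluation code. It assumes that end-point error is the mean absolute disparity error over valid pixels and that 3-px error is the percentage of valid pixels whose error exceeds 3 pixels. The function name `disparity_metrics` and the use of `None` to mark invalid ground truth are hypothetical choices made for this example.

```python
def disparity_metrics(pred, gt, threshold=3.0):
    """Return (end-point error, percentage of pixels with error above threshold).

    pred and gt are flat sequences of disparities in pixels. A ground-truth
    value of None marks an invalid pixel and is skipped (an assumption made
    for this sketch, not taken from the paper).
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if g is not None]
    if not errors:
        raise ValueError("no valid ground-truth pixels")
    epe = sum(errors) / len(errors)                 # mean absolute error
    bad = sum(1 for e in errors if e > threshold)   # pixels off by more than threshold
    return epe, 100.0 * bad / len(errors)


# Toy example: four pixels, the last one has no ground truth.
pred = [10.0, 20.5, 33.0, 7.0]
gt = [10.5, 20.0, 29.0, None]
epe, px3 = disparity_metrics(pred, gt)
print(f"EPE = {epe:.3f} px, 3-px error = {px3:.2f}%")
```

The KITTI 2015 D1 metric uses a stricter outlier rule that also involves a relative error threshold, so this sketch applies only to the plain pixel-threshold variant.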
engineering, electrical & electronic; imaging science & photographic technology