CoAtRSNet: Fully Exploiting Convolution and Attention for Stereo Matching by Region Separation

Junda Cheng, Gangwei Xu, Peng Guo, Xin Yang
DOI: https://doi.org/10.1007/s11263-023-01872-0
IF: 13.369
2023-01-01
International Journal of Computer Vision
Abstract: Stereo matching is a fundamental technique for many vision and robotics applications. State-of-the-art methods either employ convolutional neural networks with spatially-shared kernels or utilize content-dependent interactions (e.g., local or global attention) to augment convolution operations. Despite great improvements, existing methods suffer either from the high computational cost of global attention operations or from suboptimal performance at edge regions due to spatially-shared convolutions. In this paper, we propose a CoAtRS stereo matching method that fully exploits the complementary advantages of convolution and attention via region separation. Our method adaptively adopts the most suitable feature extraction and aggregation patterns for smooth and edge regions at a lower computational cost. In addition, we propose D-global attention, which performs global filtering on the disparity dimension to better fuse the cost volumes of different regions and alleviate the locality defects of convolutions. Our CoAtRS stereo matching method can also be embedded conveniently in various existing 3D CNN stereo networks, and the resulting networks achieve significant improvements in both accuracy and efficiency. Furthermore, we design an accurate network, named CoAtRSNet, which achieves state-of-the-art results on five public datasets. At the time of writing, CoAtRSNet ranks 1st–3rd on all the metrics published on the ETH3D website, ranks 2nd on Scene Flow, and ranks 1st for the Root-Mean-Square metric, 2nd for the average error metric and 3rd for the bad 0.5 metric on the Middlebury benchmark.