Playing to Vision Foundation Model's Strengths in Stereo Matching

Chuang-Wei Liu, Qijun Chen, Rui Fan
2024-04-09
Abstract: Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFM), particularly those developed based on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, specifically for dense prediction tasks, their performance often lacks in geometric vision tasks. This study serves as the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features, respectively. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network StereoBase by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
Computer Vision and Pattern Recognition, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
This paper examines how vision foundation models (VFMs) can be adapted to improve stereo matching, a key technique for 3D environment perception in autonomous driving. Convolutional neural networks (CNNs) have long been the mainstream choice for feature extraction in this area, but attention is shifting toward VFMs, particularly those built on vision Transformers (ViTs) and pre-trained through self-supervision on large unlabeled datasets. Although VFMs extract informative, general-purpose visual features, their performance on geometric vision tasks such as stereo matching often falls short. The paper proposes a ViT adapter, ViTAS, built from three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The spatial differentiation module initializes the feature pyramids, while the other two aggregate stereo and multi-scale contextual information into fine-grained features. The authors combine ViTAS with a cost volume-based stereo matching back-end to form ViTAStereo, which achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network, StereoBase, by approximately 7.9% in the percentage of error pixels at a 3-pixel tolerance. The paper also notes that some state-of-the-art networks bypass cost volume construction and use Transformers directly for dense prediction, but that this design generalizes poorly to new data. ViTAS therefore aims to play to the strengths of VFMs: effective adapters feed their general-purpose deep features into cost volume construction, rather than relying on those features alone for opaque disparity regression.
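To make the term "cost volume" concrete, the following is a minimal sketch of a correlation cost volume over a single scanline, followed by a winner-takes-all disparity pick. It is not the ViTAStereo implementation: the function names, the one-hot toy features, and the hard arg-max are all assumptions chosen for illustration, and a real network would use learned feature maps and a learned back-end on top of the volume.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two feature vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def correlation_cost_volume(left_row: List[Vector], right_row: List[Vector],
                            max_disp: int) -> List[List[float]]:
    """Return volume[d][x] = <left_row[x], right_row[x - d]>.

    Out-of-image candidates (x - d < 0) get -inf so they are never selected.
    """
    width = len(left_row)
    volume = []
    for d in range(max_disp + 1):
        scores = []
        for x in range(width):
            if x - d < 0:
                scores.append(float("-inf"))
            else:
                scores.append(dot(left_row[x], right_row[x - d]))
        volume.append(scores)
    return volume


def winner_takes_all(volume: List[List[float]]) -> List[int]:
    """Pick, per pixel, the disparity with the highest matching score."""
    width = len(volume[0])
    return [max(range(len(volume)), key=lambda d: volume[d][x])
            for x in range(width)]


# Toy scanline (hypothetical data): six right-view pixels with one-hot features.
right = [[1.0 if i == x else 0.0 for i in range(6)] for x in range(6)]
# The left view sees the same content shifted right by two pixels.
zero = [0.0] * 6
left = [zero, zero] + right[:4]

volume = correlation_cost_volume(left, right, max_disp=3)
disparities = winner_takes_all(volume)
print(disparities)
```

On this toy input the script prints `[0, 0, 2, 2, 2, 2]`: the two leftmost pixels have no counterpart in the right view, and every other pixel is matched at a disparity of 2. The point of the sketch is that disparity is read off an explicit table of matching scores, which is the structure the summary contrasts with direct, opaque disparity regression.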
The main contributions of the paper include: the first attempt to apply VFMs to stereo matching; a lightweight patch attention fusion module (PAFM) that learns local and global feature weighting parameters separately; a discussion of the limitations of networks that rely solely on the cross-attention mechanism for stereo matching; and a demonstration of ViTAStereo's superior performance and generalization ability on multiple public datasets. In conclusion, the paper aims to improve the accuracy and generalization of stereo matching by leveraging the deep features of VFMs, proposing a new adapter structure and validating its effectiveness through experiments.
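The headline result is stated as a percentage of error pixels with a 3-pixel tolerance. A minimal sketch of that metric follows, assuming a pixel counts as an error when its absolute disparity error is strictly greater than the tolerance and that pixels without ground truth are skipped. The helper `relative_reduction` assumes the "approximately 7.9%" figure is a relative reduction in that error rate, which the summary does not state explicitly. All names and numbers here are hypothetical illustrations, not benchmark results.

```python
from typing import List, Optional


def error_pixel_rate(pred: List[float], gt: List[Optional[float]],
                     tolerance: float = 3.0) -> float:
    """Percentage of valid pixels whose disparity error exceeds `tolerance`.

    A ground-truth value of None marks a pixel with no valid measurement.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g is not None]
    if not valid:
        raise ValueError("no valid ground-truth pixels")
    bad = sum(1 for p, g in valid if abs(p - g) > tolerance)
    return 100.0 * bad / len(valid)


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Relative reduction (in percent) of an error rate versus a baseline."""
    return 100.0 * (baseline_rate - new_rate) / baseline_rate


# Hypothetical disparities for four pixels; the third has no ground truth.
pred = [10.0, 20.0, 30.0, 40.0]
gt = [10.5, 24.0, None, 40.0]
rate = error_pixel_rate(pred, gt)  # one of three valid pixels is off by > 3 px

# Hypothetical error rates, used only to show the arithmetic.
gain = relative_reduction(baseline_rate=2.0, new_rate=1.5)
print(round(rate, 2), gain)
```

Under the relative reading, a 7.9% improvement means the new error rate is about 0.921 times the baseline rate, not that the error rate drops by 7.9 percentage points.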