Abstract: Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFMs), particularly those built on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, specifically for dense prediction tasks, they often underperform on geometric vision tasks. This study serves as the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is constructed upon three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features, respectively. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network, StereoBase, by approximately 7.9% in terms of the percentage of error pixels, with a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.

What problem does this paper attempt to address?

This paper investigates how vision foundation models (VFMs) can be used to improve stereo matching, a key technique for 3D environment perception in autonomous driving. Convolutional neural networks (CNNs) have traditionally been the main tool for feature extraction, but researchers have begun turning to Transformer-based VFMs that are pre-trained on large-scale unlabeled datasets through self-supervision. Although VFMs extract informative, general-purpose visual features well, their performance on geometric vision tasks such as stereo matching still needs improvement. The paper proposes a ViT adapter, referred to as ViTAS, built from three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes the feature pyramids, while the latter two aggregate stereo and multi-scale contextual information into fine-grained features. The authors construct a network called ViTAStereo, which combines ViTAS with a cost volume-based stereo matching back end. It achieves the highest ranking on the KITTI Stereo 2012 dataset, outperforming the second-place StereoBase network by approximately 7.9% in terms of the percentage of error pixels at a tolerance of 3 pixels. The paper also points out that although some state-of-the-art networks bypass cost volume construction and use Transformers directly for dense prediction, this design generalizes poorly to new data. The goal of ViTAS is therefore to play to the strengths of VFMs: effective adapters turn their general-purpose deep features into inputs for cost volume construction, rather than relying on those features alone for opaque disparity regression.
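To make the term "cost volume" concrete, the following is a minimal sketch of a generic correlation cost volume for a rectified stereo pair, written in pure Python. It is not the ViTAStereo back end. The function names, the nested-list feature layout `[row][column][channel]`, and the winner-take-all readout are all assumptions chosen for illustration.

```python
def correlation_cost_volume(left_feat, right_feat, max_disp):
    """Build a correlation volume indexed as volume[d][y][x].

    Assumption: features are nested lists shaped [height][width][channels],
    and the images are rectified, so the left pixel at column x is compared
    with the right pixel at column x - d on the same row.
    """
    height = len(left_feat)
    width = len(left_feat[0])
    volume = [[[0.0] * width for _ in range(height)] for _ in range(max_disp)]
    for d in range(max_disp):
        for y in range(height):
            # Columns x < d have no counterpart in the right image; keep 0.0.
            for x in range(d, width):
                f_left = left_feat[y][x]
                f_right = right_feat[y][x - d]
                volume[d][y][x] = sum(a * b for a, b in zip(f_left, f_right))
    return volume


def winner_take_all(volume):
    """Pick, per pixel, the disparity with the highest correlation."""
    max_disp = len(volume)
    height = len(volume[0])
    width = len(volume[0][0])
    return [
        [max(range(max_disp), key=lambda d: volume[d][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Toy example: one row of one-hot features; the left view is the right view
# shifted by one pixel, so the true disparity is 1 wherever a match exists.
e = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
right = [[e[0], e[1], e[2], e[3]]]
left = [[e[3], e[0], e[1], e[2]]]
volume = correlation_cost_volume(left, right, max_disp=3)
disparity = winner_take_all(volume)
```

In a learned network the features would come from the feature extractor and the volume would be processed by further layers rather than read out with a hard argmax; the sketch only shows why matching is constrained to candidate disparities along the same row.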
The main contributions of the paper are: the first attempt to apply VFMs to stereo matching; a lightweight patch attention fusion module (PAFM) that learns local and global feature weighting parameters separately; a discussion of the limitations of networks that rely solely on the cross-attention mechanism for stereo matching; and a demonstration of ViTAStereo's superior performance and generalization ability on multiple public datasets. In conclusion, the paper aims to improve the performance and generalization ability of stereo matching by leveraging the deep features of VFMs. It proposes a new adapter and validates its effectiveness through experiments.
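The reported metric, the percentage of error pixels at a 3-pixel tolerance, can be sketched as follows. This is a simplified illustration, not the official KITTI evaluation code. It assumes that non-positive ground-truth values mark pixels without a measurement, and it assumes the 7.9% figure is read as a relative reduction. The function names and all numbers in the example are made up.

```python
def error_pixel_percentage(pred, gt, tolerance=3.0):
    """Percentage of valid pixels whose absolute disparity error exceeds tolerance.

    Assumption: disparity maps are nested lists [row][column], and a
    ground-truth value <= 0 means "no measurement" and is skipped.
    """
    errors = 0
    valid = 0
    for row_pred, row_gt in zip(pred, gt):
        for d_pred, d_gt in zip(row_pred, row_gt):
            if d_gt <= 0:
                continue
            valid += 1
            if abs(d_pred - d_gt) > tolerance:
                errors += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return 100.0 * errors / valid


def relative_reduction(baseline, improved):
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - improved) / baseline


# Made-up disparities: five valid pixels, two of them off by more than 3 px.
pred = [[10.0, 20.0, 5.0], [7.0, 0.0, 12.0]]
gt = [[10.5, 24.5, 0.0], [7.0, 3.0, 16.0]]
rate = error_pixel_percentage(pred, gt)

# Made-up error rates, only to show how a relative reduction is computed.
gain = relative_reduction(2.0, 1.5)
```

Note that an error of exactly 3 pixels is not counted as an error pixel in this sketch, since the comparison is strict.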

Playing to Vision Foundation Model's Strengths in Stereo Matching

Faster Self-adaptive Deep Stereo

Stereo Matching Using Multi-Level Cost Volume and Multi-Scale Feature Constancy

Neural Markov Random Field for Stereo Matching

A Transformer-Based Architecture for High-Resolution Stereo Matching

SCV-Stereo: Learning Stereo Matching from a Sparse Cost Volume

Better Stereo Matching from Simple Yet Effective Wrangling of Deep Features

MVSFormer: Multi-View Stereo by Learning Robust Image Features and Temperature-based Depth

PVStereo: Pyramid Voting Module for End-to-End Self-Supervised Stereo Matching

Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation

Ghost-Stereo: GhostNet-based Cost Volume Enhancement and Aggregation for Stereo Matching Networks

Exploiting Semantic and Boundary Information for Stereo Matching

An application of stereo matching algorithm based on transfer learning on robots in multiple scenes

Deep Stereo Matching With Hysteresis Attention and Supervised Cost Volume Construction

CGFNet: 3D Convolution Guided and Multi-scale Volume Fusion Network for fast and robust stereo matching

A Joint 2D-3D Complementary Network for Stereo Matching

Improving Stereo Matching by Incorporating Geometry Prior into Convnet

Accurate and Efficient Stereo Matching via Attention Concatenation Volume

Stereo Matching by Self-supervision of Multiscopic Vision

ChiTransformer: Towards Reliable Stereo from Cues