Abstract: Stereo matching has become a key technique for 3D environment perception in intelligent vehicles. For a considerable time, convolutional neural networks (CNNs) have remained the mainstream choice for feature extraction in this domain. Nonetheless, there is a growing consensus that the existing paradigm should evolve towards vision foundation models (VFMs), particularly those built on vision Transformers (ViTs) and pre-trained through self-supervision on extensive, unlabeled datasets. While VFMs are adept at extracting informative, general-purpose visual features, specifically for dense prediction tasks, their performance often falls short in geometric vision tasks. This study is the first exploration of a viable approach for adapting VFMs to stereo matching. Our ViT adapter, referred to as ViTAS, is built from three types of modules: spatial differentiation, patch attention fusion, and cross-attention. The first module initializes feature pyramids, while the latter two aggregate stereo and multi-scale contextual information, respectively, into fine-grained features. ViTAStereo, which combines ViTAS with cost volume-based stereo matching back-end processes, achieves the top rank on the KITTI Stereo 2012 dataset and outperforms the second-best network, StereoBase, by approximately 7.9% in terms of the percentage of error pixels at a tolerance of 3 pixels. Additional experiments across diverse scenarios further demonstrate its superior generalizability compared to all other state-of-the-art approaches. We believe this new paradigm will pave the way for the next generation of stereo matching networks.
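
The error-pixel metric quoted in the abstract can be illustrated with a minimal sketch. It assumes the common definition: the share of pixels with valid ground truth whose absolute disparity error exceeds a tolerance (3 pixels here). The function names `bad_pixel_rate` and `relative_improvement` are hypothetical, disparity maps are passed as flattened sequences with `None` marking pixels without ground truth, and reading the 7.9% figure as a relative reduction is an assumption rather than something the abstract states.

```python
from typing import Optional, Sequence


def bad_pixel_rate(
    pred: Sequence[float],
    gt: Sequence[Optional[float]],
    tau: float = 3.0,
) -> float:
    """Percentage of valid pixels whose absolute disparity error exceeds tau.

    pred and gt are flattened disparity maps of equal length; a gt entry of
    None means no ground truth is available for that pixel (assumed convention).
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of pixels")
    valid = 0
    bad = 0
    for p, g in zip(pred, gt):
        if g is None:
            continue  # skip pixels without ground truth
        valid += 1
        if abs(p - g) > tau:
            bad += 1
    if valid == 0:
        raise ValueError("no valid ground-truth pixels")
    return 100.0 * bad / valid


def relative_improvement(baseline_rate: float, new_rate: float) -> float:
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline_rate - new_rate) / baseline_rate


# Toy example: three valid pixels, one of which is off by more than 3 px.
pred = [10.0, 20.0, 30.0, 40.0]
gt = [10.5, 24.5, None, 40.0]
rate = bad_pixel_rate(pred, gt)
```

In this toy case one of the three valid pixels exceeds the tolerance, so `rate` is one third expressed as a percentage. The benchmark's official evaluation may add further conventions (for example, separate scores for occluded and non-occluded regions) that this sketch does not model.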

Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data

High-Quality Depth Recovery Via Interactive Multi-view Stereo

MA-Stereo: Real-Time Stereo Matching Via Multi-Scale Attention Fusion and Spatial Error-Aware Refinement

Revisiting Depth Completion from a Stereo Matching Perspective for Cross-domain Generalization

Robust stereo matching with surface normal prediction

Playing to Vision Foundation Model's Strengths in Stereo Matching

SpatialMe: Stereo Video Conversion Using Depth-Warping and Blend-Inpainting

StereoCrafter-Zero: Zero-Shot Stereo Video Generation with Noisy Restart

Temporally Consistent Stereo Matching

Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light

Single View Stereo Matching

Monocular Contextual Constraint for Stereo Matching with Adaptive Weights Assignment

Better Stereo Matching from Simple Yet Effective Wrangling of Deep Features

Stereo Matching by Self-supervision of Multiscopic Vision

ImmersePro: End-to-End Stereo Video Synthesis Via Implicit Disparity Learning

Depth-aware Volume Attention for Texture-less Stereo Matching

On the confidence of stereo matching in a deep-learning era: a quantitative evaluation

Stereo-Depth Fusion through Virtual Pattern Projection

ActiveZero++: Mixed Domain Learning Stereo and Confidence-based Depth Completion with Zero Annotation

Stereo matching from monocular images using feature consistency