Abstract: As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work we propose AIO-Stereo, which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we propose a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on this mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranks $1^{st}$ on the Middlebury dataset, and outperforms all published work on the ETH3D benchmark.

What problem does this paper attempt to address?

This paper addresses the insufficient feature extraction ability of existing stereo matching models. Although recent methods based on iterative optimization have made significant progress, their feature extraction still has room for improvement. The main problems are:

1. **Low-quality feature extraction**: Existing stereo matching models mainly focus on the design of iterative update mechanisms and neglect the feature extraction ability of the encoder. This makes it difficult for the model to learn global and contextual information.
2. **Limited data volume**: Datasets for stereo matching are relatively small and mostly synthetic, making it difficult for the model to learn general representations from them.
3. **Feature conflict**: Because they differ in training data, methods, and tasks, different vision foundation models (VFMs) produce differing and sometimes conflicting feature representations. Directly combining the features of multiple VFMs leads to feature conflicts and degrades model performance.

To solve these problems, the authors propose a framework named AIO-Stereo, which improves stereo matching models in the following ways:

- **Multi-source knowledge transfer**: Selectively transfer knowledge from multiple heterogeneous VFMs to enhance the feature extraction ability of the stereo matching model.
- **Dual-level feature utilization mechanism**: Align features between heterogeneous models and transfer multi-level knowledge.
- **Selective knowledge transfer module**: Introduce a dual-level selective knowledge transfer module that selectively transfers knowledge and makes full use of the advantages of multiple VFMs.

Through these methods, AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranking first on the Middlebury dataset and outperforming all published work on the ETH3D benchmark.

### Summary

The main goal of this paper is to improve the feature extraction ability of stereo matching models by transferring knowledge from vision foundation models, thereby improving their performance in practical applications.
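The idea of aligning features from heterogeneous VFMs and then selecting among them can be illustrated with a small sketch. This is not the paper's module: the linear projections, the dot-product affinity used for selection, the mean-squared transfer loss, and every name below (`project`, `select_and_fuse`, `transfer_loss`, the toy vectors) are assumptions chosen only to make the three steps concrete at a single feature level.

```python
import math

def project(feature, weights):
    """Align a VFM feature to the stereo feature width with a linear map (assumed)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_and_fuse(stereo_feature, aligned_features):
    """Weight each aligned VFM feature by its dot-product affinity to the stereo feature."""
    scores = [sum(a * b for a, b in zip(stereo_feature, f)) for f in aligned_features]
    gates = softmax(scores)
    fused = [
        sum(g * f[i] for g, f in zip(gates, aligned_features))
        for i in range(len(stereo_feature))
    ]
    return gates, fused

def transfer_loss(stereo_feature, target):
    """Mean squared error pulling the stereo feature toward the fused VFM target."""
    return sum((s - t) ** 2 for s, t in zip(stereo_feature, target)) / len(stereo_feature)

# Toy example: one stereo feature and two hypothetical VFMs with different widths.
stereo = [1.0, 0.0]
vfm_a = [1.0, 0.0, 0.0]          # hypothetical 3-channel VFM feature
vfm_b = [0.0, 2.0, 0.0, 0.0]     # hypothetical 4-channel VFM feature
proj_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
proj_b = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]

aligned = [project(vfm_a, proj_a), project(vfm_b, proj_b)]
gates, fused = select_and_fuse(stereo, aligned)
loss = transfer_loss(stereo, fused)
```

In this toy case the first VFM feature agrees with the stereo feature, so it receives the larger gate and the conflicting second feature is down-weighted rather than averaged in equally. A real system would presumably learn the projections and the selection weights and apply them to dense feature maps at more than one level.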

All-in-One: Transferring Vision Foundation Models into Stereo Matching

Playing to Vision Foundation Model's Strengths in Stereo Matching

Stereo Anything: Unifying Stereo Matching with Large-Scale Mixed Data

Better Stereo Matching from Simple Yet Effective Wrangling of Deep Features

MC-Stereo: Multi-peak Lookup and Cascade Search Range for Stereo Matching

Stereo Matching Using Multi-Level Cost Volume and Multi-Scale Feature Constancy

One to Transfer All: A Universal Transfer Framework for Vision Foundation Model with Few Data

A New Principle toward Robust Matching in Human-like Stereovision

ViM: Vision Middleware for Unified Downstream Transferring

Stereo matching from monocular images using feature consistency

EAI-Stereo: Error Aware Iterative Network for Stereo Matching

Stereo Anywhere: Robust Zero-Shot Deep Stereo Matching Even Where Either Stereo or Mono Fail

All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

A Transformer-Based Architecture for High-Resolution Stereo Matching

Multi-Dimensional Cooperative Network for Stereo Matching

High-Frequency Stereo Matching Network

Practical Stereo Matching via Cascaded Recurrent Network with Adaptive Correlation

End-to-end information fusion method for transformer-based stereo matching

Potential efficacy of interleukin-1β inhibition in lung cancer

A Joint 2D-3D Complementary Network for Stereo Matching

A Multimodal-Based Feature Generalization Model for Binocular Stereo Matching