All-in-One: Transferring Vision Foundation Models into Stereo Matching

Jingyi Zhou, Haoyu Zhang, Jiakang Yuan, Peng Ye, Tao Chen, Hao Jiang, Meiya Chen, Yangyang Zhang
2024-12-13
Abstract: As a fundamental vision task, stereo matching has made remarkable progress. While recent iterative optimization-based methods have achieved promising performance, their feature extraction capabilities still have room for improvement. Inspired by the ability of vision foundation models (VFMs) to extract general representations, in this work we propose AIO-Stereo, which can flexibly select and transfer knowledge from multiple heterogeneous VFMs to a single stereo matching model. To better reconcile features between heterogeneous VFMs and the stereo matching model and fully exploit prior knowledge from VFMs, we propose a dual-level feature utilization mechanism that aligns heterogeneous features and transfers multi-level knowledge. Based on this mechanism, a dual-level selective knowledge transfer module is designed to selectively transfer knowledge and integrate the advantages of multiple VFMs. Experimental results show that AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranks $1^{st}$ on the Middlebury dataset, and outperforms all published work on the ETH3D benchmark.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses the limited feature extraction ability of existing stereo matching models. Although recent iterative optimization-based methods have made significant progress on the task, their feature extraction still has room for improvement. The main problems are:

1. **Low-quality feature extraction**: Existing stereo matching models focus mainly on the design of iterative update mechanisms and neglect the feature extraction ability of the encoder. This makes it difficult for the model to learn global and contextual information.
2. **Limited data volume**: Stereo matching datasets are relatively small and mostly synthetic, which makes it difficult for the model to learn general representations.
3. **Feature conflict**: Because they differ in training data, training methods, and tasks, different vision foundation models (VFMs) produce feature representations that differ from and conflict with one another. Directly combining the features of multiple VFMs therefore causes feature conflicts and degrades model performance.

To solve these problems, the authors propose a framework named AIO-Stereo, which improves stereo matching models in the following ways:

- **Multi-source knowledge transfer**: Selectively transfer knowledge from multiple heterogeneous VFMs to strengthen the feature extraction ability of the stereo matching model.
- **Dual-level feature utilization mechanism**: Align features between the heterogeneous models and transfer multi-level knowledge.
- **Selective knowledge transfer module**: Introduce a dual-level selective knowledge transfer module that selectively transfers knowledge and combines the advantages of multiple VFMs.
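The alignment and selection ideas in the list above can be illustrated with a small sketch. This is not the paper's implementation: the function names (`project`, `selective_fuse`), the linear projection used for alignment, and the dot-product softmax gate used for selection are all assumptions chosen for illustration, and the example handles a single feature vector rather than full feature maps.

```python
import math


def project(feature, weight):
    """Align one VFM feature to the stereo model's channel size (weight @ feature)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def selective_fuse(stereo_feat, vfm_feats, proj_weights):
    """Hypothetical selective transfer for one spatial location.

    1. Align: project each VFM feature to the stereo feature's dimension.
    2. Select: score each aligned feature against the stereo feature
       (scaled dot product, an assumed choice) and normalise with softmax.
    3. Transfer: add the gate-weighted aligned features to the stereo feature.
    """
    aligned = [project(f, w) for f, w in zip(vfm_feats, proj_weights)]
    scale = math.sqrt(len(stereo_feat))
    scores = [
        sum(a * s for a, s in zip(vec, stereo_feat)) / scale for vec in aligned
    ]
    gates = softmax(scores)
    fused = list(stereo_feat)
    for gate, vec in zip(gates, aligned):
        for i, value in enumerate(vec):
            fused[i] += gate * value
    return fused, gates


# Toy example: two "VFMs" with different channel sizes (3 and 2).
stereo_feat = [1.0, 0.0]
vfm_feats = [[1.0, 0.0, 0.0], [0.0, 1.0]]
proj_weights = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # maps 3 channels to 2
    [[1.0, 0.0], [0.0, 1.0]],            # identity on 2 channels
]
fused, gates = selective_fuse(stereo_feat, vfm_feats, proj_weights)
print("gates:", gates)
print("fused:", fused)
```

In this toy setup the first VFM's aligned feature points in the same direction as the stereo feature, so it receives the larger gate. In a trained model the projection weights and the gating function would be learned, and the paper's dual-level design may differ substantially from this single-level sketch.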
Through these methods, AIO-Stereo achieves state-of-the-art performance on multiple datasets, ranking first on the Middlebury dataset and outperforming all published work on the ETH3D benchmark.

### Summary

The main goal of this paper is to improve the feature extraction ability of stereo matching models by transferring knowledge from vision foundation models, and thereby to improve their performance in practical applications.