Abstract: The generalization and performance of stereo matching networks are limited by the domain gap of existing synthetic datasets and the sparseness of ground-truth (GT) labels in real datasets. In contrast, monocular depth estimation has achieved significant advancements, benefiting from large-scale depth datasets and self-supervised strategies. To bridge the performance gap between monocular depth estimation and stereo matching, we propose leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo. We introduce knowledge transfer with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning. In the pre-training stage, we design a data generation pipeline that synthesizes stereo training data from monocular images. This pipeline uses monocular depth for warping and novel view synthesis and employs our proposed Edge-Aware (EA) inpainting module to fill in missing content in the generated images. In the fine-tuning stage, we introduce a Sparse-to-Dense Knowledge Distillation (S2DKD) strategy that encourages the distributions of predictions to align with dense monocular depths. This strategy mitigates edge blurring caused by sparse real-world labels and enhances overall consistency. Experimental results demonstrate that our pre-trained model exhibits strong zero-shot generalization capabilities. Furthermore, domain-specific fine-tuning using our pre-trained model and the S2DKD strategy significantly improves in-domain performance. The code will be made available soon.

What problem does this paper attempt to address?

This paper aims to solve two key problems in stereo matching:

1. **Domain Gap**: The differences between existing synthetic datasets and real-world data limit the generalization ability of models. Most synthetic datasets only contain indoor scenes, while stereo-matching scenes in practical applications are usually outdoor scenes. This makes it difficult for models trained on synthetic data to adapt well to real-world data.
2. **Sparse Labels**: Annotations in real-world datasets are usually very sparse, especially in outdoor datasets obtained using LiDAR. These sparse labels cannot provide sufficient supervision signals, which limits model performance in terms of details and consistency.

To solve these problems, the authors propose a new framework, **Mono2Stereo**, which enhances the performance and generalization ability of stereo-matching models through knowledge transfer from monocular depth estimation. Specifically, they design a two-stage training process:

- **Pre-training Stage**: Generate realistic stereo image pairs from monocular images to construct a high-quality pre-training dataset.
- **Fine-tuning Stage**: Introduce a Sparse-to-Dense Knowledge Distillation (S2DKD) strategy that uses the results of monocular depth estimation to supplement the information missing from sparse labels, especially in edge and detail areas.

### Main Contributions

1. **Propose a stereo data generation framework**: Use monocular images and an Edge-Aware inpainting module to generate highly realistic stereo training data.
2. **Introduce the Sparse-to-Dense Knowledge Distillation (S2DKD) strategy**: During fine-tuning, use the knowledge of monocular depth estimation to enhance edge details, especially when labels are sparse.
3. **Emphasize the importance of monocular depth estimation for deep stereo networks**: Extensive experiments show that models trained with this method achieve state-of-the-art results on multiple datasets.

### Formula Summary

- **Definition of Disparity Map**:
  \[ D(i) = x_{l}(i) - x_{r}(i') \]
  where \(D(i)\) is the disparity value of pixel \(i\), and \(x_{l}(i)\) and \(x_{r}(i')\) are the horizontal coordinates of pixel \(i\) in the left view and its corresponding pixel \(i'\) in the right view, respectively.
- **Relative Disparity Transformation**:
  \[ D'_{\text{mono}} = f \cdot D_{\text{mono}} \]
  where \(f \in [d_{\text{min}}, d_{\text{max}}]\) is a scaling factor randomly sampled from the uniform distribution \(U(d_{\text{min}}, d_{\text{max}})\).
- **KL Divergence Loss Function**:
  \[ L_{\text{KL}} = \sum_{i,j} \text{KL}\big(D_{\text{out}}(i,j), D_{\text{mono}}(i,j)\big) \]
  where \(\text{KL}(\cdot)\) represents the KL divergence.
- **Final Loss Function**:
  \[ L = L_{\text{sparse}} + \alpha L_{\text{KL}} \]
  where \(\alpha\) is a weighting factor that controls the proportion of the two losses.

Through these methods, the authors improve the performance and generalization ability of stereo-matching models, especially on real-world data.
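
The formulas above can be illustrated with a small NumPy sketch. This is not the authors' implementation; it only shows how the pieces might fit together under several assumptions. All function names are hypothetical. The warp is a simple nearest-pixel forward warp that stands in for the view synthesis step and returns a hole mask where an inpainting module would be needed. \(L_{\text{sparse}}\) is assumed to be a mean absolute error over pixels with valid labels, and the KL term is assumed to compare softmax-normalized disparity maps, since the summary does not specify how the maps are turned into distributions or in which direction the divergence is taken.

```python
import numpy as np


def scale_relative_disparity(d_mono, d_min, d_max, rng):
    """D'_mono = f * D_mono, with f drawn from U(d_min, d_max)."""
    f = rng.uniform(d_min, d_max)
    return f * d_mono, f


def forward_warp_to_right(left, disparity):
    """Move each left pixel at column x to column x - D in the right view.

    Uses x_r = x_l - D from the disparity definition. When two source pixels
    land on the same target, the one with the larger disparity (closer to the
    camera) wins. Returns the warped image and a boolean hole mask that is
    True where no source pixel landed.
    """
    h, w = disparity.shape
    right = np.zeros_like(left)
    nearest = np.full((h, w), -np.inf)  # largest disparity written so far
    for y in range(h):
        for x in range(w):
            d = disparity[y, x]
            xr = int(np.rint(x - d))
            if 0 <= xr < w and d > nearest[y, xr]:
                nearest[y, xr] = d
                right[y, xr] = left[y, x]
    holes = np.isinf(nearest)
    return right, holes


def _softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def kl_term(d_out, d_mono, eps=1e-12):
    """Assumed form of L_KL: KL between softmax-normalized disparity maps."""
    p = _softmax(d_out.ravel())
    q = _softmax(d_mono.ravel())
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def total_loss(d_out, d_sparse, valid, d_mono, alpha):
    """L = L_sparse + alpha * L_KL, with L_sparse assumed to be a masked L1."""
    if valid.any():
        l_sparse = float(np.abs(d_out[valid] - d_sparse[valid]).mean())
    else:
        l_sparse = 0.0
    return l_sparse + alpha * kl_term(d_out, d_mono)
```

In this sketch, the hole mask returned by `forward_warp_to_right` marks the disoccluded pixels that the paper's Edge-Aware inpainting module would fill, and `valid` marks the pixels where a sparse (for example LiDAR-derived) label exists.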

Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo Matching
