Mono2Stereo: Monocular Knowledge Transfer for Enhanced Stereo Matching

Yuran Wang, Yingping Liang, Hesong Li, Ying Fu
2024-11-14
Abstract: The generalization and performance of stereo matching networks are limited by the domain gap of existing synthetic datasets and the sparseness of ground-truth (GT) labels in real datasets. In contrast, monocular depth estimation has achieved significant advancements, benefiting from large-scale depth datasets and self-supervised strategies. To bridge the performance gap between monocular depth estimation and stereo matching, we propose leveraging monocular knowledge transfer to enhance stereo matching, namely Mono2Stereo. We introduce knowledge transfer with a two-stage training process, comprising synthetic data pre-training and real-world data fine-tuning. In the pre-training stage, we design a data generation pipeline that synthesizes stereo training data from monocular images. This pipeline utilizes monocular depth for warping and novel view synthesis and employs our proposed Edge-Aware (EA) inpainting module to fill in missing contents in the generated images. In the fine-tuning stage, we introduce a Sparse-to-Dense Knowledge Distillation (S2DKD) strategy that encourages the distributions of predictions to align with dense monocular depths. This strategy mitigates issues with edge blurring in sparse real-world labels and enhances overall consistency. Experimental results demonstrate that our pre-trained model exhibits strong zero-shot generalization capabilities. Furthermore, domain-specific fine-tuning using our pre-trained model and S2DKD strategy significantly improves in-domain performance. The code will be made available soon.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets two key problems in stereo matching:

1. **Domain Gap**: Existing synthetic datasets differ from real-world data, which limits how well models generalize. Most synthetic datasets only contain indoor scenes, while stereo matching in practical applications usually involves outdoor scenes, so models trained on synthetic data adapt poorly to real-world data.
2. **Sparse Labels**: Annotations in real-world datasets are usually very sparse, especially in outdoor datasets collected with LiDAR. Such labels provide too little supervision, which limits the detail and consistency of the predictions.

To address these problems, the authors propose **Mono2Stereo**, a framework that improves the performance and generalization of stereo-matching models by transferring knowledge from monocular depth estimation. It uses a two-stage training process:

- **Pre-training Stage**: Generate realistic stereo image pairs from monocular images to construct a high-quality pre-training dataset.
- **Fine-tuning Stage**: Introduce a Sparse-to-Dense Knowledge Distillation (S2DKD) strategy that uses monocular depth estimates to supplement the information missing from sparse labels, especially in edge and detail regions.

### Main Contributions

1. **A stereo data generation framework**: Monocular images and an Edge-Aware inpainting module are used to generate highly realistic stereo training data.
2. **The Sparse-to-Dense Knowledge Distillation (S2DKD) strategy**: During fine-tuning, knowledge from monocular depth estimation is used to enhance edge details when labels are sparse.
3. **Evidence for the importance of monocular depth estimation in deep stereo networks**: Extensive experiments show that models trained with this method achieve state-of-the-art results on multiple datasets.

### Formula Summary

- **Definition of Disparity Map**:
  \[ D(i) = x_{l}(i) - x_{r}(i') \]
  where \(D(i)\) is the disparity of pixel \(i\), and \(x_{l}(i)\) and \(x_{r}(i')\) are the horizontal coordinates of pixel \(i\) in the left view and of its corresponding pixel \(i'\) in the right view.
- **Relative Disparity Transformation**:
  \[ D'_{\text{mono}} = f \cdot D_{\text{mono}} \]
  where the scaling factor \(f \in [d_{\text{min}}, d_{\text{max}}]\) is randomly sampled from the uniform distribution \(U(d_{\text{min}}, d_{\text{max}})\).
- **KL Divergence Loss Function**:
  \[ L_{\text{KL}} = \sum_{i,j} \text{KL}\left(D_{\text{out}}(i,j), D_{\text{mono}}(i,j)\right) \]
  where \(\text{KL}(\cdot)\) denotes the KL divergence, \(D_{\text{out}}\) is the stereo prediction, and \(D_{\text{mono}}\) is the dense monocular estimate.
- **Final Loss Function**:
  \[ L = L_{\text{sparse}} + \alpha L_{\text{KL}} \]
  where \(\alpha\) is a weighting factor that balances the two losses.

Together, these components improve the performance and generalization of stereo-matching models, particularly on real-world data.
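The formulas in the summary can be sketched in NumPy to show how the pieces fit together. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the sparse term is assumed to be a mean absolute error over labelled pixels, each disparity map is assumed to be turned into a distribution by a softmax over its pixels, and the KL direction follows the argument order written in the summary. The summary itself does not specify any of these choices.

```python
import numpy as np


def scale_relative_disparity(d_mono, d_min, d_max, rng):
    """D'_mono = f * D_mono, with f drawn from U(d_min, d_max)."""
    f = rng.uniform(d_min, d_max)
    return f * d_mono, f


def right_view_x(x_left, disparity):
    """Rearranged disparity definition: x_r(i') = x_l(i) - D(i)."""
    return x_left - disparity


def _softmax(x):
    """Numerically stable softmax over a flat array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def kl_term(d_out, d_mono, eps=1e-12):
    """KL(P_out || P_mono) between two disparity maps.

    Assumption: each map becomes a distribution via a softmax over pixels.
    """
    p_out = _softmax(d_out.ravel())
    p_mono = _softmax(d_mono.ravel())
    return float(np.sum(p_out * (np.log(p_out + eps) - np.log(p_mono + eps))))


def sparse_term(d_out, d_gt, valid):
    """Assumed form of L_sparse: mean absolute error on labelled pixels only."""
    if not valid.any():
        return 0.0
    return float(np.abs(d_out[valid] - d_gt[valid]).mean())


def total_loss(d_out, d_gt, valid, d_mono, alpha=0.1):
    """L = L_sparse + alpha * L_KL (alpha=0.1 is a placeholder value)."""
    return sparse_term(d_out, d_gt, valid) + alpha * kl_term(d_out, d_mono)
```

In this sketch, `valid` is a boolean mask marking the pixels that carry a LiDAR label. The sparse term only sees those pixels, while the KL term compares the prediction with the monocular estimate at every pixel, which is what lets the dense monocular map supervise regions the sparse labels miss.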