X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task Distillation

Hong Cai, Janarbek Matai, Shubhankar Borse, Yizhe Zhang, Amin Ansari, Fatih Porikli
DOI: https://doi.org/10.48550/arXiv.2110.12516
2021-10-25
Abstract: In this paper, we propose a novel method, X-Distill, to improve the self-supervised training of monocular depth via cross-task knowledge distillation from semantic segmentation to depth estimation. More specifically, during training, we utilize a pretrained semantic segmentation teacher network and transfer its semantic knowledge to the depth network. In order to enable such knowledge distillation across two different visual tasks, we introduce a small, trainable network that translates the predicted depth map to a semantic segmentation map, which can then be supervised by the teacher network. In this way, this small network enables the backpropagation from the semantic segmentation teacher's supervision to the depth network during training. In addition, since the commonly used object classes in semantic segmentation are not directly transferable to depth, we study the visual and geometric characteristics of the objects and design a new way of grouping them that can be shared by both tasks. It is noteworthy that our approach only modifies the training process and does not incur additional computation during inference. We extensively evaluate the efficacy of our proposed approach on the standard KITTI benchmark and compare it with the latest state of the art. We further test the generalizability of our approach on Make3D. Overall, the results show that our approach significantly improves the depth estimation accuracy and outperforms the state of the art.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to improve the self-supervised training of monocular depth estimation through cross-task knowledge distillation from semantic segmentation to depth estimation. Specifically, the authors propose a method named X-Distill, which uses a pretrained semantic segmentation teacher network to transfer its semantic knowledge to the depth network, thereby improving the depth network's understanding of visual scenes. The method addresses two key challenges:
1. **Cross-task knowledge transfer**: Depth estimation and semantic segmentation are different visual tasks whose outputs are not directly comparable, so transferring knowledge between them is not straightforward. X-Distill introduces a small, trainable network, the Depth-to-Segmentation network (D2S), that translates the predicted depth map into a semantic segmentation map. The teacher network supervises this map, and the supervision signal backpropagates through D2S to the depth network.
2. **Regrouping of semantic categories**: Conventional semantic segmentation categories are usually too fine-grained to apply directly to depth estimation. For example, roads and sidewalks are usually separate categories in semantic segmentation, but in a depth map both lie on the ground and show similar patterns of depth change. X-Distill therefore redesigns the categories, grouping objects with similar visual and geometric properties so that the grouping can be shared by both tasks.
Because these changes affect only the training process, X-Distill improves depth estimation accuracy without requiring any additional computation to process or generate semantic information at inference time.
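The two ideas above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. Only the road/sidewalk merge comes from the text; every other entry in `CLASS_TO_GROUP`, the function names, and the choice of cross-entropy as the distillation loss are assumptions made for illustration. The per-pixel logits stand in for the output of the D2S network.

```python
import math

# Hypothetical grouping table. Only "road" and "sidewalk" sharing a group
# is stated in the text; all other entries are illustrative assumptions.
CLASS_TO_GROUP = {
    "road": "ground",
    "sidewalk": "ground",
    "building": "structure",
    "wall": "structure",
    "car": "vehicle",
    "truck": "vehicle",
    "sky": "sky",
}
# Sorted so that group indices are deterministic.
GROUPS = sorted(set(CLASS_TO_GROUP.values()))


def regroup(teacher_labels):
    """Map per-pixel teacher class names to coarse group indices."""
    return [GROUPS.index(CLASS_TO_GROUP[name]) for name in teacher_labels]


def softmax(logits):
    """Numerically stable softmax over one pixel's group logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(d2s_logits, teacher_labels):
    """Mean cross-entropy between D2S outputs and regrouped teacher labels.

    d2s_logits: one list of per-group logits for each pixel (assumed to be
    produced by the D2S network from the predicted depth map).
    teacher_labels: one teacher class name for each pixel.
    """
    targets = regroup(teacher_labels)
    per_pixel = [
        -math.log(softmax(logits)[target])
        for logits, target in zip(d2s_logits, targets)
    ]
    return sum(per_pixel) / len(per_pixel)
```

In an actual training loop this loss would be computed with an automatic differentiation framework so that its gradient flows through D2S into the depth network; the sketch only shows how the regrouped teacher labels enter the loss.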
The paper evaluates X-Distill extensively on the standard KITTI benchmark and compares it with the latest self-supervised monocular depth estimation methods. The results show that X-Distill significantly improves depth estimation accuracy and outperforms existing methods on multiple metrics. The paper also tests the method's generalization ability on the Make3D dataset.