Abstract: In this paper, we propose a novel method, X-Distill, to improve the self-supervised training of monocular depth via cross-task knowledge distillation from semantic segmentation to depth estimation. More specifically, during training, we utilize a pretrained semantic segmentation teacher network and transfer its semantic knowledge to the depth network. In order to enable such knowledge distillation across two different visual tasks, we introduce a small, trainable network that translates the predicted depth map to a semantic segmentation map, which can then be supervised by the teacher network. In this way, this small network enables the backpropagation from the semantic segmentation teacher's supervision to the depth network during training. In addition, since the commonly used object classes in semantic segmentation are not directly transferable to depth, we study the visual and geometric characteristics of the objects and design a new way of grouping them that can be shared by both tasks. It is noteworthy that our approach only modifies the training process and does not incur additional computation during inference. We extensively evaluate the efficacy of our proposed approach on the standard KITTI benchmark and compare it with the latest state of the art. We further test the generalizability of our approach on Make3D. Overall, the results show that our approach significantly improves the depth estimation accuracy and outperforms the state of the art.

What problem does this paper attempt to address?

This paper aims to improve the self-supervised training of monocular depth estimation by distilling knowledge across tasks, from semantic segmentation to depth estimation. The authors propose a method named X-Distill, which uses a pretrained semantic segmentation teacher network to transfer semantic knowledge to the depth network and so improve the depth network's understanding of visual scenes. The method addresses two key challenges:

1. **Cross-task knowledge transfer**: Depth estimation and semantic segmentation are different visual tasks whose outputs are not directly comparable, so the teacher cannot supervise the depth network directly. X-Distill introduces a small, trainable network (the Depth-to-Segmentation Network, D2S) that translates the predicted depth map into a semantic segmentation map. The teacher supervises this map, and the supervision signal backpropagates through D2S to the depth network.
2. **Regrouping of semantic categories**: The object classes commonly used in semantic segmentation are too fine-grained to transfer directly to depth. For example, roads and sidewalks are separate classes in semantic segmentation, but in a depth map both lie on the ground and show similar patterns of depth variation. X-Distill therefore groups objects with similar visual and geometric properties so that the same grouping can be shared by both tasks.

X-Distill modifies only the training process, so it requires no additional computation to process or generate semantic information at inference.

The paper evaluates the method extensively on the standard KITTI benchmark and compares it with recent self-supervised monocular depth estimation methods. According to the reported results, X-Distill significantly improves depth estimation accuracy and outperforms existing methods on multiple metrics. The paper also tests the method's generalization on the Make3D dataset.
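The two ideas above, class regrouping and supervising a depth-derived segmentation with teacher labels, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Only the road and sidewalk merge comes from the text; the other groups, the function names, and the choice of a per-pixel cross-entropy loss are assumptions made for illustration, and the logits are hand-written stand-ins for the output of a D2S-style network.

```python
import math

# Hypothetical grouping table. Only "road" + "sidewalk" -> "ground" is stated
# in the text above; every other entry is an illustrative placeholder.
CLASS_TO_GROUP = {
    "road": "ground",
    "sidewalk": "ground",
    "car": "vehicle",
    "truck": "vehicle",
    "building": "structure",
    "wall": "structure",
}
GROUPS = sorted(set(CLASS_TO_GROUP.values()))  # ['ground', 'structure', 'vehicle']
GROUP_INDEX = {name: i for i, name in enumerate(GROUPS)}


def regroup(teacher_labels):
    """Map a 2-D grid of fine-grained class names to coarse group indices."""
    return [[GROUP_INDEX[CLASS_TO_GROUP[c]] for c in row] for row in teacher_labels]


def softmax(logits):
    """Numerically stable softmax over one pixel's group logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(d2s_logits, target_groups):
    """Mean per-pixel cross-entropy (an assumed loss) between the logits of a
    D2S-style network and the teacher's regrouped labels."""
    losses = []
    for logit_row, target_row in zip(d2s_logits, target_groups):
        for logits, target in zip(logit_row, target_row):
            losses.append(-math.log(softmax(logits)[target]))
    return sum(losses) / len(losses)


# Teacher prediction on a toy 2x2 image, using fine-grained class names.
teacher = [["road", "sidewalk"], ["car", "building"]]
targets = regroup(teacher)

# Stand-in for the D2S output: per-pixel logits over the three groups.
logits = [
    [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[0.0, 0.0, 2.0], [0.0, 2.0, 0.0]],
]
loss = distillation_loss(logits, targets)
print(targets, round(loss, 4))
```

In an actual training setup the logits would come from a differentiable network applied to the predicted depth map, so that minimizing this loss also updates the depth network. The sketch only shows how the targets and the loss value are formed.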

X-Distill: Improving Self-Supervised Monocular Depth via Cross-Task Distillation
