Abstract:Holistic scene understanding is pivotal for the performance of autonomous machines. In this paper we propose a new end-to-end model for performing semantic segmentation and depth completion jointly. The vast majority of recent approaches have developed semantic segmentation and depth completion as independent tasks. Our approach relies on RGB and sparse depth as inputs to our model and produces a dense depth map and the corresponding semantic segmentation image. It consists of a feature extractor, a depth completion branch, a semantic segmentation branch and a joint branch which further processes semantic and depth information altogether. The experiments done on Virtual KITTI 2 dataset, demonstrate and provide further evidence, that combining both tasks, semantic segmentation and depth completion, in a multi-task network can effectively improve the performance of each task. Code is available at <a class="link-external link-https" href="https://github.com/juanb09111/semantic" rel="external noopener nofollow">this https URL</a> depth.

What problem does this paper attempt to address?

This paper attempts to address the problem of how to simultaneously perform Semantic Segmentation and Depth Completion using a multi-task network model in computer vision, thereby improving the performance of each task. Specifically, the authors propose a new end-to-end multi-task network model called **SemSegDepth**, which can take RGB images and sparse depth maps as input and output dense depth maps and corresponding semantic segmentation maps. ### Background and Motivation 1. **Importance of Scene Understanding**: - Scene understanding is crucial for the performance of autonomous machines. Humans, when observing objects, can unconsciously attribute multiple properties to what they see and perform multiple tasks simultaneously, such as evaluating the distance, quantity, size, and texture of objects. - In contrast, although machines have surpassed humans in some single tasks, they still lag in performing multiple tasks simultaneously. 2. **Advantages of Multi-task Networks**: - Multi-task networks can not only provide a more complete scene representation but also improve the performance of each task through joint learning. - For example, Panoptic Segmentation, which combines instance segmentation, object detection, and semantic segmentation, has shown excellent performance in multiple tasks. 3. **Limitations of Existing Methods**: - Most existing methods treat semantic segmentation and depth completion as independent tasks and do not fully exploit their interrelationships. - Combining these two tasks requires handling heterogeneous data (RGB images and sparse depth data), which brings additional challenges. ### Main Contributions of the Paper 1. **Proposing a New Multi-task Network Model**: - The **SemSegDepth** model combines two benchmark models: one for depth completion (based on the work of Chen et al.) and the other for semantic segmentation (based on the EfficientPS model). - The model includes a feature extractor, a semantic segmentation branch, a depth completion branch, and a joint branch, which together process semantic and depth information. 2. **Experimental Validation**: - Experimental results on the Virtual KITTI 2 dataset show that the **SemSegDepth** model significantly outperforms baseline models in both semantic segmentation and depth completion tasks. - Ablation experiments further validate the effectiveness of the multi-task network, particularly the performance improvement brought by sharing feature extractor weights and computing joint loss. ### Conclusion - This paper successfully demonstrates that multi-task networks can significantly improve the performance of each task through joint learning. - The **SemSegDepth** model can not only generate dense depth maps but also provide accurate semantic segmentation results, showing potential for practical applications, especially in fields like autonomous driving. Through this multi-task network design, the authors provide a new solution for scene understanding in computer vision, which is expected to play an important role in future autonomous systems.

SemSegDepth: A Combined Model for Semantic Segmentation and Depth Completion

Semantic-guided Depth Completion from Monocular Images and 4D Radar Data

MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

PanDepth: Joint Panoptic Segmentation and Depth Completion

SemAttNet: Towards Attention-based Semantic Aware Guided Depth Completion

Simultaneous Semantic Segmentation and Depth Completion with Constraint of Boundary

Semantic Reconstruction based on RGB Image and Sparse Depth

To Complete or to Estimate, That is the Question: A Multi-Task Approach to Depth Completion and Monocular Depth Estimation

Deep Sparse Depth Completion Using Multi-Scale Residuals and Channel Shuffle

Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation

Object Semantics Give Us the Depth We Need: Multi-task Approach to Aerial Depth Completion

Semantic Segmentation-assisted Scene Completion for LiDAR Point Clouds

DepthSSC: Depth-Spatial Alignment and Dynamic Voxel Resolution for Monocular 3D Semantic Scene Completion

SOSD-Net: Joint Semantic Object Segmentation and Depth Estimation from Monocular Images

PENet: Towards Precise and Efficient Image Guided Depth Completion

On Exploring Shape and Semantic Enhancements for RGB-X Semantic Segmentation

Hybridnet for depth estimation and semantic segmentation

Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

SemHint-MD: Learning from Noisy Semantic Labels for Self-Supervised Monocular Depth Estimation

Incorporating Luminance, Depth and Color Information by a Fusion-based Network for Semantic Segmentation