SemSegDepth: A Combined Model for Semantic Segmentation and Depth Completion

Juan Pablo Lagos, Esa Rahtu
DOI: https://doi.org/10.5220/0010838500003124
2024-03-06
Abstract: Holistic scene understanding is pivotal for the performance of autonomous machines. In this paper we propose a new end-to-end model for performing semantic segmentation and depth completion jointly. The vast majority of recent approaches have developed semantic segmentation and depth completion as independent tasks. Our approach relies on RGB and sparse depth as inputs to our model and produces a dense depth map and the corresponding semantic segmentation image. It consists of a feature extractor, a depth completion branch, a semantic segmentation branch and a joint branch which further processes semantic and depth information together. The experiments on the Virtual KITTI 2 dataset provide further evidence that combining both tasks, semantic segmentation and depth completion, in a multi-task network can effectively improve the performance of each task. Code is available at https://github.com/juanb09111/semantic depth.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper attempts to address the problem of how to simultaneously perform Semantic Segmentation and Depth Completion using a multi-task network model in computer vision, thereby improving the performance of each task. Specifically, the authors propose a new end-to-end multi-task network model called **SemSegDepth**, which can take RGB images and sparse depth maps as input and output dense depth maps and corresponding semantic segmentation maps.

### Background and Motivation

1. **Importance of Scene Understanding**:
   - Scene understanding is crucial for the performance of autonomous machines. Humans, when observing objects, can unconsciously attribute multiple properties to what they see and perform multiple tasks simultaneously, such as evaluating the distance, quantity, size, and texture of objects.
   - In contrast, although machines have surpassed humans in some single tasks, they still lag in performing multiple tasks simultaneously.
2. **Advantages of Multi-task Networks**:
   - Multi-task networks can not only provide a more complete scene representation but also improve the performance of each task through joint learning.
   - For example, Panoptic Segmentation, which combines instance segmentation, object detection, and semantic segmentation, has shown excellent performance in multiple tasks.
3. **Limitations of Existing Methods**:
   - Most existing methods treat semantic segmentation and depth completion as independent tasks and do not fully exploit their interrelationships.
   - Combining these two tasks requires handling heterogeneous data (RGB images and sparse depth data), which brings additional challenges.

### Main Contributions of the Paper

1. **Proposing a New Multi-task Network Model**:
   - The **SemSegDepth** model combines two benchmark models: one for depth completion (based on the work of Chen et al.) and the other for semantic segmentation (based on the EfficientPS model).
   - The model includes a feature extractor, a semantic segmentation branch, a depth completion branch, and a joint branch, which together process semantic and depth information.
2. **Experimental Validation**:
   - Experimental results on the Virtual KITTI 2 dataset show that the **SemSegDepth** model significantly outperforms baseline models in both semantic segmentation and depth completion tasks.
   - Ablation experiments further validate the effectiveness of the multi-task network, particularly the performance improvement brought by sharing feature extractor weights and computing joint loss.

### Conclusion

- This paper successfully demonstrates that multi-task networks can significantly improve the performance of each task through joint learning.
- The **SemSegDepth** model can not only generate dense depth maps but also provide accurate semantic segmentation results, showing potential for practical applications, especially in fields like autonomous driving.

Through this multi-task network design, the authors provide a new solution for scene understanding in computer vision, which is expected to play an important role in future autonomous systems.
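
The summary above mentions a joint loss computed across both tasks but does not give its form. As a minimal sketch of the general idea, and not the authors' implementation, the Python below assumes the joint loss is a weighted sum of a depth error and a per-pixel segmentation cross-entropy. The function names, the weights `w_depth` and `w_sem`, the choice of mean squared error, and the convention that a ground-truth depth of 0 marks a missing pixel are all illustrative assumptions.

```python
import math


def masked_mse(depth_pred, depth_gt):
    """Mean squared depth error over pixels that have ground truth.

    Assumption: a ground-truth value of 0 means "no measurement", so
    those pixels are excluded from the loss.
    """
    pairs = [(p, t) for p, t in zip(depth_pred, depth_gt) if t > 0]
    if not pairs:
        return 0.0
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)


def cross_entropy(logits, labels):
    """Mean per-pixel cross-entropy.

    `logits` holds one list of class scores per pixel; `labels` holds the
    true class index per pixel. Uses the log-sum-exp trick for stability.
    """
    total = 0.0
    for scores, label in zip(logits, labels):
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_sum_exp - scores[label]
    return total / len(labels)


def joint_loss(depth_pred, depth_gt, sem_logits, sem_labels,
               w_depth=1.0, w_sem=1.0):
    """Hypothetical joint objective: weighted sum of the two task losses."""
    return (w_depth * masked_mse(depth_pred, depth_gt)
            + w_sem * cross_entropy(sem_logits, sem_labels))


# Toy example with three pixels flattened into lists.
depth_pred = [1.0, 2.0, 5.0]
depth_gt = [1.0, 4.0, 0.0]          # third pixel has no ground truth
sem_logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
sem_labels = [0, 1, 0]
loss = joint_loss(depth_pred, depth_gt, sem_logits, sem_labels)
```

Because both terms are summed into one scalar, gradients from both tasks would flow back into any shared feature extractor during training, which is the mechanism the ablation summary credits for the improvement. How the paper actually weights or formulates the terms is not stated in this summary.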