TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within A Joint Learning Framework

Guanfeng Tang, Zhiyuan Wu, Jiahang Li, Ping Zhong, Xieyuanli Chen, Huiming Lu, Rui Fan
2024-09-10
Abstract: Semantic segmentation and stereo matching, respectively analogous to the ventral and dorsal streams in our human brain, are two key components of autonomous driving perception systems. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, especially emphasizing feature sharing between the two tasks. The major contributions of this study lie in comprehensively tightening the coupling between semantic segmentation and stereo matching. Specifically, this study introduces three novelties: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI and vKITTI2 datasets, along with qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function, and demonstrate its superior performance compared to prior arts, with a notable increase in mIoU by over 9%. Our source code will be publicly available at mias.group/TiCoSS upon publication.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
This paper addresses how to integrate semantic segmentation and stereo matching more tightly within a joint learning framework so that both tasks improve. Existing methods typically handle the two tasks separately and therefore fail to fully exploit the complementarity between contextual and geometric information. For example, stereo matching networks may produce ambiguous disparity estimates in low-texture or occluded regions, where the pixel-level scene understanding provided by semantic segmentation can help resolve the ambiguity. Conversely, semantic segmentation networks struggle to delineate clear object boundaries in complex driving scenes because they lack spatial geometric information. The paper therefore proposes a new joint learning framework, TiCoSS (Tightly-Coupled Semantic Segmentation and Stereo Matching Network), which tightens the coupling through three key techniques:

1. **Tightly-Coupled Gated Feature Fusion (TGF)**: A series of Selective Inheritance Gates (SIGs) passes useful contextual and geometric information from the previous layer to the current layer, yielding more effective feature fusion.
2. **Hierarchical Deep Supervision (HDS)**: The highest-resolution fused feature maps, which contain the richest local spatial details, guide the deep supervision of each branch.
3. **Coupling Tightening Loss (CT)**: A loss comprising a stereo matching loss, a semantic consistency-guided loss, a disparity inconsistency-aware loss, and a deep supervision consistency constraint loss further strengthens the coupling between the two tasks.

With these techniques, TiCoSS achieves significantly better performance than existing methods on the KITTI and vKITTI2 datasets, improving mIoU in particular by over 9%.
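
The gated feature fusion idea (TGF) described above can be illustrated with a generic gating sketch. This is not the paper's Selective Inheritance Gate, whose internal design is not given in this summary. It is a minimal example that assumes a sigmoid gate computed from the concatenated features through a 1x1-convolution-like projection. The function name `gated_fusion` and the parameters `weight` and `bias` are hypothetical, chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(prev_feat, curr_feat, weight, bias):
    """Blend two (C, H, W) feature maps with a per-pixel, per-channel gate.

    Assumed design (not taken from the paper): the gate is a sigmoid of a
    linear projection of the concatenated features, which acts like a
    1x1 convolution with `weight` of shape (C, 2C) and `bias` of shape (C,).
    """
    stacked = np.concatenate([prev_feat, curr_feat], axis=0)  # (2C, H, W)
    logits = np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]
    gate = sigmoid(logits)  # values in (0, 1)
    # A gate near 1 inherits the previous layer's features;
    # a gate near 0 keeps the current layer's features.
    return gate * prev_feat + (1.0 - gate) * curr_feat


rng = np.random.default_rng(0)
prev = rng.normal(size=(4, 3, 5))
curr = rng.normal(size=(4, 3, 5))
fused = gated_fusion(prev, curr, np.zeros((4, 8)), np.zeros(4))
print(fused.shape)
```

With zero weights and bias, the gate is 0.5 everywhere, so the output is the plain average of the two inputs. In a trained network the projection would be learned, letting each location decide how much information to inherit from the previous layer.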