Abstract: Semantic segmentation and stereo matching, respectively analogous to the ventral and dorsal streams in the human brain, are two key components of autonomous driving perception systems. Addressing these two tasks with separate networks is no longer the mainstream direction in developing computer vision algorithms, particularly with the recent advances in large vision models and embodied artificial intelligence. The trend is shifting towards combining them within a joint learning framework, especially emphasizing feature sharing between the two tasks. The major contributions of this study lie in comprehensively tightening the coupling between semantic segmentation and stereo matching. Specifically, this study introduces three novelties: (1) a tightly coupled, gated feature fusion strategy, (2) a hierarchical deep supervision strategy, and (3) a coupling tightening loss function. The combined use of these technical contributions results in TiCoSS, a state-of-the-art joint learning framework that simultaneously tackles semantic segmentation and stereo matching. Through extensive experiments on the KITTI and vKITTI2 datasets, along with qualitative and quantitative analyses, we validate the effectiveness of our developed strategies and loss function, and demonstrate the superior performance of TiCoSS compared to prior art, with a notable increase in mIoU by over 9%. Our source code will be publicly available at mias.group/TiCoSS upon publication.

What problem does this paper attempt to address?

This paper asks how semantic segmentation and stereo matching can be integrated more tightly within a joint learning framework so that both tasks improve. Existing methods typically handle the two tasks separately and therefore fail to fully exploit the complementarity between contextual and geometric information. For example, stereo matching networks may produce ambiguous disparity estimates in low-texture or occluded regions, where the pixel-level scene understanding provided by semantic segmentation can help resolve the ambiguity. Conversely, semantic segmentation networks struggle to delineate clear object boundaries in complex driving scenes because they lack spatial geometric information.

To address this, the paper proposes a new joint learning framework, TiCoSS (Tightly-Coupled Semantic Segmentation and Stereo Matching Network), which achieves closer integration through three key techniques:

1. **Tightly-Coupled Gated Feature Fusion (TGF)**: A series of Selective Inheritance Gates (SIGs) passes useful contextual and geometric information from the previous layer to the current layer, making feature fusion more effective.
2. **Hierarchical Deep Supervision (HDS)**: The highest-resolution fused feature maps guide the deep supervision of each branch, because these feature maps contain the richest local spatial details.
3. **Coupling Tightening Loss (CT)**: The loss combines a stereo matching loss, a semantic consistency guided loss, a disparity inconsistency aware loss, and a deep supervision consistency constraint loss to further strengthen the coupling between the two tasks.

With these techniques, TiCoSS outperforms existing methods on the KITTI and vKITTI2 datasets, with a reported increase in mIoU of over 9%.
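To make the idea behind Tightly-Coupled Gated Feature Fusion more concrete, here is a minimal sketch of a generic gate that decides, element by element, how much of a previous layer's fused feature to inherit. It is not the paper's Selective Inheritance Gate: the summary does not give the SIG equations, so the function names (`selective_gate`, `fuse_hierarchy`), the sigmoid gate, the scalar `weight` and `bias`, and the use of flat lists of equal length at every level are all assumptions made for illustration. In a real network the gate parameters would be learned and the features would be multi-resolution tensors.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def selective_gate(previous, current, weight=1.0, bias=0.0):
    """Blend the previous layer's feature into the current one.

    Hypothetical gate: g = sigmoid(weight * current + bias), per element.
    A gate near 1 inherits the previous feature, near 0 keeps the current one.
    """
    fused = []
    for p, c in zip(previous, current):
        g = sigmoid(weight * c + bias)
        fused.append(g * p + (1.0 - g) * c)
    return fused


def fuse_hierarchy(context_levels, geometry_levels):
    """Fuse contextual and geometric features level by level.

    Assumption: every level is a flat list of the same length, so no
    resampling between resolutions is needed in this toy version.
    """
    inherited = [0.0] * len(context_levels[0])
    outputs = []
    for ctx, geo in zip(context_levels, geometry_levels):
        # Combine the two branches at this level by simple addition.
        current = [c + g for c, g in zip(ctx, geo)]
        # Let the gate decide how much earlier information to carry forward.
        inherited = selective_gate(inherited, current)
        outputs.append(inherited)
    return outputs


if __name__ == "__main__":
    context = [[0.2, 0.5], [0.1, 0.9]]    # toy contextual features, 2 levels
    geometry = [[0.4, 0.1], [0.3, 0.2]]   # toy geometric features, 2 levels
    for level, fused in enumerate(fuse_hierarchy(context, geometry)):
        print(level, fused)
```

The design point this illustrates is that the gate output is a convex blend: each fused value always lies between the inherited value and the current value, so the gate can only select between the two sources and cannot amplify either of them.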

TiCoSS: Tightening the Coupling between Semantic Segmentation and Stereo Matching within A Joint Learning Framework

S$^3$M-Net: Joint Learning of Semantic Segmentation and Stereo Matching for Autonomous Driving

Co-Teaching: An Ark to Unsupervised Stereo Matching

Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework With Spatio-Temporal Collaboration

STC: Spatio-Temporal Contrastive Learning for Video Instance Segmentation

Joint Semantic Segmentation using representations of LiDAR point clouds and camera images

Robust 3D Semantic Segmentation Method Based on Multi-Modal Collaborative Learning

SSNet: a joint learning network for semantic segmentation and disparity estimation

A Joint 2D-3D Complementary Network for Stereo Matching

A unified and efficient semi-supervised learning framework for stereo matching

S3Net: Innovating Stereo Matching and Semantic Segmentation with a Single-Branch Semantic Stereo Network in Satellite Epipolar Imagery

EAI-Stereo: Error Aware Iterative Network for Stereo Matching

Playing to Vision Foundation Model's Strengths in Stereo Matching

STC: A Simple to Complex Framework for Weakly-Supervised Semantic Segmentation

Brain Cholesterol XVIII: Effect of Methylphenidate (Ritalin) on [U-14C] Glucose and [2-3H] Acetate Incorporation

Bridging Stereo Matching and Optical Flow via Spatiotemporal Correspondence

A Multi-phase Camera-LiDAR Fusion Network for 3D Semantic Segmentation with Weak Supervision

Revisiting Multi-modal 3D Semantic Segmentation in Real-world Autonomous Driving

Superpixel Guided Network for Three-Dimensional Stereo Matching

3D LiDAR and Stereo Fusion using Stereo Matching Network with Conditional Cost Volume Normalization

Learning Spatial and Temporal Variations for 4D Point Cloud Segmentation