C4Net: Excavating Cross-modal Context- and Content-Complementarity for RGB-T Semantic Segmentation

Shenlu Zhao, Jingyi Li, Qiang Zhang
DOI: https://doi.org/10.1109/tcsvt.2024.3485655
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: The complementary properties exhibited by RGB-T data involve context complementarity as well as content complementarity. During cross-modal feature fusion, most existing RGB-T semantic segmentation methods focus on exploiting content-complementary information. Unfortunately, these methods usually overlook cross-modal context-complementary information (i.e., the contextual dependencies among different regions that exist in only one modality) or try to exploit it only implicitly, yielding fragmentary semantic segmentation results. To remedy this problem, this paper presents a novel Cross-modal Context- and Content-Complementarity Network (C4Net) for RGB-T semantic segmentation, in which both the cross-modal context-complementary information and the cross-modal content-complementary information are fully excavated and exploited during cross-modal feature fusion. Specifically, a Context-Complementary Information Aggregation (CxCIA) module is designed, in which the cross-modal context-complementary information is explicitly excavated by measuring the discrepancies between the contextual dependencies of the two modalities. This context-complementary information is then used to enhance the original RGB and thermal contextual dependencies, boosting the integrity of objects in the fused features. In parallel, a Content-Complementary Information Aggregation (CnCIA) module is presented, which highlights the utilization of cross-modal content-complementary information from a multi-scale perspective. Furthermore, an MLP-based Multi-level Feature Interaction (MFI) decoder is presented, in which the semantic gaps among different levels of fused features are mitigated by establishing interactions among multi-level fused features along the spatial and channel dimensions. Comprehensive experimental results on several public datasets demonstrate that the proposed C4Net surpasses other state-of-the-art models.
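The idea of excavating context-complementary information by comparing the contextual dependencies of the two modalities can be illustrated with a small NumPy sketch. This is not the CxCIA module from the paper: the abstract does not give its formulation, so the softmax affinity, the clipped-difference discrepancy measure, and every function name below are assumptions chosen only to make the idea concrete.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def context_affinity(feat):
    """Region-to-region dependencies for one modality.

    feat: (N, C) array holding one C-dimensional feature per region.
    Returns an (N, N) matrix whose rows sum to 1.
    Assumption: scaled dot-product affinity, not the paper's definition.
    """
    scale = np.sqrt(feat.shape[1])
    return softmax(feat @ feat.T / scale, axis=-1)


def context_complementary_fusion(rgb, thermal):
    """Hypothetical fusion driven by affinity discrepancies."""
    a_rgb = context_affinity(rgb)
    a_th = context_affinity(thermal)

    # Assumed discrepancy measure: dependencies that are stronger in one
    # modality than in the other, clipped at zero.
    only_rgb = np.maximum(a_rgb - a_th, 0.0)
    only_th = np.maximum(a_th - a_rgb, 0.0)

    # Enhance each modality's dependencies with what the other contributes.
    # With this particular measure both results equal the elementwise
    # maximum of the two affinity maps.
    a_rgb_enh = a_rgb + only_th
    a_th_enh = a_th + only_rgb

    # Renormalise so each row is again a distribution over regions.
    a_rgb_enh = a_rgb_enh / a_rgb_enh.sum(axis=-1, keepdims=True)
    a_th_enh = a_th_enh / a_th_enh.sum(axis=-1, keepdims=True)

    # Aggregate each modality with its enhanced dependencies and sum.
    fused = a_rgb_enh @ rgb + a_th_enh @ thermal
    return fused, only_rgb, only_th


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(6, 4))      # 6 regions, 4 channels (toy sizes)
    thermal = rng.normal(size=(6, 4))
    fused, only_rgb, only_th = context_complementary_fusion(rgb, thermal)
    print(fused.shape)
```

In this toy version, a dependency between two regions that is visible only in the thermal features appears as a positive entry in `only_th` and is added to the RGB affinity map, which is one simple way to read "enhancing the original RGB and thermal contextual dependencies".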