Abstract: RGB-Thermal Salient Object Detection (RGB-T SOD) aims to pinpoint prominent objects within aligned pairs of visible and thermal infrared images. Traditional encoder-decoder architectures, while designed for cross-modality feature interactions, may not adequately consider robustness against noise originating from defective modalities. Inspired by hierarchical human visual systems, we propose ConTriNet, a robust Confluent Triple-Flow Network employing a Divide-and-Conquer strategy. Specifically, ConTriNet comprises three flows: two modality-specific flows explore cues from the RGB and Thermal modalities, and a third modality-complementary flow integrates cues from both modalities. ConTriNet presents several notable advantages. It incorporates a Modality-induced Feature Modulator in the modality-shared union encoder to minimize inter-modality discrepancies and mitigate the impact of defective samples. Additionally, a foundational Residual Atrous Spatial Pyramid Module in the separated flows enlarges the receptive field, allowing multi-scale contextual information to be captured. Furthermore, a Modality-aware Dynamic Aggregation Module in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows. Leveraging the proposed parallel triple-flow framework, we further refine the saliency maps derived from the different flows through a flow-cooperative fusion strategy, yielding a high-quality, full-resolution saliency map for the final prediction. To evaluate the robustness and stability of our approach, we collect a comprehensive RGB-T SOD benchmark, VT-IMAG, covering various challenging real-world scenarios. Extensive experiments on public benchmarks and our VT-IMAG dataset demonstrate that ConTriNet consistently outperforms state-of-the-art competitors in both common and challenging scenarios.
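To make the idea of dynamically aggregating cues from the two modality-specific flows more concrete, the following is a minimal sketch in plain Python. It is not the paper's Modality-aware Dynamic Aggregation Module, whose internals the abstract does not describe. It assumes a simple per-pixel softmax gate, and the names `softmax_pair`, `dynamic_aggregate`, and the score maps are hypothetical and introduced only for illustration.

```python
import math


def softmax_pair(a, b):
    """Numerically stable softmax over two scalar scores."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    total = ea + eb
    return ea / total, eb / total


def dynamic_aggregate(rgb_feat, thermal_feat, rgb_score, thermal_score):
    """Fuse two equally sized 2-D maps with per-pixel gating weights.

    Assumption (not from the paper): each flow supplies a score map that
    expresses how reliable its feature is at every pixel. The weights are
    the softmax of the two scores, so a low-scoring (e.g. defective)
    modality contributes little to the fused value.
    """
    fused = []
    for r_row, t_row, rs_row, ts_row in zip(
        rgb_feat, thermal_feat, rgb_score, thermal_score
    ):
        row = []
        for r, t, rs, ts in zip(r_row, t_row, rs_row, ts_row):
            w_rgb, w_thermal = softmax_pair(rs, ts)
            row.append(w_rgb * r + w_thermal * t)
        fused.append(row)
    return fused


# Toy 1x2 maps: equal scores at the first pixel, and an RGB score that is
# far lower than the thermal score at the second pixel.
example = dynamic_aggregate(
    rgb_feat=[[1.0, 0.0]],
    thermal_feat=[[3.0, 4.0]],
    rgb_score=[[0.0, -50.0]],
    thermal_score=[[0.0, 50.0]],
)
```

With equal scores the fused value is the plain average of the two features. When one score is far lower, the fused value follows the other modality almost exactly, which is the behaviour one would want when a modality is defective. In a real network the scores would be learned from the features themselves rather than supplied by hand.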

Cross-Collaborative Fusion-Encoder Network for Robust RGB-Thermal Salient Object Detection

Cross-Modal Fusion and Progressive Decoding Network for RGB-D Salient Object Detection

CGFNet: Cross-Guided Fusion Network for RGB-T Salient Object Detection

CAFCNet: Cross-modality asymmetric feature complement network for RGB-T salient object detection

CFRNet: Cross-Attention-Based Fusion and Refinement Network for Enhanced RGB-T Salient Object Detection

Unified Information Fusion Network for Multi-Modal RGB-D and RGB-T Salient Object Detection

CFIDNet: cascaded feature interaction decoder for RGB-D salient object detection

Cross-Modality Double Bidirectional Interaction and Fusion Network for RGB-T Salient Object Detection

Interactive Context-Aware Network for RGB-T Salient Object Detection

An adaptive guidance fusion network for RGB-D salient object detection

Efficient Context-Guided Stacked Refinement Network for RGB-T Salient Object Detection

CIR-Net: Cross-Modality Interaction and Refinement for RGB-D Salient Object Detection

C$^{2}$DFNet: Criss-Cross Dynamic Filter Network for RGB-D Salient Object Detection

HFENet: Hybrid feature encoder network for detecting salient objects in RGB-thermal images

Lightweight Cross-Modal Information Mutual Reinforcement Network for RGB-T Salient Object Detection

Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection

MSEDNet: Multi-scale fusion and edge-supervised network for RGB-T salient object detection

Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection

Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation

Modality-Induced Transfer-Fusion Network for RGB-D and RGB-T Salient Object Detection

Real-Time One-Stream Semantic-Guided Refinement Network for RGB-Thermal Salient Object Detection