Abstract: Alignment-free RGB-Thermal (RGB-T) salient object detection (SOD) aims to achieve robust performance in complex scenes by directly leveraging the complementary information from unaligned visible-thermal image pairs, without requiring manual alignment. However, the labor-intensive process of collecting and annotating image pairs limits the scale of existing benchmarks, hindering the advancement of alignment-free RGB-T SOD. In this paper, we construct a large-scale and high-diversity unaligned RGB-T SOD dataset named UVT20K, comprising 20,000 image pairs, 407 scenes, and 1,256 object categories. All samples are collected from real-world scenarios with various challenges, such as low illumination, image clutter, complex salient objects, and so on. To support the exploration for further research, each sample in UVT20K is annotated with a comprehensive set of ground truths, including saliency masks, scribbles, boundaries, and challenge attributes. In addition, we propose a Progressive Correlation Network (PCNet), which models inter- and intra-modal correlations on the basis of explicit alignment to achieve accurate predictions in unaligned image pairs. Extensive experiments conducted on unaligned and aligned datasets demonstrate the effectiveness of our method. Code and dataset are available at https://github.com/Angknpng/PCNet.

What problem does this paper attempt to address?

The problem this paper addresses is how to achieve robust salient object detection (SOD) in complex scenes by directly exploiting the complementary information in unaligned visible-thermal (RGB-T) image pairs, without manual alignment. Most existing methods rely on aligned multi-modal datasets, which limits their performance and deployment in practical applications. In addition, existing unaligned datasets are small and lack training sets, so they cannot fully support model learning and evaluation. To address these challenges, the authors propose the following solutions:

1. **Construct a large-scale unaligned RGB-T SOD dataset**:
   - UVT20K is a large-scale, highly diverse unaligned RGB-T SOD dataset containing 20,000 image pairs, 407 scenes, and 1,256 object categories.
   - The dataset covers various real-world challenges, such as low illumination, image clutter, and complex salient objects.
   - Each sample is annotated with comprehensive ground truths, including saliency masks, scribbles, boundaries, and challenge attributes.
2. **Propose a Progressive Correlation Network (PCNet)**:
   - PCNet handles unaligned RGB-T image pairs by first aligning them explicitly and then progressively modeling inter-modal and intra-modal correlations.
   - A Semantic-Guided Homography Estimation (SHE) module explicitly aligns the common areas between the RGB and thermal images.
   - An Inter- and Intra-Modal Correlation (IIMC) module progressively models the correlations of salient regions.

### Formula Summary

1. **Homography Matrix Estimation**:
   \[ H = \mathcal{H}(\Psi(\Phi(I_{rgb}), \Phi(I_t))) \]
   where \(\Phi(\cdot)\) is the feature encoder, \(\Psi(\cdot)\) is the correlation calculation function, and \(\mathcal{H}(\cdot)\) is the homography estimator that outputs the homography matrix \(H\).
2. **S-Adapter Fusion**:
   \[ bF^l = F^l + S\text{-}Adapter^l(F^l, f_s) \]
   where \(bF^l\) is the adapted feature at the \(l\)-th layer, \(F^l\) is the original feature, and \(f_s\) is the semantic information.
3. **Specific Formula of S-Adapter**:
   \[ S\text{-}Adapter^l(F^l, f_s) = \phi(G(F^l W_{dn}, f_s W_{dn})) W_{up} \]
   \[ G(X, Y) = X \odot \sigma(CAP(Y)) \]
   where \(W_{dn}\) and \(W_{up}\) are projection operations, \(\odot\) is element-wise multiplication, \(\sigma\) is the Sigmoid function, and \(CAP(\cdot)\) denotes channel-average pooling.
4. **Inter-Modal Correlation Modeling**:
   \[ f_{inter}^i = C(f_{rgb}^i \odot M(I_t') \odot f_s, f_t^i) \]
   \[ C(Q, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right)V + Q \]
   where \(d_k\) is the key dimension and \(K\) is presumably the key computed from the same input as \(V\).
5. **Intra-Modal Correlation Modeling**:
   \[ f_{intra}^i = C(f_{rgb}^i + f_{inter}^i, f_{rgb}^i + f_{inter}^i) \]

With these contributions, the paper addresses salient object detection on unaligned RGB-T image pairs and provides a dataset and a baseline method for future research.
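To make the S-Adapter and correlation formulas in the summary above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes features are flattened to token matrices of shape (tokens, channels), that \(\phi\) is a ReLU, that channel-average pooling averages over the channel axis, that \(K = V\) with no learned projections, and that \(M(I_t')\) is a per-token mask. The function names `gate`, `s_adapter`, `adapt`, `correlate`, and `inter_intra` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gate(X, Y):
    """G(X, Y) = X * sigmoid(CAP(Y)); CAP is assumed to average over channels."""
    cap = Y.mean(axis=-1, keepdims=True)  # shape (tokens, 1)
    return X * sigmoid(cap)


def s_adapter(F, f_s, W_dn, W_up):
    """S-Adapter(F, f_s) = phi(G(F W_dn, f_s W_dn)) W_up, with phi assumed to be ReLU."""
    gated = gate(F @ W_dn, f_s @ W_dn)
    return np.maximum(gated, 0.0) @ W_up


def adapt(F, f_s, W_dn, W_up):
    """Residual fusion: adapted feature = F + S-Adapter(F, f_s)."""
    return F + s_adapter(F, f_s, W_dn, W_up)


def correlate(Q, V):
    """C(Q, V) = softmax(Q K^T / sqrt(d_k)) V + Q, assuming K = V."""
    K = V
    d_k = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return attn @ V + Q


def inter_intra(f_rgb, f_t, mask, f_s):
    """Inter-modal correlation followed by intra-modal correlation."""
    f_inter = correlate(f_rgb * mask * f_s, f_t)
    fused = f_rgb + f_inter
    f_intra = correlate(fused, fused)
    return f_inter, f_intra


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_rgb, n_t, c, r = 6, 9, 16, 4            # illustrative sizes only
    F = rng.standard_normal((n_rgb, c))
    f_s = rng.standard_normal((n_rgb, c))
    W_dn = rng.standard_normal((c, r))        # down-projection
    W_up = rng.standard_normal((r, c))        # up-projection
    f_rgb = adapt(F, f_s, W_dn, W_up)
    f_t = rng.standard_normal((n_t, c))
    mask = np.ones((n_rgb, 1))                # stand-in for M(I_t')
    f_inter, f_intra = inter_intra(f_rgb, f_t, mask, f_s)
    print(f_inter.shape, f_intra.shape)
```

Because the query and value token counts may differ, the RGB and thermal features do not need the same spatial size in this sketch; the residual `+ Q` keeps the output aligned with the query tokens.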

Alignment-Free RGB-T Salient Object Detection: A Large-scale Dataset and Progressive Correlation Network

Alignment-Free RGBT Salient Object Detection: Semantics-guided Asymmetric Correlation Network and A Unified Benchmark

Position-Aware Relation Learning for RGB-Thermal Salient Object Detection

Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection

Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection

ADNet: An Asymmetric Dual-Stream Network for RGB-T Salient Object Detection.

Multi-interactive Dual-decoder for RGB-thermal Salient Object Detection

Salient Object Detection in RGB-D Videos

ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection

SIA: RGB-T Salient Object Detection Network with Salient-Illumination Awareness

Salient Object Detection Based on Visual Perceptual Saturation and Two-Stream Hybrid Networks.

RGB-D Salient Object Detection with Ubiquitous Target Awareness

TANet: Transformer-based Asymmetric Network for RGB-D Salient Object Detection

A Unified RGB-T Saliency Detection Benchmark: Dataset, Baselines, Analysis and A Novel Approach

Efficient Context-Guided Stacked Refinement Network for RGB-T Salient Object Detection

Real-Time One-Stream Semantic-Guided Refinement Network for RGB-Thermal Salient Object Detection

Densely Deformable Efficient Salient Object Detection Network

Interactive Context-Aware Network for RGB-T Salient Object Detection

A Unified Structure for Efficient RGB and RGB-D Salient Object Detection

RGB-D salient object detection: A survey