Unifying convolution and transformer: a dual stage network equipped with cross-interactive multi-modal feature fusion and edge guidance for RGB-D salient object detection
Shilpa Elsa Abraham,Binsu C. Kovoor,Abraham, Shilpa Elsa
DOI: https://doi.org/10.1007/s12652-024-04758-2
IF: 3.662
2024-03-03
Journal of Ambient Intelligence and Humanized Computing
Abstract:RGB-D salient object detection (SOD) has received immense research interest in the recent past due to its capability to handle complex and challenging image scenes. Despite substantial endeavors in this field, two notable challenges endure. The first pertains to the efficient extraction of saliency-relevant features from both modalities, necessitating a comprehensive understanding and capture of intricate features present in RGB and depth views. To tackle this, a novel approach is proposed, which encompasses convolutional neural network (CNN) and transformer architecture into a unified framework. This effectively captures spatial hierarchies and intricate dependencies, facilitating enhanced feature extraction and deeper contextual understanding. The second hurdle involves devising an optimal fusion strategy for the RGB and depth views, which is handled by a specialized cross-interactive multi-modal (CIMM) feature fusion module. Constructed with two stages of self-attention, this module generates a unified feature representation, effectively bridging the gap between the two modalities. Further, to emphasize the delineation of salient objects' boundaries, an edge enhancement module (EEM) is incorporated that enhances the visual distinctiveness of salient objects, thereby improving the overall quality and accuracy of salient object detection in complex scenes. Extensive experimental evaluations performed on seven benchmark datasets demonstrate that the proposed model performs favourably against CNN and transformer based state-of-the-art methods under standard saliency evaluation metric. Notably, on NJU-2K test set, it achieves an S-measure of 0.933, F-measure of 0.930, E-measure of 0.951 and a remarkably low mean absolute error of 0.025, underscoring the efficacy of the proposed model.
computer science, information systems,telecommunications, artificial intelligence