Abstract: Visual attention modeling, important for interpreting and prioritizing visual stimuli, plays a significant role in applications such as marketing, multimedia, and robotics. Traditional saliency prediction models, especially those based on Convolutional Neural Networks (CNNs) or Transformers, achieve notable success by leveraging large-scale annotated datasets. However, the current state-of-the-art (SOTA) models that use Transformers are computationally expensive. Additionally, separate models are often required for each image type, lacking a unified approach. In this paper, we propose Saliency Unification through Mamba (SUM), a novel approach that integrates the efficient long-range dependency modeling of Mamba with U-Net to provide a unified model for diverse image types. Using a novel Conditional Visual State Space (C-VSS) block, SUM dynamically adapts to various image types, including natural scenes, web pages, and commercial imagery, ensuring universal applicability across different data types. Our comprehensive evaluations across five benchmarks demonstrate that SUM seamlessly adapts to different visual characteristics and consistently outperforms existing models. These results position SUM as a versatile and powerful tool for advancing visual attention modeling, offering a robust solution universally applicable across different types of visual content.

What problem does this paper attempt to address?

This paper addresses three key problems in current visual saliency prediction models:

1. **High computational cost**: State-of-the-art (SOTA) Transformer-based models are computationally expensive for visual saliency prediction because the cost of their self-attention mechanism grows quadratically with image size. This makes them difficult to apply to dense prediction tasks.
2. **Lack of unity**: Current saliency prediction models are usually designed for a specific type of image, such as natural scenes, web pages, or commercial images. Each image type requires its own model, and no unified model adapts to all of them. This limits the generality and application scope of the models.
3. **Insufficient adaptability**: Different image types (such as natural scenes, e-commerce images, and user interfaces) have different saliency characteristics. Existing models do not adapt well to these differences, so their performance is inconsistent across datasets.

To solve these problems, the paper proposes a new model, **Saliency Unification through Mamba (SUM)**. SUM combines the efficient long-range dependency modeling of Mamba with the U-Net architecture to provide a single model that adapts to multiple image types. Specifically, SUM contributes the following:

- **Linear computational complexity**: SUM uses the linear computational complexity of Mamba to capture long-range information at a lower computational cost.
- **Conditional Visual State Space (C-VSS) block**: This block dynamically adjusts the model's behavior to the characteristics of each image type, improving adaptability and generalization.
- **Extensive evaluation**: The authors evaluate SUM on five benchmark datasets and report that it consistently outperforms existing models across different kinds of visual content.

Through these improvements, SUM not only improves the accuracy of saliency prediction but also provides a general solution that maintains high performance across image types.
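The gap between quadratic and linear cost described above can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the patch size of 16, the helper names, and the toy scalar recurrence are all assumptions. It counts the token pairs that full self-attention would score and contrasts that with a recurrence that updates its state once per token, which is the general idea behind state space scans such as Mamba's. Mamba's actual scan is selective and operates on multi-channel features, so this is a simplified stand-in.

```python
def token_count(height, width, patch=16):
    """Number of non-overlapping patch tokens (patch size 16 is an assumption)."""
    return (height // patch) * (width // patch)


def attention_pairs(n_tokens):
    """Full self-attention scores every token against every token: N * N pairs."""
    return n_tokens * n_tokens


def linear_scan(xs, a=0.9, b=0.1):
    """Toy scalar recurrence h_t = a * h_{t-1} + b * x_t.

    One state update per token, so the work grows linearly with sequence
    length. The coefficients a and b are hypothetical fixed values; a
    selective state space model would make them depend on the input.
    """
    h = 0.0
    outputs = []
    for x in xs:
        h = a * h + b * x
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    for side in (256, 512, 1024):
        n = token_count(side, side)
        scan_steps = len(linear_scan([1.0] * n))
        print(f"{side}x{side}: tokens={n}, "
              f"attention pairs={attention_pairs(n)}, scan steps={scan_steps}")
```

Under these assumptions, doubling the image side quadruples the token count and the number of scan steps, while the number of attention pairs grows sixteenfold.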

SUM: Saliency Unification through Mamba for Visual Attention Modeling
