SUM: Saliency Unification through Mamba for Visual Attention Modeling

Alireza Hosseini, Amirhossein Kazerouni, Saeed Akhavan, Michael Brudno, Babak Taati
2024-06-25
Abstract: Visual attention modeling, important for interpreting and prioritizing visual stimuli, plays a significant role in applications such as marketing, multimedia, and robotics. Traditional saliency prediction models, especially those based on Convolutional Neural Networks (CNNs) or Transformers, achieve notable success by leveraging large-scale annotated datasets. However, the current state-of-the-art (SOTA) models that use Transformers are computationally expensive. Additionally, separate models are often required for each image type, lacking a unified approach. In this paper, we propose Saliency Unification through Mamba (SUM), a novel approach that integrates the efficient long-range dependency modeling of Mamba with U-Net to provide a unified model for diverse image types. Using a novel Conditional Visual State Space (C-VSS) block, SUM dynamically adapts to various image types, including natural scenes, web pages, and commercial imagery, ensuring universal applicability across different data types. Our comprehensive evaluations across five benchmarks demonstrate that SUM seamlessly adapts to different visual characteristics and consistently outperforms existing models. These results position SUM as a versatile and powerful tool for advancing visual attention modeling, offering a robust solution universally applicable across different types of visual content.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses three problems in current visual saliency prediction models:

1. **High computational cost**: State-of-the-art (SOTA) Transformer-based models are expensive for visual saliency prediction because the cost of self-attention grows quadratically with image size (that is, with the number of image tokens). This makes them hard to apply to dense prediction tasks.
2. **Lack of unity**: Current saliency prediction models are usually designed for one type of image, such as natural scenes, web pages, or commercial images. Each image type needs its own model, and no unified model adapts to all of them, which limits how broadly the models can be applied.
3. **Insufficient adaptability**: Different image types (such as natural scenes, e-commerce images, and user interfaces) have different saliency characteristics. Existing models do not adapt well to these differences, so their performance is inconsistent across datasets.

To solve these problems, the paper proposes **Saliency Unification through Mamba (SUM)**. SUM combines the efficient long-range dependency modeling of Mamba with the U-Net architecture to provide a single model that adapts to multiple image types. Its main contributions are:

- **Linear computational complexity**: SUM uses the linear computational complexity of Mamba to capture long-range information at a lower computational cost.
- **Conditional Visual State Space (C-VSS) block**: This block dynamically adjusts the model's behavior to the characteristics of each image type, improving adaptability and generalization.
- **Extensive evaluation**: SUM is evaluated on five benchmark datasets, where it adapts to different kinds of visual content and consistently outperforms existing models.

With these changes, SUM improves the accuracy of saliency prediction and offers a general solution that maintains high performance across image types.
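The contrast between quadratic self-attention and a linear-time scan can be made concrete with a small sketch. The code below is illustrative only and is not taken from the SUM implementation: the function names, the choice of four scan directions, and the toy scalar recurrence with coefficients `a` and `b` are all assumptions made for this example. It counts how much work each approach does on an H x W token grid and runs a one-pass recurrence of the kind that state space models build on.

```python
def attention_pair_count(height: int, width: int) -> int:
    """Token pairs scored by full self-attention over an H x W token grid."""
    tokens = height * width
    return tokens * tokens  # every token attends to every token


def scan_step_count(height: int, width: int, directions: int = 4) -> int:
    """Recurrence steps for a scan that visits each token once per direction.

    `directions=4` is an assumption for illustration, not a value from the paper.
    """
    return directions * height * width


def linear_scan(xs, a: float = 0.9, b: float = 0.1):
    """Toy scalar recurrence h_t = a * h_{t-1} + b * x_t, computed in one pass.

    Each output depends on all earlier inputs through the running state h,
    yet the loop does a constant amount of work per token.
    """
    h = 0.0
    outputs = []
    for x in xs:
        h = a * h + b * x
        outputs.append(h)
    return outputs


if __name__ == "__main__":
    # Doubling the grid side multiplies attention work by 16 but scan work by 4.
    for side in (32, 64, 128):
        print(side, attention_pair_count(side, side), scan_step_count(side, side))
```

In this toy setting, the running state `h` is what carries information from distant tokens forward, which is why a single pass can still model long-range dependencies.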