Abstract: Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of VSSMs' adversarial robustness, we conduct a frequency-based analysis of adversarial attacks, evaluating their performance against low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research. Our code and models will be available at https://github.com/HashmatShadab/MambaRobustness.

What problem does this paper attempt to address?

This paper evaluates the robustness of Vision State Space Models (VSSMs) to natural and adversarial perturbations. Specifically, it examines the performance of VSSMs in the following aspects:

1. **Information loss and occlusion**:
   - The paper evaluates how VSSMs handle information loss along the scanning direction, severe occlusion, and random patch dropping.
   - These experiments help to show whether VSSMs maintain their performance under partial information loss and how robust they are to different types of occlusion.
2. **Common perturbations**:
   - The researchers evaluate how VSSMs cope with common image perturbations (such as noise, blurring, and weather changes).
   - The experiments cover both global perturbations and fine-grained perturbations (such as object-property editing and background manipulation) to simulate a range of real-world situations.
3. **Adversarial attacks**:
   - The paper analyzes the robustness of VSSMs to adversarial attacks in white-box and black-box settings.
   - A frequency analysis studies the resistance of VSSMs to low-frequency and high-frequency adversarial perturbations.
4. **Comparison with existing models**:
   - The paper compares VSSMs with existing Convolutional Neural Networks (CNNs) and Transformers to evaluate their relative advantages and limitations.
   - The experiments cover classification, detection, and segmentation tasks and use multiple benchmark datasets to make the evaluation comprehensive and reliable.
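To make the random patch dropping experiment mentioned above concrete, here is a minimal sketch in plain Python. It is an illustration only, not the authors' implementation: the function name `drop_random_patches`, the zero fill value, the square non-overlapping patch grid, and the fixed seed are all assumptions chosen for this example.

```python
import random


def drop_random_patches(image, patch_size, drop_ratio, seed=0):
    """Zero out a random subset of non-overlapping square patches.

    `image` is a 2D list (height x width). The zero fill value and the
    patch grid are assumptions of this sketch, not details from the paper.
    Returns the perturbed copy and the list of dropped (row, col) patch indices.
    """
    height, width = len(image), len(image[0])
    rows, cols = height // patch_size, width // patch_size
    patches = [(r, c) for r in range(rows) for c in range(cols)]

    # Choose which patches to remove; a seeded generator keeps the demo reproducible.
    rng = random.Random(seed)
    num_dropped = int(round(drop_ratio * len(patches)))
    dropped = rng.sample(patches, num_dropped)

    # Copy the image so the caller's data is left untouched.
    out = [row[:] for row in image]
    for r, c in dropped:
        for i in range(r * patch_size, (r + 1) * patch_size):
            for j in range(c * patch_size, (c + 1) * patch_size):
                out[i][j] = 0
    return out, dropped


# Hypothetical 8x8 "image" of ones, split into sixteen 2x2 patches.
image = [[1] * 8 for _ in range(8)]
perturbed, dropped = drop_random_patches(image, patch_size=2, drop_ratio=0.5)
remaining = sum(sum(row) for row in perturbed)
```

Sweeping `drop_ratio` from 0 to 1 and recording a model's accuracy on the perturbed inputs would give the kind of information-loss curve that such an evaluation compares across architectures.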
### Formula representation

The State Space Models (SSMs) used in the paper can be written as follows.

Continuous-time state space model:

\[ h'(t) = A h(t) + B x(t), \quad y(t) = C h(t) \]

where \( A \in \mathbb{R}^{N \times N} \), \( B \in \mathbb{R}^{N \times 1} \), and \( C \in \mathbb{R}^{1 \times N} \) are continuous parameters that control the dynamics and the output mapping.

Discretized form:

\[ h_t = \bar{A} h_{t - 1} + \bar{B} x_t, \quad y_t = C h_t \]

where \( \bar{A} \) and \( \bar{B} \) denote the discretized counterparts of \( A \) and \( B \).

In addition, the recurrence can be unrolled into a global convolution, which accelerates the computation:

\[ y = x \circledast \bar{K}, \quad \bar{K} = (C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{L - 1}\bar{B}) \]

where \( \bar{K} \in \mathbb{R}^L \) is the convolution kernel, \( L \) is the length of the input sequence, and \(\circledast\) denotes the convolution operator.

### Summary

The main purpose of this paper is to systematically evaluate the robustness of VSSMs under a range of perturbation conditions, and so to give insight into how reliable and applicable these models are in practice. By comparing VSSMs with existing models, the researchers identify their advantages and disadvantages, providing a reference for future research.
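The equivalence between the discretized recurrence and the global convolution in the formula section can be checked numerically. The sketch below is a toy illustration, not code from the paper: it assumes a diagonal state matrix (so the state update is elementwise), a scalar input sequence, a zero initial state, and hand-picked parameter values.

```python
def ssm_recurrent(a, b, c, xs):
    """Run h_t = A h_{t-1} + B x_t, y_t = C h_t with a diagonal A.

    `a`, `b`, `c` are lists of length N holding the diagonal of A and the
    entries of B and C. The initial state is assumed to be zero.
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h = [a_n * h_n + b_n * x for a_n, h_n, b_n in zip(a, h, b)]
        ys.append(sum(c_n * h_n for c_n, h_n in zip(c, h)))
    return ys


def ssm_kernel(a, b, c, length):
    """Build K = (CB, CAB, ..., CA^{L-1}B) for a diagonal A."""
    return [
        sum(c_n * (a_n ** k) * b_n for a_n, b_n, c_n in zip(a, b, c))
        for k in range(length)
    ]


def causal_conv(xs, kernel):
    """Compute y_t = sum_{k=0}^{t} K_k x_{t-k} (causal convolution)."""
    return [
        sum(kernel[k] * xs[t - k] for k in range(t + 1))
        for t in range(len(xs))
    ]


# Toy parameters (illustrative values only): state size N = 2, length L = 5.
a = [0.9, 0.5]
b = [1.0, 0.5]
c = [0.3, -0.2]
xs = [1.0, 2.0, -1.0, 0.5, 3.0]

y_rec = ssm_recurrent(a, b, c, xs)
y_conv = causal_conv(xs, ssm_kernel(a, b, c, len(xs)))
max_diff = max(abs(p - q) for p, q in zip(y_rec, y_conv))
```

Both paths produce the same outputs up to floating-point error. The recurrent form processes one step at a time, while the convolutional form computes the kernel once and can then handle all positions in parallel.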

Towards Evaluating the Robustness of Visual State Space Models

Understanding Robustness of Visual State Space Models for Image Classification

Exploring Robustness of Visual State Space model against Backdoor Attacks

On Evaluating Adversarial Robustness of Volumetric Medical Segmentation Models

Robustness Analysis on Foundational Segmentation Models

Large-scale Robustness Analysis of Video Action Recognition Models

Strengthening Robustness Under Adversarial Attacks Using Brain Visual Codes

Robustness Analysis of Video-Language Models Against Visual and Language Perturbations

Assessing the Robustness of Visual Question Answering Models

Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

BadScan: An Architectural Backdoor Attack on Visual State Space Models

Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem

On Inherent Adversarial Robustness of Active Vision Systems

VSSD: Vision Mamba with Non-Causal State Space Duality

Visual Robustness Benchmark for Visual Question Answering (VQA)

GroupMamba: Parameter-Efficient and Accurate Group Visual State Space Model

Exploring the Adversarial Robustness of Video Object Segmentation Via One-shot Adversarial Attacks

Impact of Architectural Modifications on Deep Learning Adversarial Robustness

Quantifying the robustness of deep multispectral segmentation models against natural perturbations and data poisoning

Towards Robustness against Unsuspicious Adversarial Examples