Abstract: Vision mambas have demonstrated strong performance with linear complexity in the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten the patches into 1D sequences for causal processing, which ignores the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequentially processing local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using the Discrete Cosine Transform (DCT) and then arrange the pixels by their corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of our GlobalMamba, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K.
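To make the flattening issue concrete, here is a minimal sketch of how row-major flattening treats neighbouring patches. The 14×14 patch grid (for example, a 224×224 image cut into 16×16 patches) and the helper name `flat_index` are illustrative assumptions, not values taken from the paper.

```python
def flat_index(row: int, col: int, grid_width: int) -> int:
    """Position of patch (row, col) after row-major flattening."""
    return row * grid_width + col


grid_width = 14  # assumed number of patches per row (hypothetical)

anchor = flat_index(3, 5, grid_width)
right_neighbour = flat_index(3, 6, grid_width)
lower_neighbour = flat_index(4, 5, grid_width)

# Both neighbours touch the anchor patch in 2D, but their sequence gaps differ.
gap_right = right_neighbour - anchor
gap_below = lower_neighbour - anchor
print(gap_right, gap_below)  # 1 14
```

The horizontally adjacent patch stays next to the anchor in the sequence, while the vertically adjacent patch ends up a full row away, which is the loss of 2D structure the abstract refers to.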

What problem does this paper attempt to address?

This paper aims to solve two main problems in current Vision Mamba models when processing images:

1. **Balance between local and global information**: Existing methods usually adopt patch-based image tokenization and flatten the patches into a one-dimensional sequence for causal processing. This ignores the inherent two-dimensional structural correlations in the image and makes global information hard to extract, so each image token contains only local information and fails to capture global features.
2. **Lack of causality**: When image tokens are flattened row by row or column by column, the inherent causal order in the image is destroyed. Adjacent regions usually encode similar visual information in the spatial domain, while regions farther apart may differ significantly. Direct token flattening therefore cannot provide an appropriate image modeling order, which limits the effectiveness of the Mamba architecture.

To solve these problems, the authors propose **GlobalMamba**, an improved Vision Mamba model based on global image serialization. The method proceeds in the following steps:

- **Frequency-domain transformation**: First, transform the image from the spatial domain to the frequency domain using the discrete cosine transform (DCT):
  \[ F(u, v)=\alpha(u) \alpha(v) \sum_{i = 0}^{h - 1} \sum_{j = 0}^{w - 1} x(i, j)\cos\left(\frac{(2i + 1)u\pi}{2h}\right)\cos\left(\frac{(2j + 1)v\pi}{2w}\right) \]
  where $\alpha(u)$ and $\alpha(v)$ are scaling factors. For the orthonormal DCT, $\alpha(u)$ is defined as
  \[ \alpha(u)=\begin{cases} \sqrt{\frac{1}{h}} & \text{if } u = 0\\ \sqrt{\frac{2}{h}} & \text{otherwise} \end{cases} \]
  and $\alpha(v)$ is defined analogously with $w$ in place of $h$.
- **Frequency segmentation**: Divide the frequency spectrum into multiple intervals, from low frequency to high frequency. The pixels in each interval are rearranged so that global information is retained.
  Specifically, the distance from the $k$-th segmentation point to the upper-left corner is $\frac{1}{2^{K-k}}$ of the entire diagonal length, where $K$ is the number of intervals.
- **Inverse transformation and down-sampling**: Project each segmented frequency-spectrum representation back to the spatial domain through the inverse discrete cosine transform (IDCT) to obtain a series of images with different frequency ranges. Then perform spatial down-sampling according to the frequency range of each sample:
  \[ x'_k = G\left(x_k,\frac{h}{2^{K-k}},\frac{w}{2^{K-k}}\right) \]
  where $G(\cdot)$ is the down-sampling interpolation function.
- **Global tokenization**: Finally, divide these image samples into patches through a lightweight CNN and a linear module, and organize the extracted tokens into a causally ordered sequence according to the frequency order.

With this design, GlobalMamba can better exploit the causal relationships in the image while capturing global features. According to the experiments, GlobalMamba improves over its baselines on ImageNet-1K image classification, COCO object detection, and ADE20K semantic segmentation.
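The serialization steps above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the $k$-th band boundary sits at $1/2^{K-k}$ of the diagonal, it treats each band as a square region of the spectrum anchored at the upper-left corner, it uses average pooling as a stand-in for the interpolation function $G$, and it omits the CNN-based tokenizer. All function names (`dct_matrix`, `band_masks`, `average_pool`, `serialize`) are hypothetical.

```python
import numpy as np


def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis: C[u, i] = alpha(u) * cos((2i + 1) * u * pi / (2n))."""
    i = np.arange(n)[None, :]        # spatial index, one column per sample
    u = np.arange(n)[:, None]        # frequency index, one row per frequency
    alpha = np.full((n, 1), np.sqrt(2.0 / n))
    alpha[0, 0] = np.sqrt(1.0 / n)   # the u = 0 row uses the smaller factor
    return alpha * np.cos((2 * i + 1) * u * np.pi / (2 * n))


def dct2(x: np.ndarray) -> np.ndarray:
    """2D DCT of an (h, w) image: F = C_h x C_w^T."""
    h, w = x.shape
    return dct_matrix(h) @ x @ dct_matrix(w).T


def idct2(f: np.ndarray) -> np.ndarray:
    """Inverse 2D DCT; the basis is orthonormal, so the inverse is the transpose."""
    h, w = f.shape
    return dct_matrix(h).T @ f @ dct_matrix(w)


def band_masks(h: int, w: int, num_bands: int) -> list:
    """Split the spectrum into disjoint bands, low to high frequency.

    Assumption: band k covers normalized frequencies in
    [1 / 2**(K - k + 1), 1 / 2**(K - k)), with the first band starting at 0.
    """
    u = np.arange(h)[:, None] / h
    v = np.arange(w)[None, :] / w
    dist = np.maximum(u, v)          # square bands (an assumption)
    masks, lower = [], 0.0
    for k in range(1, num_bands + 1):
        upper = 1.0 / 2 ** (num_bands - k)
        masks.append((dist >= lower) & (dist < upper))
        lower = upper
    return masks


def average_pool(x: np.ndarray, factor: int) -> np.ndarray:
    """Stand-in for G: shrink both sides by `factor` via block averaging."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def serialize(x: np.ndarray, num_bands: int):
    """Return the per-band images x_k and their down-sampled versions x'_k."""
    h, w = x.shape
    spectrum = dct2(x)
    bands = [idct2(spectrum * m) for m in band_masks(h, w, num_bands)]
    sequence = [
        average_pool(b, 2 ** (num_bands - k)) for k, b in enumerate(bands, start=1)
    ]
    return bands, sequence


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))   # height and width divisible by 2**(K - 1)
bands, sequence = serialize(image, num_bands=3)
print([s.shape for s in sequence])      # [(8, 8), (16, 16), (32, 32)]
```

Because the masks partition the spectrum and the DCT is linear, the band images sum back to the original image. In the ordered sequence, the lowest-frequency sample is the smallest and carries the coarsest global view, and later samples add progressively finer detail.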

GlobalMamba: Global Image Serialization for Vision Mamba

V2M: Visual 2-Dimensional Mamba for Image Representation Learning

Mamba-R: Vision Mamba ALSO Needs Registers

LoG-VMamba: Local-Global Vision Mamba for Medical Image Segmentation

VMamba: Visual State Space Model

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

2DMamba: Efficient State Space Model for Image Representation with Applications on Giga-Pixel Whole Slide Image Classification

LocalMamba: Visual State Space Model with Windowed Selective Scan

Mamba in Vision: A Comprehensive Survey of Techniques and Applications

QuadMamba: Learning Quadtree-based Selective Scan for Visual State Space Model

MHS-VM: Multi-Head Scanning in Parallel Subspaces for Vision Mamba

Why mamba is effective? Exploit Linear Transformer-Mamba Network for Multi-Modality Image Fusion

GraphMamba: An Efficient Graph Structure Learning Vision Mamba for Hyperspectral Image Classification

MambaVC: Learned Visual Compression with Selective State Spaces

Spatial-Mamba: Effective Visual State Space Models via Structure-Aware State Fusion

Mamba-UNet: UNet-Like Pure Visual Mamba for Medical Image Segmentation

Multi-Scale VMamba: Hierarchy in Hierarchy Visual State Space Model

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Mamba-in-Mamba: Centralized Mamba-Cross-Scan in Tokenized Mamba Model for Hyperspectral Image Classification