MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection

Tianxiang Chen, Zi Ye, Zhentao Tan, Tao Gong, Yue Wu, Qi Chu, Bin Liu, Nenghai Yu, Jieping Ye
2024-06-24
Abstract: Recently, infrared small target detection (ISTD) has made significant progress, thanks to the development of basic models. Specifically, models combining CNNs with transformers can successfully extract both local and global features. However, the disadvantage of the transformer is also inherited, i.e., computational complexity that is quadratic in sequence length. Inspired by Mamba, a recent basic model with linear complexity for long-distance modeling, we explore the potential of this state space model for the ISTD task in terms of effectiveness and efficiency in this paper. However, directly applying Mamba achieves suboptimal performance due to insufficient harnessing of local features, which are imperative for detecting small targets. Instead, we tailor a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient ISTD. It consists of Outer and Inner Mamba blocks to adeptly capture both global and local features. Specifically, we treat the local patches as "visual sentences" and use the Outer Mamba to explore the global information. We then decompose each visual sentence into sub-patches as "visual words" and use the Inner Mamba to further explore the local information among words in the visual sentence with negligible computational cost. By aggregating the visual word and visual sentence features, our MiM-ISTD can effectively explore both global and local information. Experiments on NUAA-SIRST and IRSTD-1k show the superior accuracy and efficiency of our method. Specifically, MiM-ISTD is $8 \times$ faster than the SOTA method and reduces GPU memory usage by 62.2$\%$ when testing on $2048 \times 2048$ images, overcoming the computation and memory constraints on high-resolution infrared images.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses infrared small target detection (ISTD), a binary segmentation task widely used in remote sensing and military tracking systems. Existing approaches fall into traditional algorithms and deep learning methods. Among the latter, convolutional neural networks (CNNs) improve performance but capture global information poorly, so small targets are easily missed. Methods that combine CNNs with transformers handle long-range dependencies, but their computational complexity is quadratic in sequence length. The paper proposes a new model structure, Mamba-in-Mamba (MiM-ISTD), for effective and efficient ISTD. MiM-ISTD nests two kinds of Mamba block, Outer and Inner, to capture both global and local features. The image is divided into local patches, called "visual sentences", and each visual sentence is further decomposed into sub-patches, called "visual words". The Outer Mamba block models global information across visual sentences, while the Inner Mamba block models local information among the visual words within each sentence at negligible additional computational cost. The word-level and sentence-level features are then aggregated. Experiments on the NUAA-SIRST and IRSTD-1k datasets show superior accuracy and efficiency: when testing on 2048 × 2048 images, MiM-ISTD is 8 times faster than the SOTA method and uses 62.2% less GPU memory. In summary, the paper adapts the linear-complexity Mamba model to ISTD, improving small target detection accuracy while reducing computation and memory consumption.
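The two-level decomposition described above can be illustrated with a minimal NumPy sketch. It only shows how an image might be split into visual sentences and then into visual words; it does not implement the Mamba blocks or the authors' actual pipeline. The function names, the 64 × 64 input, and the patch sizes (16 for sentences, 4 for words) are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np


def split_into_patches(image, patch):
    """Split an (H, W) array into non-overlapping (patch, patch) tiles.

    Returns an array of shape (num_tiles, patch, patch), with tiles
    ordered row by row. Hypothetical helper, not from the paper.
    """
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by patch size")
    # (H, W) -> (rows, patch, cols, patch) -> (rows, cols, patch, patch)
    tiles = image.reshape(h // patch, patch, w // patch, patch)
    tiles = tiles.transpose(0, 2, 1, 3)
    return tiles.reshape(-1, patch, patch)


def sentences_and_words(image, sentence_size=16, word_size=4):
    """Decompose an image into visual sentences and visual words.

    sentence_size and word_size are illustrative assumptions.
    """
    # Level 1: local patches ("visual sentences"), shape (N, S, S)
    sentences = split_into_patches(image, sentence_size)
    # Level 2: sub-patches of each sentence ("visual words"), shape (N, M, w, w)
    words = np.stack([split_into_patches(s, word_size) for s in sentences])
    return sentences, words


image = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)
sentences, words = sentences_and_words(image)
print(sentences.shape)  # (16, 16, 16): 16 sentences of 16 x 16 pixels
print(words.shape)      # (16, 16, 4, 4): 16 words of 4 x 4 pixels per sentence
```

In this sketch, a global model would operate on the sequence of 16 sentences, while a local model would operate on the 16 words inside each sentence. Because the inner sequences are short and fixed in length, the extra work per sentence stays small, which is consistent with the summary's point about low additional cost.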