MambaST: A Plug-and-Play Cross-Spectral Spatial-Temporal Fuser for Efficient Pedestrian Detection

Xiangbo Gao, Asiegbu Miracle Kanu-Asiegbu, Xiaoxiao Du
2024-08-02
Abstract: This paper proposes MambaST, a plug-and-play cross-spectral spatial-temporal fusion pipeline for efficient pedestrian detection. Several challenges exist for pedestrian detection in autonomous driving applications. First, it is difficult to perform accurate detection using RGB cameras under dark or low-light conditions. Cross-spectral systems must be developed to integrate complementary information from multiple sensor modalities, such as thermal and visible cameras, to improve the robustness of the detections. Second, pedestrian detection models are latency-sensitive. Efficient and easy-to-scale detection models with fewer parameters are highly desirable for real-time applications such as autonomous driving. Third, pedestrian video data provides spatial-temporal correlations of pedestrian movement. It is beneficial to incorporate temporal as well as spatial information to enhance pedestrian detection. This work leverages recent advances in the state space model (Mamba) and proposes a novel Multi-head Hierarchical Patching and Aggregation (MHHPA) structure to extract both fine-grained and coarse-grained information from both RGB and thermal imagery. Experimental results show that the proposed MHHPA is an effective and efficient alternative to a Transformer model for cross-spectral pedestrian detection. Our proposed model also achieves superior performance on small-scale pedestrian detection. The code is available at https://github.com/XiangboGaoBarry/MambaST.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses three key challenges in pedestrian detection for autonomous driving applications:

1. **Accurate Detection under Low-light Conditions**: RGB cameras alone struggle to detect pedestrians accurately at night or in low light. Cross-spectral systems that combine complementary information from multiple sensors, such as thermal and visible-light cameras, are needed to make detection more robust.
2. **Efficiency of Real-time Detection**: Pedestrian detection models are latency-sensitive, especially in real-time scenarios such as autonomous driving. Efficient models that scale easily and have fewer parameters are needed to meet real-time requirements.
3. **Fusion of Spatio-temporal Information**: Pedestrian video data captures the spatio-temporal correlations of pedestrian movement, and using both spatial and temporal information can improve detection. Existing multi-modal fusion methods mainly focus on spatial fusion within a single frame and do not fully account for temporal information.

To address these challenges, the paper proposes MambaST, a cross-spectral spatio-temporal fusion pipeline based on the state space model (Mamba). It introduces the Multi-head Hierarchical Patching and Aggregation (MHHPA) structure to extract both fine-grained and coarse-grained information from RGB and thermal images. The reported experiments indicate that MHHPA is an effective and efficient alternative to a Transformer model for cross-spectral pedestrian detection, and that the model also achieves superior performance on small-scale pedestrian detection.
In summary, MambaST aims to improve pedestrian detection accuracy under low-light conditions, improve the model's real-time processing efficiency, and fuse spatio-temporal information effectively to strengthen detection.
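The summary above does not spell out how MHHPA works internally, so the following is only a toy sketch of the general idea of patching two modalities at several scales and aggregating the results. Everything in it is an assumption made for illustration: the function names (`patchify`, `unpatchify`, `mix_sequence`, `fuse_multiscale`), the patch sizes, the use of average pooling, and above all the sequence mixer, which is a simple causal running mean standing in for a real state space model. It is not the authors' implementation; see their repository for that.

```python
from typing import List, Sequence

Grid = List[List[float]]  # a single-channel H x W feature map


def patchify(grid: Grid, p: int) -> Grid:
    """Average-pool an H x W grid into (H/p) x (W/p) patches of size p x p."""
    h, w = len(grid), len(grid[0])
    assert h % p == 0 and w % p == 0, "grid size must be divisible by patch size"
    return [
        [
            sum(grid[i + di][j + dj] for di in range(p) for dj in range(p)) / (p * p)
            for j in range(0, w, p)
        ]
        for i in range(0, h, p)
    ]


def unpatchify(patches: Grid, p: int) -> Grid:
    """Nearest-neighbour upsample: repeat each patch value over a p x p block."""
    ph, pw = len(patches), len(patches[0])
    return [[patches[i // p][j // p] for j in range(pw * p)] for i in range(ph * p)]


def mix_sequence(tokens: List[float]) -> List[float]:
    """Stand-in for the sequence model (NOT Mamba): a causal running mean."""
    out, total = [], 0.0
    for k, t in enumerate(tokens, start=1):
        total += t
        out.append(total / k)
    return out


def fuse_multiscale(rgb: Grid, thermal: Grid,
                    patch_sizes: Sequence[int] = (1, 2, 4)) -> Grid:
    """Fuse two same-sized maps at several patch sizes and average the results."""
    h, w = len(rgb), len(rgb[0])
    fused = [[0.0] * w for _ in range(h)]
    for p in patch_sizes:
        r, t = patchify(rgb, p), patchify(thermal, p)
        ph, pw = len(r), len(r[0])
        n = ph * pw
        # One token sequence per scale: all RGB patches, then all thermal patches.
        mixed = mix_sequence([v for row in r for v in row] +
                             [v for row in t for v in row])
        # Average the two mixed halves position by position, then restore the grid.
        merged = [0.5 * (mixed[k] + mixed[n + k]) for k in range(n)]
        coarse = [merged[i * pw:(i + 1) * pw] for i in range(ph)]
        up = unpatchify(coarse, p)
        for i in range(h):
            for j in range(w):
                fused[i][j] += up[i][j] / len(patch_sizes)
    return fused


if __name__ == "__main__":
    rgb = [[float(i + j) for j in range(4)] for i in range(4)]
    thermal = [[float(i * j) for j in range(4)] for i in range(4)]
    for row in fuse_multiscale(rgb, thermal):
        print([round(v, 3) for v in row])
```

Small patch sizes keep fine-grained detail while large ones summarize coarse context, which is the intuition behind extracting information at several granularities. In a real model the tokens would be multi-channel feature vectors across several frames, and the running mean would be replaced by a learned sequence model.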