MambaST: A Plug-and-Play Cross-Spectral Spatial-Temporal Fuser for Efficient Pedestrian Detection

Xiangbo Gao, Asiegbu Miracle Kanu-Asiegbu, Xiaoxiao Du

2024-08-02

Abstract: This paper proposes MambaST, a plug-and-play cross-spectral spatial-temporal fusion pipeline for efficient pedestrian detection. Several challenges exist for pedestrian detection in autonomous driving applications. First, it is difficult to perform accurate detection using RGB cameras under dark or low-light conditions. Cross-spectral systems must be developed to integrate complementary information from multiple sensor modalities, such as thermal and visible cameras, to improve the robustness of the detections. Second, pedestrian detection models are latency-sensitive. Efficient and easy-to-scale detection models with fewer parameters are highly desirable for real-time applications such as autonomous driving. Third, pedestrian video data provides spatial-temporal correlations of pedestrian movement. It is beneficial to incorporate temporal as well as spatial information to enhance pedestrian detection. This work leverages recent advances in the state space model (Mamba) and proposes a novel Multi-head Hierarchical Patching and Aggregation (MHHPA) structure to extract both fine-grained and coarse-grained information from both RGB and thermal imagery. Experimental results show that the proposed MHHPA is an effective and efficient alternative to a Transformer model for cross-spectral pedestrian detection. Our proposed model also achieves superior performance on small-scale pedestrian detection. The code is available at https://github.com/XiangboGaoBarry/MambaST.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper targets three challenges in pedestrian detection for autonomous driving:

1. **Accurate detection under low-light conditions**: RGB cameras alone struggle to detect pedestrians at night or in low light. Cross-spectral systems that combine complementary information from multiple sensors, such as thermal and visible-light cameras, are needed to make detection more robust.
2. **Efficiency of real-time detection**: Pedestrian detection models are latency-sensitive, especially in real-time settings such as autonomous driving. This calls for efficient, easy-to-scale detection models with fewer parameters.
3. **Fusion of spatial-temporal information**: Pedestrian video data carries spatial-temporal correlations of pedestrian movement, and using them can improve detection. Existing multi-modal fusion methods mainly fuse spatial information within a single frame and do not fully account for temporal information.

To address these challenges, the paper proposes MambaST, a cross-spectral spatial-temporal fusion pipeline built on the state space model (Mamba). It introduces a Multi-head Hierarchical Patching and Aggregation (MHHPA) structure that extracts both fine-grained and coarse-grained information from RGB and thermal imagery. Experimental results show that MHHPA is an effective and efficient alternative to a Transformer model for cross-spectral pedestrian detection, and that the proposed model achieves superior performance on small-scale pedestrian detection.

In summary, MambaST aims to improve pedestrian detection accuracy under low-light conditions, keep the model efficient enough for real-time use, and fuse spatial-temporal information to strengthen detection.
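The idea of extracting fine-grained and coarse-grained information from two modalities can be illustrated with a toy sketch. This is not the authors' MHHPA implementation: the function names (`patchify`, `hierarchical_tokens`, `unpatchify`, `aggregate`), the patch sizes, the average pooling, and the simple averaging used in place of a Mamba block are all assumptions made for illustration. It only shows how one pair of single-channel RGB and thermal feature maps could be tokenized at several patch sizes and merged back into one map.

```python
def patchify(feature, patch):
    """Average-pool a 2D map (list of rows) into non-overlapping patch x patch tokens."""
    h, w = len(feature), len(feature[0])
    assert h % patch == 0 and w % patch == 0, "map size must be divisible by patch"
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            block = [feature[top + i][left + j]
                     for i in range(patch) for j in range(patch)]
            tokens.append(sum(block) / len(block))
    return tokens


def hierarchical_tokens(rgb, thermal, patch_sizes=(1, 2, 4)):
    """Build one token sequence per scale: RGB tokens followed by thermal tokens."""
    return {p: patchify(rgb, p) + patchify(thermal, p) for p in patch_sizes}


def unpatchify(tokens, h, w, patch):
    """Nearest-neighbour upsampling of a token grid back to an h x w map."""
    cols = w // patch
    return [[tokens[(y // patch) * cols + (x // patch)] for x in range(w)]
            for y in range(h)]


def aggregate(rgb, thermal, patch_sizes=(1, 2, 4)):
    """Fuse the two modalities at every scale and average the scales."""
    h, w = len(rgb), len(rgb[0])
    fused = [[0.0] * w for _ in range(h)]
    for p, seq in hierarchical_tokens(rgb, thermal, patch_sizes).items():
        n = (h // p) * (w // p)
        rgb_tok, thermal_tok = seq[:n], seq[n:]
        # Placeholder mixing step (assumption): a real fuser would run a
        # sequence model such as a Mamba block over `seq` instead of averaging.
        mixed = [(a + b) / 2 for a, b in zip(rgb_tok, thermal_tok)]
        upsampled = unpatchify(mixed, h, w, p)
        for y in range(h):
            for x in range(w):
                fused[y][x] += upsampled[y][x] / len(patch_sizes)
    return fused
```

Small patch sizes keep fine detail at the cost of long token sequences, while large patch sizes give short sequences that summarize coarse context. Combining several scales is one plausible way to get both, under the assumptions stated above.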

Towards Accurate Dense Pedestrian Detection Via Occlusion-Prediction Aware Label Assignment and Hierarchical-NMS

Online Multipedestrian Tracking Based on Fused Detections of Millimeter Wave Radar and Vision

Locality guided cross-modal feature aggregation and pixel-level fusion for multispectral pedestrian detection

Transformer fusion and histogram layer multispectral pedestrian detection network

Spatio-Contextual Deep Network Based Multimodal Pedestrian Detection For Autonomous Driving

Adaptive Multi-Pedestrian Tracking by Multi-Sensor: Track-to-Track Fusion Using Monocular 3D Detection and MMW Radar

An Empirical Study of Mamba-based Pedestrian Attribute Recognition

Multi-Scale Structure Perception and Global Context-Aware Method for Small-Scale Pedestrian Detection

Multi-window Transformer Parallel Fusion Feature Pyramid Network for Pedestrian Orientation Detection

MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely Coupled Fusion and Modality-Balanced Optimization

Illumination and Temperature-Aware Multispectral Networks for Edge-Computing-Enabled Pedestrian Detection

Cascaded information enhancement and cross-modal attention feature fusion for multispectral pedestrian detection

Pedestrian Detection Using Multi-Channel Visual Feature Fusion by Learning Deep Quality Model

Too Far to See? Not Really! --- Pedestrian Detection with Scale-aware Localization Policy

Cross-Modality Proposal-guided Feature Mining for Unregistered RGB-Thermal Pedestrian Detection

TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection

Optimal Fusion-based Asymmetric Two-stream Networks for Multispectral Image Pedestrian Detection

MAF-YOLO: Multi-modal attention fusion based YOLO for pedestrian detection

Deep saliency detection-based pedestrian detection with multispectral multi-scale features fusion network