HDI-Former: Hybrid Dynamic Interaction ANN-SNN Transformer for Object Detection Using Frames and Events

Dianze Li, Jianing Li, Xu Liu, Zhaokun Zhou, Xiaopeng Fan, Yonghong Tian
2024-11-27
Abstract: Combining the complementary benefits of frames and events has been widely used for object detection in challenging scenarios. However, most object detection methods use two independent Artificial Neural Network (ANN) branches, limiting cross-modality information interaction across the two visual streams and encountering challenges in extracting temporal cues from event streams with low power consumption. To address these challenges, we propose HDI-Former, a Hybrid Dynamic Interaction ANN-SNN Transformer, marking the first trial to design a directly trained hybrid ANN-SNN architecture for high-accuracy and energy-efficient object detection using frames and events. Technically, we first present a novel semantic-enhanced self-attention mechanism that strengthens the correlation between image encoding tokens within the ANN Transformer branch for better performance. Then, we design a Spiking Swin Transformer branch to model temporal cues from event streams with low power consumption. Finally, we propose a bio-inspired dynamic interaction mechanism between ANN and SNN sub-networks for cross-modality information interaction. The results demonstrate that our HDI-Former outperforms eleven state-of-the-art methods and our four baselines by a large margin. Our SNN branch also shows comparable performance to the ANN with the same architecture while consuming 10.57$\times$ less energy on the DSEC-Detection dataset. Our open-source code is available in the supplementary material.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to efficiently combine the information of two heterogeneous visual streams, frames and events, in object detection tasks to achieve high-precision and low-energy-consumption object detection. Specifically, existing multi-modal object detection methods usually use two independent artificial neural network (ANN) branches to process frames and events, which limits the cross-modal information interaction and leads to high energy consumption. In addition, convolutional neural networks (CNNs) have problems of high computational complexity and large energy consumption when processing event streams. Therefore, this paper proposes a Hybrid Dynamic Interaction Artificial Neural Network-Spiking Neural Network Transformer (HDI-Former), aiming to overcome the above challenges by designing a new attention mechanism, an efficient Spiking Swin Transformer branch, and a bio-inspired dynamic interaction mechanism, thereby achieving high-performance and low-energy-consumption object detection.

### Main contributions of the paper:

1. **Proposed a directly-trained Hybrid Dynamic Interaction Artificial Neural Network-Spiking Neural Network Transformer (HDI-Former)**: This model can achieve high precision and low energy consumption when using frames and events for object detection.
2. **Introduced a semantically-enhanced self-attention mechanism**: It enhances the correlation between image-encoding tokens in the ANN branch and improves the detection performance.
3. **Designed an efficient Spiking Swin Transformer branch**: This branch can utilize the rich temporal cues in the event stream while maintaining performance comparable to the corresponding ANN, but with an energy consumption only 1/10.57 of that of the ANN.
4. **Introduced a bio-inspired dynamic interaction mechanism**: It realizes cross-modal information interaction between ANN and SNN sub-networks and makes full use of the complementary characteristics of frames and events.

### Experimental results:

- **Frame modality**: The proposed SEST branch significantly outperforms existing frame-based methods, such as Faster R-CNN, RetinaNet, etc., on multiple benchmark datasets.
- **Event modality**: The Spiking Swin Transformer branch performs significantly better than existing SNN methods, such as EMS-YOLO, on the event modality and significantly reduces energy consumption while maintaining performance.
- **Advantages of multi-modal fusion**: HDI-Former performs excellently in multi-modal fusion, significantly outperforming other multi-modal methods (such as FPN-fusion, SFNet, and SODFormer), and has obvious advantages in terms of energy consumption.

In conclusion, through innovative design and optimization, this paper successfully addresses the challenge of efficiently combining frame and event information in object detection tasks and provides a new solution for low-energy-consumption, high-precision object detection.
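
The two energy figures quoted above ("10.57× less energy" and "1/10.57 of that of the ANN") describe the same ratio. As a small worked example, the sketch below converts that reported factor into the share of ANN energy the SNN branch uses and the corresponding saving. The function name `energy_summary` is a hypothetical helper introduced only for illustration; the sole input taken from the paper is the factor 10.57, and no absolute energy values are assumed.

```python
def energy_summary(reduction_factor: float) -> tuple[float, float]:
    """Return (fraction of ANN energy used by the SNN, fractional saving).

    `reduction_factor` is the reported "N times less energy" figure.
    This is plain arithmetic on that ratio, not the paper's energy model.
    """
    if reduction_factor <= 0:
        raise ValueError("reduction_factor must be positive")
    fraction_used = 1.0 / reduction_factor
    return fraction_used, 1.0 - fraction_used


# 10.57 is the factor reported for the SNN branch on DSEC-Detection.
fraction_used, saving = energy_summary(10.57)
print(f"SNN uses {fraction_used:.1%} of the ANN's energy ({saving:.1%} saving)")
# prints: SNN uses 9.5% of the ANN's energy (90.5% saving)
```

In other words, under the reported ratio the SNN branch consumes roughly a tenth of the energy of its ANN counterpart.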