Abstract: Combining the complementary benefits of frames and events has been widely used for object detection in challenging scenarios. However, most object detection methods use two independent Artificial Neural Network (ANN) branches, which limits cross-modality information interaction between the two visual streams and makes it difficult to extract temporal cues from event streams with low power consumption. To address these challenges, we propose HDI-Former, a Hybrid Dynamic Interaction ANN-SNN Transformer, the first attempt to design a directly trained hybrid ANN-SNN architecture for high-accuracy and energy-efficient object detection using frames and events. Technically, we first present a novel semantic-enhanced self-attention mechanism that strengthens the correlation between image encoding tokens within the ANN Transformer branch for better performance. Then, we design a Spiking Swin Transformer branch to model temporal cues from event streams with low power consumption. Finally, we propose a bio-inspired dynamic interaction mechanism between the ANN and SNN sub-networks for cross-modality information interaction. The results demonstrate that our HDI-Former outperforms eleven state-of-the-art methods and our four baselines by a large margin. Our SNN branch also shows comparable performance to the ANN with the same architecture while consuming 10.57$\times$ less energy on the DSEC-Detection dataset. Our open-source code is available in the supplementary material.
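
To make the reported energy figure concrete, the short Python sketch below converts the 10.57$\times$ ratio into the fraction of ANN energy the SNN branch would use. It assumes that "10.57$\times$ less energy" means the ANN-to-SNN energy ratio is 10.57. The names `ENERGY_RATIO` and `relative_energy` are illustrative and do not come from the paper.

```python
# Reported ANN-to-SNN energy ratio on DSEC-Detection (from the abstract).
# Assumption: "10.57x less energy" means E_ANN / E_SNN = 10.57.
ENERGY_RATIO = 10.57


def relative_energy(ratio: float) -> float:
    """Return SNN energy as a fraction of ANN energy for a given ratio."""
    if ratio <= 0:
        raise ValueError("energy ratio must be positive")
    return 1.0 / ratio


snn_fraction = relative_energy(ENERGY_RATIO)
saving = 1.0 - snn_fraction

print(f"SNN energy as a fraction of ANN energy: {snn_fraction:.4f}")
print(f"Relative energy saving: {saving:.2%}")
```

Under this reading, the SNN branch uses roughly 9.5% of the energy of the equivalent ANN branch, a saving of about 90.5%.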

What problem does this paper attempt to address?

The paper addresses how to efficiently combine two heterogeneous visual streams, frames and events, so that object detection is both highly accurate and energy-efficient. Existing multi-modal object detection methods usually process frames and events with two independent artificial neural network (ANN) branches, which limits cross-modal information interaction and leads to high energy consumption. In addition, convolutional neural networks (CNNs) incur high computational complexity and energy consumption when processing event streams. The paper therefore proposes HDI-Former, a Hybrid Dynamic Interaction ANN-SNN (Artificial Neural Network and Spiking Neural Network) Transformer. It combines a new attention mechanism, an efficient Spiking Swin Transformer branch, and a bio-inspired dynamic interaction mechanism to achieve high-performance, low-energy object detection.

### Main contributions of the paper:

1. **Proposed a directly trained Hybrid Dynamic Interaction ANN-SNN Transformer (HDI-Former)**: The model achieves high accuracy and low energy consumption when using frames and events for object detection.
2. **Introduced a semantic-enhanced self-attention mechanism**: It strengthens the correlation between image encoding tokens in the ANN branch and improves detection performance.
3. **Designed an efficient Spiking Swin Transformer branch**: This branch exploits the rich temporal cues in the event stream and performs comparably to the corresponding ANN while consuming only 1/10.57 of its energy.
4. **Introduced a bio-inspired dynamic interaction mechanism**: It enables cross-modal information interaction between the ANN and SNN sub-networks and makes full use of the complementary characteristics of frames and events.

### Experimental results:

- **Frame modality**: The proposed SEST branch significantly outperforms existing frame-based methods, such as Faster R-CNN and RetinaNet, on multiple benchmark datasets.
- **Event modality**: The Spiking Swin Transformer branch performs significantly better than existing SNN methods, such as EMS-YOLO, and significantly reduces energy consumption while maintaining performance.
- **Advantages of multi-modal fusion**: HDI-Former significantly outperforms other multi-modal methods (such as FPN-fusion, SFNet, and SODFormer) and also consumes less energy.

In conclusion, the paper addresses the challenge of efficiently combining frame and event information in object detection and offers a new approach to low-energy, high-accuracy object detection.
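
As background for why a spiking branch can be energy-efficient, the following Python sketch simulates a single leaky integrate-and-fire (LIF) neuron, a common spiking neuron model. This is a generic illustration, not the paper's Spiking Swin Transformer: the summary does not state which neuron model HDI-Former uses, and the function name `lif_spikes`, the threshold, the leak factor, and the input values are all assumptions chosen for the example.

```python
def lif_spikes(inputs, threshold=1.0, leak=0.5):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    Illustrative only: the threshold and leak values are assumptions,
    not parameters taken from the paper.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leaky integration: decay the old potential, then add the new input.
        membrane = leak * membrane + current
        if membrane >= threshold:
            spikes.append(1)   # emit a binary spike
            membrane = 0.0     # hard reset after firing
        else:
            spikes.append(0)   # stay silent at this time step
    return spikes


# Hypothetical input currents over six time steps.
event_input = [0.7, 0.7, 0.0, 0.0, 0.9, 0.9]
spike_train = lif_spikes(event_input)
print(spike_train)
```

The output is a sparse binary sequence: the neuron fires only when accumulated input crosses the threshold, so downstream computation is needed only at those time steps. This sparsity is the usual argument for the lower energy consumption of SNNs compared with ANNs of the same architecture.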

HDI-Former: Hybrid Dynamic Interaction ANN-SNN Transformer for Object Detection Using Frames and Events
