Spatio-Temporal Aggregation Transformer for Object Detection with Neuromorphic Vision Sensors

Zhaoxuan Guo, Jiandong Gao, Guangyuan Ma, Jiangtao Xu
DOI: https://doi.org/10.1109/jsen.2024.3392973
IF: 4.3
2024-01-01
IEEE Sensors Journal
Abstract: To improve the accuracy of object detection with event-based neuromorphic vision sensors, this paper proposes an event-based detector named Spatio-Temporal Aggregation Transformer (STAT). First, to collect enough event information for the detection task, STAT uses a density-based adaptive sampling (DAS) module that adaptively samples the continuous event stream into multiple groups. The module determines the sampling termination condition by quantifying the velocity and size of objects. Second, STAT integrates a sparse event tensor (SET) to make the event stream compatible with traditional vision algorithms. SET maps events to a dense representation by fitting the optimal mapping function end to end, which mitigates the loss of spatiotemporal information in the event stream. Finally, to enhance the features of slowly moving objects, a lightweight and efficient triaxial vision transformer (TVT) is designed to model global features and integrate historical motion information. Experimental evaluations on two benchmark datasets show that STAT achieves a mean average precision (mAP) of 68.2% on the N-Caltech101 dataset and 49.9% on the Gen1 dataset. On the Gen1 dataset, the detection accuracy of STAT exceeds that of state-of-the-art methods by 2.0%. The code of this project is available at https://github.com/TJU-guozhaoxuan/STAT.
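To make the idea of adaptive sampling concrete, the following is a minimal sketch of how an event stream might be split into variable-length groups. It is not the DAS module from the paper: the abstract says DAS derives its termination condition from the velocity and size of objects, whereas this sketch substitutes a simpler, assumed rule (close a group once the fraction of active pixels reaches a threshold, or once a maximum duration elapses). The function name `adaptive_groups`, the parameters `density_threshold` and `max_duration`, and the `(x, y, t, polarity)` event layout are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumed event layout: (x, y, timestamp in seconds, polarity).
Event = Tuple[int, int, float, int]


def adaptive_groups(
    events: Iterable[Event],
    width: int,
    height: int,
    density_threshold: float = 0.05,  # hypothetical: fraction of active pixels
    max_duration: float = 0.05,       # hypothetical: time cap per group (s)
) -> List[List[Event]]:
    """Split a time-ordered event stream into variable-length groups.

    A group is closed as soon as either
      * the fraction of distinct pixels that fired reaches density_threshold, or
      * the group spans at least max_duration seconds.
    This is a stand-in rule, not the termination condition used by DAS.
    """
    groups: List[List[Event]] = []
    current: List[Event] = []
    active: Set[Tuple[int, int]] = set()
    start_t = None

    for ev in events:
        x, y, t, _ = ev
        if start_t is None:
            start_t = t
        current.append(ev)
        active.add((x, y))

        density = len(active) / (width * height)
        if density >= density_threshold or (t - start_t) >= max_duration:
            groups.append(current)
            current, active, start_t = [], set(), None

    if current:  # keep the trailing, possibly incomplete group
        groups.append(current)
    return groups


# Fast motion touches many pixels quickly; slow motion keeps hitting one pixel.
demo_events = [
    (0, 0, 0.000, 1), (1, 0, 0.001, 1), (2, 0, 0.002, 1), (3, 0, 0.003, 1),
    (0, 0, 0.010, 1), (0, 0, 0.030, 0), (0, 0, 0.070, 1),
]
demo_groups = adaptive_groups(demo_events, width=4, height=4,
                              density_threshold=0.25, max_duration=0.05)
print([len(g) for g in demo_groups])
```

In this toy stream the first group closes on the density rule after a few milliseconds, while the second closes only on the time cap, which illustrates why a fixed sampling window suits fast and slow objects poorly at the same time.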