Voxel-based Multi-scale Transformer Network for Event Stream Processing

Daikun Liu,Teng Wang,Changyin Sun
DOI: https://doi.org/10.1109/tcsvt.2023.3301176
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract:Event cameras are bio-inspired dynamic vision sensors that are superior to frame-based cameras in terms of low power consumption, high dynamic range, and high temporal resolution in computer vision tasks. Recent advances in voxel-based representation learning have successfully exploited the sparsity of events with low computational complexity, but face challenges in extracting spatio-temporal features within voxels and representative global dependencies between voxels, thus limiting their representation power. In this work, towards a better trade-off between accuracy and computation overhead, we propose a novel voxel-based multi-scale transformer network (VMST-Net) to process event streams. Specifically, VMST-Net projects events within voxels into multi-channel frames along the time axis, such that 2D convolutions could be leveraged to encode spatio-temporal features in voxels. Then, VMST-Net utilizes a novel multi-scale multi-head self-attention (MSMHSA) mechanism with a multi-scale fusion (MSF) module that allows different heads within each layer to attend different scale 3D neighborhoods to adaptively aggregate the coarse-to-fine voxel features with little computational costs and parameters. Moreover, to model effective global features while saving computations, we aggregate features in a local-to-global manner by enlarging the coverage of 3D neighborhoods as the network gets deeper. Extensive experimental results on benchmark datasets demonstrate that our model advances state-of-the-art accuracy with low model complexity and computational complexity in all three visual tasks, including object classification, action recognition, and human pose estimation.
What problem does this paper attempt to address?