Efficient dual attention SlowFast networks for video action recognition

Dafeng Wei, Ye Tian, Liqing Wei, Hong Zhong, Siqian Chen, Shiliang Pu, Hongtao Lu
DOI: https://doi.org/10.1016/j.cviu.2022.103484
IF: 4.886
2022-01-01
Computer Vision and Image Understanding
Abstract: Video data differ from static image data mainly in the temporal dimension. Many video action recognition networks use two-stream models that learn spatial and temporal information separately and then fuse them to improve performance. We propose a cross-modality dual attention fusion module, named CMDA, that explicitly exchanges spatial-temporal information between the two pathways of two-stream SlowFast networks. In addition, given the computational cost of these heavy models and the low accuracy of existing lightweight models, we propose several efficient two-stream SlowFast networks built on well-designed efficient 2D networks such as GhostNet and ShuffleNetV2. Experiments demonstrate that the proposed CMDA fusion module improves the performance of SlowFast, and that our efficient two-stream models achieve a consistent increase in accuracy with only a small overhead in FLOPs. Our code and pre-trained models will be made available at https://github.com/weidafeng/Efficient-SlowFast.