EAR: Efficient Action Recognition with Local-Global Temporal Aggregation

Can Zhang, Yuexian Zou, Guang Chen, Lei Gan
DOI: https://doi.org/10.1016/j.imavis.2021.104329
IF: 3.86
2021-01-01
Image and Vision Computing
Abstract: Temporal modeling in videos is crucial for action recognition. Traditionally, it involves feature aggregation for both local motion and global semantics. In this paper, we propose an Efficient Action Recognition network (EAR), which includes a Persistence of Appearance (PA) module and a Various-timescale Aggregation (VA) module for local and global temporal aggregation, respectively. For local motion aggregation, instead of relying on time-consuming optical flow as previous methods do, our PA module calculates pixel-wise differences in feature space as the motion representation, which is much more efficient (8196 fps vs. 8 fps for optical flow). In addition, to capture global semantic hints, we propose the VA module, which adaptively emphasizes expressive features and suppresses less informative ones across various timescales. Empowered by this local-global temporal aggregation, our EAR achieves competitive results on six challenging action recognition benchmarks at low FLOPs. (c) 2021 Elsevier B.V. All rights reserved.
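To make "pixel-wise differences in feature space" concrete, here is a minimal, dependency-free sketch of a feature-difference motion map. It is not the authors' implementation. The function name `feature_difference_motion`, the nested-list layout `features[t][c][y][x]`, and the choice to collapse channels with a Euclidean norm are all assumptions made for illustration; the abstract states only that differences are taken pixel-wise in feature space.

```python
import math


def feature_difference_motion(features):
    """Return one motion map per pair of adjacent frames.

    `features` is a nested list indexed as features[t][c][y][x]
    (time, channel, row, column). This layout is hypothetical.
    For each pixel, the per-channel differences between frame t+1
    and frame t are collapsed with a Euclidean norm (an assumption,
    not something the abstract specifies).
    """
    if len(features) < 2:
        raise ValueError("need at least two frames to take a difference")

    motion_maps = []
    # Walk over adjacent frame pairs (t, t+1).
    for prev, curr in zip(features, features[1:]):
        channels = len(prev)
        height = len(prev[0])
        width = len(prev[0][0])
        frame_map = [
            [
                math.sqrt(
                    sum(
                        (curr[c][y][x] - prev[c][y][x]) ** 2
                        for c in range(channels)
                    )
                )
                for x in range(width)
            ]
            for y in range(height)
        ]
        motion_maps.append(frame_map)
    return motion_maps
```

Given T frames, the function returns T-1 maps of shape height by width. Regions whose features do not change between adjacent frames map to zero, so only changing regions produce a response. Unlike optical flow, this requires no iterative estimation, only a subtraction and a reduction per pixel.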