An Efficient Motion Visual Learning Method for Video Action Recognition

Bin Wang, Faliang Chang, Chunsheng Liu, Wenqian Wang, Ruiyi Ma
DOI: https://doi.org/10.1016/j.eswa.2024.124596
IF: 8.5
2024-01-01
Expert Systems with Applications
Abstract: Efficient spatio-temporal information modeling is a key research component of the action recognition problem. Previous approaches focus on enhancing backbone features individually using hierarchical structures, and most of them fail to strike a good balance in how adequately features interact within the structure. In this work, we propose an effective Multi-dimensional Adaptive Fusion Network (MDAF-Net), which can be embedded into mainstream action recognition backbones in a plug-and-play manner to fully activate the transfer and representation of action features in the deep network. Specifically, MDAF-Net contains two main components: the Adaptive Temporal Capture Module (ATCM) and the Extended Spatial and Channel Module (ESCM). The ATCM suppresses the over-expression of similar features in adjacent frames and activates the expression of motion flow information. The ESCM further improves temporal modeling efficiency by extending the spatial receptive field and enhancing channel attention. Extensive experiments on several challenging action recognition benchmarks, such as Something-Something V1 & V2 and Kinetics-400, demonstrate that the proposed MDAF-Net achieves state-of-the-art or competitive performance.
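The abstract does not specify how the ATCM is built, so the following is only a rough illustration of the general idea it describes: using differences between adjacent frames as a motion cue and re-weighting features so that static content is expressed less strongly. The function name `temporal_difference_gate`, the `(T, C, H, W)` layout, the spatial average pooling, and the sigmoid gate are all assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np


def temporal_difference_gate(features):
    """Re-weight per-frame features by a frame-difference motion cue.

    Hypothetical illustration only; this is not the paper's ATCM.
    features: float array of shape (T, C, H, W), T frames and C channels.
    Returns an array of the same shape.
    """
    # Difference between each frame and its successor.
    # The last frame has no successor, so its difference stays zero.
    diff = np.zeros_like(features)
    diff[:-1] = features[1:] - features[:-1]

    # Average the absolute difference over space: one motion score
    # per frame and channel, shape (T, C, 1, 1).
    motion = np.abs(diff).mean(axis=(2, 3), keepdims=True)

    # Sigmoid gate (an assumed choice). Because motion >= 0, the gate lies
    # in [0.5, 1): static channels are halved, changing channels are kept
    # closer to their original magnitude.
    gate = 1.0 / (1.0 + np.exp(-motion))

    # Broadcast the (T, C, 1, 1) gate over the spatial dimensions.
    return features * gate


if __name__ == "__main__":
    clip = np.random.rand(8, 16, 14, 14)  # 8 frames, 16 channels, 14x14 maps
    print(temporal_difference_gate(clip).shape)
```

In a real network such a gate would usually be learned (for example with convolutions before the sigmoid) rather than fixed; the fixed version here only shows the data flow.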