SSTA-Net: Self-supervised Spatio-Temporal Attention Network for Action Recognition

Yihan Li, Wenwen Zhang, Zhao Pei
DOI: https://doi.org/10.1007/978-3-031-46308-2_32
2023-01-01
Abstract: Action recognition aims to identify action categories and features in videos by analyzing actions and behavior patterns. The task is significant to the development of intelligent security, autonomous driving, smart homes, and other fields. However, current methods fail to adequately model the spatio-temporal relationships of actions in videos, and video annotation is time-consuming and expensive. This paper proposes a Self-Supervised Spatio-Temporal Attention Network (SSTA-Net) for action recognition to address these problems. First, we train the network with a self-supervised method, which does not require large amounts of labeled data and can uncover unknown or hidden information in the data. Second, for feature extraction, we propose a Multi-Scale Convolution Attention Module (MC-AM). By convolving the input image at different scales, it enhances detail and edge information and improves the image quality of the originally sampled frames. Finally, we propose a Spatio-Temporal Attention Module (ST-A), which captures the sensitivity of spatio-temporal signals in the video and effectively improves action recognition accuracy.
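The abstract says only that MC-AM convolves the input frame at several scales to enhance details and edges; it does not say how the scales are fused or whether the kernels are learned. The block below is a minimal, dependency-free sketch of that general idea, not the authors' module. Assumptions (all hypothetical): fixed box filters of sizes 3, 5 and 7 stand in for learned convolutions, a "detail map" at each scale is the frame minus its blurred copy, and a per-pixel softmax over detail magnitudes plays the role of the attention that weights the scales. The names `box_blur`, `multi_scale_attention_enhance`, `scales` and `strength` are illustrative only.

```python
import math


def box_blur(img, k):
    """Mean filter of odd size k; border pixels average only in-bounds neighbors."""
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total, count = 0.0, 0
            for y in range(max(0, i - r), min(h, i + r + 1)):
                for x in range(max(0, j - r), min(w, j + r + 1)):
                    total += img[y][x]
                    count += 1
            out[i][j] = total / count
    return out


def multi_scale_attention_enhance(img, scales=(3, 5, 7), strength=1.0):
    """Hypothetical MC-AM-style enhancement of one grayscale frame (list of rows).

    Assumptions (not from the paper): fixed box filters replace learned
    convolutions, and a per-pixel softmax over detail magnitudes acts as the
    attention that weights the scales.
    """
    h, w = len(img), len(img[0])
    # One detail (high-frequency) map per scale: the frame minus its blurred copy.
    details = []
    for k in scales:
        blurred = box_blur(img, k)
        details.append([[img[i][j] - blurred[i][j] for j in range(w)] for i in range(h)])

    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Attention over scales: softmax of |detail| at this pixel
            # (the maximum is subtracted for numerical stability).
            mags = [abs(d[i][j]) for d in details]
            top = max(mags)
            exps = [math.exp(m - top) for m in mags]
            z = sum(exps)
            fused = sum((e / z) * d[i][j] for e, d in zip(exps, details))
            # Add the attention-weighted detail back to sharpen edges.
            out[i][j] = img[i][j] + strength * fused
    return out


# Usage: a vertical step edge; the pixels on either side of it are pushed apart.
frame = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
enhanced = multi_scale_attention_enhance(frame)
for row in enhanced:
    print([round(v, 3) for v in row])
```

Normalizing each blur by the number of in-bounds pixels keeps a flat frame unchanged, so only edges and fine texture are amplified, while the softmax lets whichever scale responds most strongly at a pixel dominate the fused detail there.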