Spatial-temporal Pooling for Action Recognition in Videos
Jiaming Wang, Zhenfeng Shao, Xiao Huang, Tao Lu, Ruiqian Zhang, Xianwei Lv
DOI: https://doi.org/10.1016/j.neucom.2021.04.071
IF: 6
2021-01-01
Neurocomputing
Abstract: Over the past decade, deep convolutional neural networks have demonstrated great effectiveness in action recognition with both RGB and optical flow inputs. However, existing studies generally treat all frames and pixels equally, which can make models less robust. In this paper, we propose a novel parameter-free spatial–temporal pooling block (referred to as STP) for action recognition in videos to address this challenge. STP learns spatial and temporal weights, which are then used to guide information compression. Unlike other temporal pooling layers, STP is more efficient because it discards the non-informative frames in a given clip. In addition, STP applies a novel loss function that forces the model to learn from sparse and discriminative frames. Moreover, we introduce a dataset for ferry action classification, named Ferryboat-4, which includes four categories: Inshore, Offshore, Traffic, and Negative. This dataset can be used to identify ferries with abnormal behaviors, providing essential information to support the supervision, management, and monitoring of ships. All videos are acquired via real-world cameras. We perform extensive experiments on publicly available datasets as well as Ferryboat-4 and find that the proposed method outperforms several state-of-the-art methods in action classification. Source code and datasets are available at https://github.com/jiaming-wang/STP.
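The abstract does not give the STP formulation, so the following is only a minimal illustrative sketch of the general idea it describes: weighting pixels and frames unequally, discarding low-scoring frames, and compressing a clip into one descriptor. The function names (`softmax`, `weighted_spatiotemporal_pool`), the saliency and scoring heuristics, and the `keep` parameter are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def weighted_spatiotemporal_pool(feats, keep=4):
    """Pool clip features of shape (T, C, H, W) into a (C,) descriptor.

    Hypothetical heuristic, not the paper's STP block:
      1. spatial weights = softmax of the channel-mean activation per frame
      2. temporal scores = L2 norm of each spatially pooled frame descriptor
      3. only the `keep` highest-scoring frames get non-zero temporal weight
    """
    t, c, h, w = feats.shape
    flat = feats.reshape(t, c, h * w)

    # Step 1: per-frame spatial weights over the H*W positions (no parameters).
    saliency = flat.mean(axis=1)                 # (T, H*W)
    spatial_w = softmax(saliency, axis=1)        # rows sum to 1

    # Spatially weighted pooling gives one descriptor per frame.
    frame_desc = np.einsum("tcp,tp->tc", flat, spatial_w)  # (T, C)

    # Step 2: score each frame by the magnitude of its descriptor.
    scores = np.linalg.norm(frame_desc, axis=1)  # (T,)

    # Step 3: keep the top-scoring frames and renormalize their weights.
    keep = min(keep, t)
    top = np.argsort(scores)[-keep:]
    temporal_w = np.zeros(t)
    temporal_w[top] = softmax(scores[top], axis=0)

    clip_desc = temporal_w @ frame_desc          # (C,)
    return clip_desc, temporal_w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((8, 16, 7, 7))    # 8 frames, 16 channels, 7x7 map
    desc, weights = weighted_spatiotemporal_pool(clip, keep=4)
    print(desc.shape, np.count_nonzero(weights))
```

Frames outside the top `keep` receive exactly zero weight, which mirrors the abstract's point that non-informative frames are discarded rather than averaged in. The actual weighting scheme and loss function should be taken from the paper and the linked source code.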