You watch once more: a more effective CNN architecture for video spatio-temporal action localization

Yefeng Qin, Lei Chen, Xianye Ben, Mingqiang Yang
DOI: https://doi.org/10.1007/s00530-023-01254-z
IF: 3.9
2024-02-01
Multimedia Systems
Abstract: Spatio-temporal action localization (STAL) requires detecting both the action and the position of each individual in a scene. Many existing methods model spatio-temporal information poorly and often neglect inference speed and practical deployment. To address these problems, we propose a new end-to-end spatio-temporal action localization network called You Watch Once More (YWOM), which uses two backbones to extract spatio-temporal information effectively. We propose three measures to improve localization and recognition accuracy while preserving inference speed. First, a new feature fusion mechanism based on frequency channel attention (FCA) effectively fuses the features extracted by the different backbones. Second, a new loss function speeds up bounding-box regression and convergence: the SIOU regression loss replaces the smooth L1 loss to help the model converge stably. Third, a lateral connection mechanism is designed so that more backbones can be applied to the network structure. Experimental results demonstrate that YWOM achieves online inference speed and performs well on spatio-temporal action localization tasks. YWOM outperforms other related works, including YOWO, on the UCF101-24 dataset; the frame-mAP and the video-mAP (0.2) are improved by 4.23% and 1.63%, respectively.
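To illustrate why an IoU-based regression loss might be preferred over smooth L1 for bounding boxes, here is a minimal sketch in plain Python. It is not the SIOU loss used in the paper, which adds further penalty terms on top of the IoU term; it only contrasts smooth L1 with a plain IoU loss. The function names, the `(x1, y1, x2, y2)` box layout, and the `beta` parameter are assumptions made for illustration.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over the four box coordinates (illustrative)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # Quadratic near zero, linear for larger errors.
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, target):
    """Plain IoU loss (1 - IoU); a stand-in here, NOT the paper's SIOU loss."""
    return 1.0 - iou(pred, target)


if __name__ == "__main__":
    small_pred, small_gt = (0, 0, 2, 2), (1, 1, 3, 3)
    # The same geometry scaled up by a factor of 10.
    large_pred, large_gt = (0, 0, 20, 20), (10, 10, 30, 30)
    print("smooth L1:", smooth_l1(small_pred, small_gt), smooth_l1(large_pred, large_gt))
    print("IoU loss: ", iou_loss(small_pred, small_gt), iou_loss(large_pred, large_gt))
```

In this sketch the smooth L1 value grows with box size even though the relative overlap is unchanged, while the IoU loss stays the same for both pairs. That scale invariance is one common motivation for IoU-family regression losses.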
computer science, information systems, theory & methods