A Snippets Relation and Hard-Snippets Mask Network for Weakly-Supervised Temporal Action Localization

Yibo Zhao,Hua Zhang,Zan Gao,Weili Guan,Meng Wang,Shengyong Chen
DOI: https://doi.org/10.1109/tcsvt.2024.3374870
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract:Weakly-supervised temporal action localization (WTAL) is a problem learning an action localization model with only video-level labels available. In recent years, many WTAL methods have developed. However, hard-to-predict snippets near action boundaries are often not considered in these existing approaches, causing action incompleteness and action over-complete issues. To solve these issues, in this work, an end-to-end snippets relation and hard-snippets mask network (SRHN) is proposed. Specifically, a hard-snippets mask module is applied to mask the hard-to-predict snippets adaptively, and in this way, the trained model focuses more on those snippets with low uncertainty. Then, a snippets relation module is designed to capture the relationship among snippets and can make hard-to-predict snippets easy to predict by aggregating the information of multiple temporal receptive fields. Finally, a snippet enhancement loss is further developed to reduce the action probabilities that are not present in videos for hard-to-predict snippets and other snippets, enlarging the action probabilities that exist in videos. Extensive experiments on THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of the SRHN method.
What problem does this paper attempt to address?