Exploiting Human Pose for Weakly-Supervised Temporal Action Localization

Bing Zhu, Tianyu Li, Xinxiao Wu
DOI: https://doi.org/10.1007/978-3-030-31726-3_40
2019-01-01
Abstract: Weakly-supervised temporal action localization aims to predict when and what actions occur in untrimmed videos using only video-level class labels. Most current methods make predictions based on global features and ignore the classification performance of local descriptions of the human body. Additionally, these methods generate incomplete proposals via thresholding, a simplistic and crude strategy. To acquire high-quality proposals, we focus on incorporating local information, i.e., human body poses in videos, and propose a novel method called Class Activation and Pose Pattern (CAPP) for weakly-supervised temporal action localization. In our method, action proposals are generated by two modules: a Class Activation Sequence (CAS) module and a Pose Pattern Sequence (PPS) module. The CAS module fuses global and local features to improve clip-level classification performance, and the PPS module adds complementary proposals with high recall via pose pattern clustering. CAPP outperforms state-of-the-art methods on both the THUMOS-14 and ActivityNet v1.2 datasets, which demonstrates the effectiveness of our method.
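To illustrate why thresholding alone can yield incomplete proposals, the following minimal sketch thresholds a per-clip activation sequence into temporal segments. It shows only the generic thresholding baseline that the abstract criticizes, not the CAPP method itself. The function name `threshold_proposals`, the threshold of 0.5, and the example scores are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def threshold_proposals(scores: List[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return half-open [start, end) clip-index intervals whose scores exceed the threshold.

    This is an illustrative baseline, not the method proposed in the paper.
    """
    proposals: List[Tuple[int, int]] = []
    start = None  # index where the current above-threshold run began
    for i, score in enumerate(scores):
        if score > threshold and start is None:
            start = i  # a new run begins
        elif score <= threshold and start is not None:
            proposals.append((start, i))  # the run ended at the previous clip
            start = None
    if start is not None:
        proposals.append((start, len(scores)))  # run extends to the last clip
    return proposals


# Hypothetical activation scores for six clips of one action class.
cas = [0.1, 0.7, 0.9, 0.4, 0.8, 0.2]
print(threshold_proposals(cas))  # [(1, 3), (4, 5)]
```

In this toy sequence, a single dip below the threshold at clip 3 splits what may be one continuous action into two fragments. This is the kind of incompleteness that complementary, high-recall proposals are meant to offset.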