Weakly-Supervised Temporal Action Localization by Progressive Complementary Learning

Jia-Run Du, Jia-Chang Feng, Kun-Yu Lin, Fa-Ting Hong, Zhongang Qi, Ying Shan, Jian-Fang Hu, Wei-Shi Zheng
DOI: https://doi.org/10.1109/tcsvt.2024.3456795
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Weakly-Supervised Temporal Action Localization (WSTAL) aims to localize and classify action instances in long untrimmed videos with only video-level category labels as supervision. A critical challenge of WSTAL is the large gap between the available video-level supervision and the unavailable snippet-level supervision. Prevailing methods typically assign pseudo labels to snippets, but these pseudo snippet-level labels introduce significant noise. In this work, we address WSTAL from a novel category exclusion perspective, which gradually enhances the snippet-level supervision to bridge the gap. Our proposed Progressive Complementary Learning (ProCL) is inspired by the fact that video-level labels precisely indicate the categories to which no snippet in the video belongs, a cue ignored by previous works. Accordingly, we first exclude these surely non-existent categories through deterministic complementary learning. We then introduce entropy-based pseudo complementary learning, which can exclude more categories for less ambiguous snippets. Furthermore, for the remaining ambiguous snippets, we attempt to reduce the ambiguity by distinguishing foreground actions from the background. Extensive experimental results show that our method achieves new state-of-the-art performance on the THUMOS14, ActivityNet1.3, and MultiTHUMOS benchmarks.
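The category exclusion idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the log(1 - p) form of the loss, and the use of Shannon entropy as the ambiguity measure are all assumptions made for illustration, since the abstract does not give the exact formulation.

```python
import numpy as np


def deterministic_complementary_loss(probs, video_labels, eps=1e-8):
    """Hypothetical sketch of a complementary (exclusion) loss.

    probs:        (T, C) per-snippet class probabilities in [0, 1].
    video_labels: (C,) multi-hot video-level labels (1 = category present).
    Assumption: categories absent from the video label are excluded for
    every snippet by penalizing the probability assigned to them.
    """
    probs = np.asarray(probs, dtype=float)
    absent = np.asarray(video_labels) == 0  # surely non-existent categories
    if not absent.any():
        return 0.0
    # -log(1 - p) grows as a snippet puts mass on an excluded category.
    return float(-np.log(1.0 - probs[:, absent] + eps).mean())


def snippet_entropy(probs, eps=1e-8):
    """Shannon entropy per snippet, used here as an assumed ambiguity score."""
    probs = np.asarray(probs, dtype=float)
    p = probs / probs.sum(axis=1, keepdims=True)  # normalize each row
    return -(p * np.log(p + eps)).sum(axis=1)
```

Under these assumptions, snippets with low entropy would be the candidates for excluding further categories, while high-entropy snippets would be left for the foreground-background step the abstract mentions.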
engineering, electrical & electronic