Weakly-Supervised Temporal Action Localization Via Cross-Stream Collaborative Learning.

Yuan Ji, Xu Jia, Huchuan Lu, Xiang Ruan
DOI: https://doi.org/10.1145/3474085.3475261
2021-01-01
Abstract: Weakly supervised temporal action localization (WTAL) is a challenging task because only video-level category labels are available during training. Without precise temporal annotations, most approaches rely on complementary RGB and optical flow features to predict the start and end frames of each action category in a video. However, existing approaches simply resort to either concatenation or a weighted sum to combine these two modalities for action localization, which ignores the substantial variance between them. In this paper, we present Cross-Stream Collaborative Learning (CSCL) to address these issues. CSCL introduces a cross-stream weighting module that identifies which modality is more robust during training and uses the robust modality to guide the weaker one. Furthermore, we suppress the snippets that have high actionness scores in both modalities to further exploit the complementary property of the two modalities. In addition, we bring the concept of co-training to WTAL and take both modalities into account for pseudo-label generation to help train a stronger model. Extensive experiments on the THUMOS14 and ActivityNet datasets demonstrate that CSCL performs favorably against state-of-the-art methods.
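For readers unfamiliar with the two baseline fusion strategies the abstract contrasts CSCL against, the following is a minimal sketch of concatenation and weighted-sum fusion. It is not the CSCL method; the function names, the `alpha` parameter, and the toy values are all assumptions made for illustration.

```python
# Illustrative sketch only: these are the generic fusion baselines named in
# the abstract, not CSCL. All names and values here are hypothetical.

def fuse_by_concatenation(rgb_feats, flow_feats):
    """Join the RGB and flow feature vectors of each snippet end to end."""
    return [rgb + flow for rgb, flow in zip(rgb_feats, flow_feats)]


def fuse_by_weighted_sum(rgb_scores, flow_scores, alpha=0.5):
    """Blend per-snippet class scores with one fixed weight for all snippets."""
    return [
        [alpha * r + (1.0 - alpha) * f for r, f in zip(rgb_row, flow_row)]
        for rgb_row, flow_row in zip(rgb_scores, flow_scores)
    ]


# Two snippets, each with a 2-dimensional feature per modality.
rgb_feats = [[1.0, 2.0], [3.0, 4.0]]
flow_feats = [[5.0, 6.0], [7.0, 8.0]]
fused_feats = fuse_by_concatenation(rgb_feats, flow_feats)

# Two snippets, each with scores for 2 action classes per modality.
rgb_scores = [[1.0, 0.0], [0.0, 1.0]]
flow_scores = [[0.0, 1.0], [0.0, 1.0]]
fused_scores = fuse_by_weighted_sum(rgb_scores, flow_scores, alpha=0.5)
```

Both baselines treat the two streams identically at every snippet, regardless of which stream is more reliable there; the abstract presents this as the limitation that motivates the cross-stream weighting module.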