Separately Guided Context-Aware Network for Weakly Supervised Temporal Action Detection

Bairong Li, Yifan Pan, Ruixin Liu, Yuesheng Zhu
DOI: https://doi.org/10.1007/s11063-022-11138-4
IF: 2.565
2023-01-01
Neural Processing Letters
Abstract: Weakly supervised temporal action detection uses extracted appearance and motion features to localize action segments in untrimmed videos, given only action category labels. Most previous methods detect action segments from temporally local features and employ either early fusion or late fusion to combine the knowledge of the two feature modalities. However, temporally local features generally lead to incomplete detection results, and these fusion schemes cannot fully use the complementary information between modalities. In this paper, we propose the separately guided context-aware network, which exploits global contexts and more fully leverages the information in both modalities to detect action segments. Specifically, we construct graphs that model the co-occurrence relations between frames in order to gather global contexts. To combine the complementary information of the two modalities, we propose the separately guided scheme, which uses two graphs for each feature modality to integrate the contexts revealed by the intra-modality and the cross-modality information respectively. This scheme enhances frame representations based on both modalities and facilitates the detection of action frames. We also present a co-occurrence relation learning strategy under weak supervision to better guide the graphs in gathering contexts. Extensive experiments on the THUMOS14 and ActivityNet datasets demonstrate the superior performance of the proposed method. In particular, at an IoU threshold of 0.5, the proposed method achieves a mean average precision of 39.1% on THUMOS14 and 42.0% on ActivityNet.
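The reported figures depend on the temporal IoU criterion, which the abstract does not spell out. A minimal sketch of how temporal IoU between a predicted and a ground-truth segment is usually computed follows. The function names, the (start, end) tuple layout, and the use of `>=` at the threshold are assumptions made for illustration and are not taken from the paper. A full mean average precision evaluation also requires matching class labels, ranking predictions by confidence, and matching each ground-truth segment at most once; none of that is shown here.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments, each given as (start, end) in seconds."""
    # Length of the overlap; zero when the segments are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    # Union length = sum of both lengths minus the overlap.
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def meets_threshold(pred, gt, threshold=0.5):
    """Hypothetical helper: does the prediction overlap enough to count as a hit?"""
    return temporal_iou(pred, gt) >= threshold


if __name__ == "__main__":
    # A prediction covering 2 s to 10 s against a ground truth of 0 s to 10 s.
    print(temporal_iou((2.0, 10.0), (0.0, 10.0)))
    print(meets_threshold((2.0, 10.0), (0.0, 10.0)))
```

Under this criterion, a prediction that captures only a short part of a long action falls below the 0.5 threshold, which is why incomplete detections lower the reported score.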