Dual Masked Modeling for Weakly-Supervised Temporal Boundary Discovery
Yuer Ma, Yi Liu, Limin Wang, Wenxiong Kang, Yu Qiao, Yali Wang
DOI: https://doi.org/10.1109/tmm.2023.3338084
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Discovering temporal boundaries is critical for untrimmed video tasks such as temporal sentence grounding and action detection. Because boundary annotation is labor-intensive, recent studies focus on the weakly-supervised setting, in which training videos are labeled only with sentences or action tags. However, aligning temporal boundaries with textual descriptions remains problematic in most weakly-supervised approaches. To alleviate this difficulty, we propose a novel Dual Masked Modeling (DM2) framework, which enhances clip-text alignment, and thereby temporal boundary discovery, through cross-modal masked modeling in a dual fashion. Specifically, after generating a temporal proposal for the underlying clip, we introduce two coupled reconstruction branches: Clip-Aware Masked Text Modeling (C-MTM) and Text-Aware Masked Clip Modeling (T-MCM). In C-MTM, we recover the masked text with visual assistance from the clip proposal. In T-MCM, we recover the masked clip proposal with linguistic assistance from the text. Through this complementary reconstruction supervision, DM2 cooperatively exploits robust matching between the video clip and the referred text, which allows grounding and localization to be unified in a concise manner. Finally, we perform extensive experiments on popular temporal benchmarks, i.e., Charades-STA, ActivityNet Captions, ActivityNet-v1.3 and THUMOS-14. DM2 achieves state-of-the-art performance on both weakly-supervised temporal grounding and localization. Code and models will be released afterward.
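The two masking steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the `[MASK]` placeholder, the masking ratio, zero-filling of masked frames, and the squared-error measure are all assumptions chosen for illustration, and the learned cross-modal reconstructors are omitted entirely.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_text(tokens, ratio, rng):
    """C-MTM-style input: hide a random subset of the sentence tokens."""
    k = min(len(tokens), max(1, round(len(tokens) * ratio)))
    positions = sorted(rng.sample(range(len(tokens)), k))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions


def mask_clip(frames, start, end, ratio, rng):
    """T-MCM-style input: zero out random frames inside proposal [start, end)."""
    span = range(start, end)
    k = min(len(span), max(1, round(len(span) * ratio)))
    positions = sorted(rng.sample(span, k))
    masked = [[0.0] * len(f) if i in positions else list(f)
              for i, f in enumerate(frames)]
    return masked, positions


def reconstruction_error(predicted, target, positions):
    """Mean squared error over the masked frames only (an assumed loss)."""
    total, count = 0.0, 0
    for i in positions:
        for p, t in zip(predicted[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


rng = random.Random(0)
tokens = "a person opens the fridge door".split()
frames = [[1.0, 1.0] for _ in range(8)]  # 8 frames, 2-dim toy features

masked_tokens, text_pos = mask_text(tokens, 0.5, rng)
masked_frames, clip_pos = mask_clip(frames, 2, 6, 0.5, rng)

# A real model would predict the hidden tokens from the clip proposal and the
# hidden frames from the sentence; here the unrecovered input is scored as-is.
clip_loss = reconstruction_error(masked_frames, frames, clip_pos)
```

In this sketch only frames inside the proposal are ever masked, so a poorly placed proposal would leave the text with little useful visual context; that is the intuition behind using reconstruction quality as a signal for clip-text alignment.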