Dynamic Contrastive Learning with Pseudo-samples Intervention for Weakly Supervised Joint Video MR and HD

Shuhan Kong, Liang Li, Beichen Zhang, Wenyu Wang, Bin Jiang, Chenggang Yan, Changhao Xu
DOI: https://doi.org/10.1145/3581783.3612384
2023-01-01
Abstract: Joint video moment retrieval (MR) and highlight detection (HD) aims to find the video moments relevant to a query text. Existing methods are fully supervised and rely on manual annotation, and their coarse multi-modal interactions easily lose details of the video and text. In addition, some tasks introduce weakly supervised learning with random masks, but a single mask forces the model to focus on the masked words and ignore multi-modal contextual information. In view of this, we attempt the weakly supervised joint task (MR+HD) and propose Dynamic Contrastive Learning with Pseudo-Sample Intervention (CPI) for better multi-modal video comprehension. First, we design pseudo-samples over random masks for more efficient contrastive learning. We introduce a proportional sampling strategy for pseudo-samples to ensure a semantic difference between the pseudo-samples and the query text. This offsets the over-reliance on a single random mask with global text semantics and lets the model learn multi-modal context from each word fairly. Second, we design a dynamic intervention contrastive loss to dynamically enhance the model's core feature-matching ability. We add pseudo-sample intervention when negative proposals are close to positive proposals. This helps the model overcome the vision confusion phenomenon and match on semantic similarity rather than word similarity. Extensive experiments demonstrate the effectiveness of CPI and the potential of weakly supervised joint tasks.
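The two ideas in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract gives no formulas, so the function names, the mask token, the `mask_ratio`, `margin`, and `tau` parameters, and the exact form of both loss terms are assumptions chosen for illustration. The first function masks a fixed proportion of query words to build a pseudo-sample; the second adds a pseudo-sample term to a two-way contrastive loss only when the negative proposal scores close to the positive one.

```python
import math
import random


def make_pseudo_sample(words, mask_ratio=0.5, mask_token="<mask>", rng=None):
    """Mask a fixed proportion of the query words.

    Assumption: "proportional sampling" is read here as masking
    round(mask_ratio * len(words)) words, at least one.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(mask_ratio * len(words)))
    masked = set(rng.sample(range(len(words)), n_mask))
    return [mask_token if i in masked else w for i, w in enumerate(words)]


def _two_way_nll(a, b, tau):
    """-log softmax probability of score a against score b (stable form)."""
    return math.log1p(math.exp((b - a) / tau))


def dynamic_intervention_loss(pos, neg, pseudo, margin=0.2, tau=0.1):
    """Contrastive loss with a conditional pseudo-sample term.

    pos, neg: similarity of the query to the positive / negative proposal.
    pseudo:   similarity of the pseudo-sample to the positive proposal.
    Assumption: the intervention fires when pos - neg < margin, and asks
    the real query to match the positive proposal better than the
    pseudo-sample does.
    """
    loss = _two_way_nll(pos, neg, tau)
    intervened = (pos - neg) < margin
    if intervened:
        loss += _two_way_nll(pos, pseudo, tau)
    return loss, intervened


if __name__ == "__main__":
    query = "a man is cooking in the kitchen today".split()
    print(make_pseudo_sample(query, mask_ratio=0.25, rng=random.Random(0)))
    print(dynamic_intervention_loss(pos=0.9, neg=0.1, pseudo=0.5))  # well separated
    print(dynamic_intervention_loss(pos=0.6, neg=0.5, pseudo=0.3))  # close, so intervene
```

In this sketch the extra term is zero-cost when proposals are already well separated, so the pseudo-sample only influences training on confusable cases; whether the paper gates the term with a hard threshold or a soft weight is not stated in the abstract.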