Regularized Two Granularity Loss Function for Weakly Supervised Video Moment Retrieval

Junya Teng, Xiankai Lu, Yongshun Gong, Xinfang Liu, Xiushan Nie, Yilong Yin
DOI: https://doi.org/10.1109/tmm.2021.3120545
IF: 7.3
2022-01-01
IEEE Transactions on Multimedia
Abstract: Weakly supervised video moment retrieval, also called weakly supervised language moment retrieval, aims to find the moment in a video that is most relevant to a given language query. To guide the model toward the video segments that best match the text description, we design a two-granularity loss function that considers video-level and instance-level relationships simultaneously. Specifically, we first generate coarse video segments and regard each video segment as an instance. For the video-level regularized multiple instance learning (MIL) loss, we leverage the latent alignment between all intra-video segments (i.e., the positive bag) and the text descriptions. Then, we classify these segments by treating this procedure as a supervised learning task under noisy labels. With the instance-level regularized loss function, our model can learn to correct noisy instance-level labels and thus locate more accurate frame boundaries among all the positive instances. Comprehensive experimental results on ActivityNet and DiDeMo demonstrate that the proposed loss function sets a new state of the art.
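The abstract does not state the loss formulas, so the following is only an illustrative sketch of how a two-granularity objective of this general shape could be assembled; it is not the authors' formulation. It assumes a max-pooled hinge ranking loss for the video-level MIL term and substitutes a generic soft-bootstrapping cross-entropy for the instance-level term under noisy labels. All function names, the margin, `beta`, and `weight` are hypothetical.

```python
import math

def bag_score(segment_scores):
    # Video-level score of a bag: the best-matching segment (max pooling).
    return max(segment_scores)

def mil_ranking_loss(pos_scores, neg_scores, margin=0.5):
    # Hinge loss: the matched video's bag should outscore a mismatched
    # video's bag by at least `margin` (assumed value).
    return max(0.0, margin - bag_score(pos_scores) + bag_score(neg_scores))

def bootstrapped_bce(prob, noisy_label, beta=0.8):
    # Stand-in for label correction: the target blends the noisy label with
    # the model's own prediction (soft bootstrapping), then binary
    # cross-entropy is taken against that blended target.
    target = beta * noisy_label + (1.0 - beta) * prob
    eps = 1e-7
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def two_granularity_loss(pos_scores, neg_scores, seg_probs, seg_labels, weight=1.0):
    # Sum of the video-level term and the mean instance-level term.
    video_term = mil_ranking_loss(pos_scores, neg_scores)
    instance_term = sum(
        bootstrapped_bce(p, y) for p, y in zip(seg_probs, seg_labels)
    ) / len(seg_probs)
    return video_term + weight * instance_term
```

Here `pos_scores` would hold query-segment similarity scores for the paired video, `neg_scores` those for an unpaired video, and `seg_labels` the noisy 0/1 labels assigned to segments of the positive bag.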