Debiased Video-Text Retrieval via Soft Positive Sample Calibration

Huaiwen Zhang, Yang Yang, Fan Qi, Shengsheng Qian, Changsheng Xu
DOI: https://doi.org/10.1109/tcsvt.2023.3248873
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: With the enormous number of videos now available on video apps, semantic video-text retrieval has become a critical task for improving the user experience. The primary paradigm for video-text retrieval learns semantic video-text representations in a common space by pulling positive samples close to the query and pushing negative samples away. In practice, however, video-text datasets contain annotations only for positive samples, and negative samples are drawn at random from the entire dataset. Some of these may be soft positive samples: samples that are drawn as negatives but share the same semantics as the positive samples. Indiscriminately forcing the model to push all negative samples away from the query yields inaccurate supervision and misleads video-text feature representation learning. In this paper, we introduce debiased video-text retrieval objectives that calibrate the penalty applied to soft positive samples. In particular, we propose a novel uncertainty measure framework to estimate the reliability of the negative samples for each instance. This reliability is then used to find the soft positive samples and rescale their contribution within video-text retrieval losses, including triplet loss and contrastive loss. Experimental results on five widely used datasets demonstrate that our debiased video-text retrieval objectives achieve significant performance improvements and establish a new state-of-the-art.
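The rescaling idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the uncertainty measure is computed or exactly how the weights enter each loss, so everything below is an assumption: the per-negative `reliability` weights in [0, 1] are taken as given inputs (low values marking likely soft positives), and the function names, the `temperature` and `margin` defaults, and the weighting scheme are hypothetical rather than the authors' formulation.

```python
import math


def weighted_contrastive_loss(sims, pos_index, reliability, temperature=0.07):
    """InfoNCE-style loss for one query with per-negative weights.

    sims        : similarity of the query to every candidate
    pos_index   : index of the annotated positive in `sims`
    reliability : assumed weight in [0, 1] per candidate; a low value means
                  the "negative" is likely a soft positive. The entry at
                  pos_index is ignored.
    """
    logits = [s / temperature for s in sims]
    shift = max(logits)  # subtract the max for numerical stability
    pos_term = math.exp(logits[pos_index] - shift)
    neg_term = sum(
        w * math.exp(logit - shift)
        for i, (logit, w) in enumerate(zip(logits, reliability))
        if i != pos_index
    )
    return -math.log(pos_term / (pos_term + neg_term))


def weighted_triplet_loss(pos_sim, neg_sims, reliability, margin=0.2):
    """Sum of hinge terms, each scaled by the negative's assumed reliability."""
    return sum(
        w * max(0.0, margin - pos_sim + neg_sim)
        for neg_sim, w in zip(neg_sims, reliability)
    )


# Candidate 1 is semantically close to the positive (a likely soft positive).
sims = [0.90, 0.85, 0.10]
plain = weighted_contrastive_loss(sims, 0, [1.0, 1.0, 1.0])
calibrated = weighted_contrastive_loss(sims, 0, [1.0, 0.1, 1.0])
```

With all weights set to 1, both functions reduce to the usual unweighted losses. Lowering the weight of the suspected soft positive shrinks its share of the penalty, so `calibrated` is smaller than `plain`.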
engineering, electrical & electronic