Learning Semantics-Grounded Vocabulary Representation for Video-Text Retrieval

Yaya Shi, Haowei Liu, Haiyang Xu, Zongyang Ma, Qinghao Ye, Anwen Hu, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu, Zheng-Jun Zha
DOI: https://doi.org/10.1145/3581783.3612537
2023-01-01
Abstract: Previous dual-encoder pre-training methods for video-text retrieval employ contrastive learning for cross-modal alignment in a latent space. However, such learned latent spaces often suffer from the modality gap problem [26]. In this paper, we introduce SemVTR, a novel framework designed to learn semantics-grounded video-text representations in a vocabulary space, in which each dimension corresponds to a semantic concept represented by a word. The representation is obtained by grounding video and text into semantically related dimensions with high activation values. Because paired videos and texts share grounded dimensions, their vocabulary representations are expected to cluster together, which alleviates the modality gap problem. The crux of our method therefore lies in grounding video and text into the vocabulary space. Specifically, we propose a Multi-Granularity Video Semantics Grounding approach and a Textual Semantics Preserving training strategy. Visualizations illustrate that SemVTR obtains semantics-grounded vocabulary representations and alleviates the modality gap problem. SemVTR significantly outperforms existing methods on four video-text retrieval benchmarks.
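The idea of a vocabulary space can be illustrated with a toy sketch. It is not the SemVTR implementation: the six-word vocabulary, the bag-of-words text grounding, the hand-written video concept scores, the 0.5 threshold, and all function names are assumptions made only for illustration. It shows how a video and a text that activate the same word dimensions end up with similar vectors, while an unrelated pair shares no dimensions.

```python
from math import sqrt

# Hypothetical toy vocabulary: each index is one dimension of the vocabulary space.
VOCAB = ["dog", "run", "park", "cook", "kitchen", "pasta"]


def ground_text(caption):
    """Activate the dimensions whose word occurs in the caption.

    A bag-of-words stand-in for a learned text encoder (assumption).
    """
    tokens = caption.lower().split()
    return [1.0 if word in tokens else 0.0 for word in VOCAB]


def ground_video(concept_scores, threshold=0.5):
    """Keep only strongly activated concepts and zero out the rest.

    `concept_scores` maps a word to a score in [0, 1]; in a real system these
    would come from a video encoder, here they are written by hand.
    """
    vector = []
    for word in VOCAB:
        score = concept_scores.get(word, 0.0)
        vector.append(score if score >= threshold else 0.0)
    return vector


def cosine(u, v):
    """Cosine similarity between two vocabulary vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(caption, videos):
    """Return the name of the video whose vocabulary vector best matches the caption."""
    query = ground_text(caption)
    return max(videos, key=lambda name: cosine(query, videos[name]))


videos = {
    "clip_dog": ground_video({"dog": 0.9, "run": 0.8, "park": 0.7, "kitchen": 0.1}),
    "clip_cook": ground_video({"cook": 0.9, "kitchen": 0.8, "pasta": 0.6, "dog": 0.2}),
}
best = retrieve("a dog will run in the park", videos)
```

Because every dimension is a word, the nonzero entries of each vector can be read off directly, which is the interpretability property the abstract attributes to vocabulary representations.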