Multi-Modal Inductive Framework for Text-Video Retrieval

Qian Li,Yucheng Zhou,Cheng Ji,Feihong Lu,Jianian Gong,Shangguang Wang,Jianxin Li
DOI: https://doi.org/10.1145/3664647.3681024
2024-01-01
Abstract:Text-video retrieval (TVR) identifies relevant videos based on textual queries. Existing methods are limited by their ability to understand and connect different modalities, resulting in increased difficulty in retrievals. In this paper, we propose a generation-based TVR paradigm facilitated by LLM distillation to better learn and capture deep retrieval knowledge for text-video retrieval, amidsting the rapid evolution of Large Language Models. Specifically, we first design the fine-tuning large vision-language model that leverages the knowledge learned from language models to enhance the alignment of semantic information between the text and video modalities. It also incorporates an inductive reasoning mechanism, which focuses on incorporating important temporal and spatial features into the video embeddings. We further design question prompt clustering to select the most important prompts, considering their contribution to improving retrieval performance. Experimental results show that our approach achieves excellent performance on two benchmark datasets compared to its competitors.
What problem does this paper attempt to address?