HaVTR: Improving Video-Text Retrieval Through Augmentation Using Large Foundation Models

Yimu Wang, Shuai Yuan, Xiangru Jian, Wei Pang, Mushi Wang, Ning Yu
2024-04-08
Abstract: While recent progress in video-text retrieval has been driven by the exploration of powerful model architectures and training strategies, the representation learning ability of video-text retrieval models is still limited due to low-quality and scarce training data annotations. To address this issue, we present a novel video-text learning paradigm, HaVTR, which augments video and text data to learn more generalized features. Specifically, we first adopt a simple augmentation method, which generates self-similar data by randomly duplicating or dropping subwords and frames. In addition, inspired by the recent advancement in visual and language generative models, we propose a more powerful augmentation method through textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). Further, to bring richer information into video and text, we propose a hallucination-based augmentation method, where we use LLMs and VGMs to generate and add new relevant information to the original data. Benefiting from the enriched data, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of HaVTR over existing methods.
Computer Vision and Pattern Recognition, Computation and Language, Information Retrieval, Machine Learning
What problem does this paper attempt to address?
The paper addresses a limitation of video-text retrieval: the representation learning ability of retrieval models is constrained by low-quality and scarce training data annotations. To address this, it proposes HaVTR, a learning paradigm that uses large language models (LLMs) and visual generative models (VGMs) to augment video and text data, with the aim of improving the matching between videos and texts and thereby retrieval performance. HaVTR comprises three augmentation methods: simple augmentation, which randomly duplicates or drops subwords and frames; textual paraphrasing combined with video stylization; and hallucination-based augmentation, which generates new relevant information and adds it to the original data. According to the authors, experiments on several video-text retrieval benchmarks show that HaVTR outperforms existing methods.
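As an illustration of the simple augmentation described above (randomly duplicating or dropping subwords and frames), the following is a minimal sketch and not the authors' implementation. The function name `self_similar_augment` and the probabilities `p_dup` and `p_drop` are hypothetical; the abstract does not state the rates or any other details used in the paper. The same routine is applied to a list of subword tokens or to a list of frame indices.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def self_similar_augment(
    items: Sequence[T],
    p_dup: float = 0.1,   # assumed duplication probability (not from the paper)
    p_drop: float = 0.1,  # assumed drop probability (not from the paper)
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return a copy of `items` with elements randomly dropped or duplicated.

    Order is preserved, so the output stays "self-similar" to the input.
    """
    if rng is None:
        rng = random.Random()
    out: List[T] = []
    for item in items:
        r = rng.random()  # uniform in [0.0, 1.0)
        if r < p_drop:
            continue  # drop this subword or frame
        out.append(item)
        if r < p_drop + p_dup:
            out.append(item)  # duplicate this subword or frame
    # Avoid returning an empty sequence for a non-empty input.
    if not out and len(items) > 0:
        out.append(rng.choice(list(items)))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical inputs: subword tokens of a caption and sampled frame indices.
    tokens = ["a", "man", "is", "play", "##ing", "guitar"]
    frames = list(range(12))
    print(self_similar_augment(tokens, rng=rng))
    print(self_similar_augment(frames, rng=rng))
```

Using one generic function for both modalities keeps the sketch short. A real pipeline would likely apply it to tokenizer output on the text side and to decoded frame tensors on the video side, and might pad or truncate the result to a fixed length.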