Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset

Yuchen Yang, Yingxuan Duan
2024-06-20
Abstract: A more robust and holistic language-video representation is key to pushing video understanding forward. Despite improvements in training strategies, the quality of language-video datasets has received less attention. Current text descriptions are plain and simple and focus only on visual content, which limits performance in real-world natural language video retrieval tasks, where queries are much more complex. This paper introduces a method to automatically enhance video-language datasets, making them more modality- and context-aware to meet more sophisticated representation learning needs, and thereby helping downstream tasks. Our multifaceted video captioning method captures entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed, correlated information from the text side to the video side for training. We also develop an agent-like strategy that uses language models to generate high-quality, factual textual descriptions, reducing human intervention and enabling scalability. The method's effectiveness in improving language-video representation is evaluated through text-video retrieval using the MSR-VTT dataset and several multi-modal retrieval models.
Multimedia, Computer Vision and Pattern Recognition, Information Retrieval
What problem does this paper attempt to address?
This paper attempts to address the problem that language-video representations used in video understanding are insufficiently robust and comprehensive. Specifically:

1. **Problems with existing datasets**: Current datasets have overly simplistic descriptions and focus only on visual information, leading to poor performance in practical applications such as natural language video retrieval.
2. **Insufficient multimodal information**: Existing methods mostly focus on the visual part of the video, neglecting other important multimodal information such as audio and emotional cues.
3. **Semantic alignment issues**: The semantic correlation between text descriptions and video content is poor, especially in non-instructional videos.

To tackle these issues, the paper proposes a method to automatically enhance video-language datasets, making them more modality- and context-aware, thereby improving representation learning and aiding downstream tasks. The method uses multifaceted video captioning to capture entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed, correlated information from the text side to the video side. Additionally, an agent-like strategy based on language models is developed to generate high-quality, factual text descriptions, reducing human intervention and enabling scalability. The paper evaluates the method through text-video retrieval using the MSR-VTT dataset and several multimodal retrieval models.
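
To make the idea of multifaceted captions concrete, the sketch below shows one way the facets named in the abstract (entities, actions, speech transcript, aesthetics, emotional cues) might be collected per video and merged into a single prompt for a language model. This is only an illustration under stated assumptions: the class `FacetedCaption`, the function `build_prompt`, the field names, and the prompt wording are all hypothetical and are not taken from the paper, which does not specify its prompts or data layout here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FacetedCaption:
    """Per-video facets listed in the abstract. Field names are hypothetical."""
    entities: list[str] = field(default_factory=list)
    actions: list[str] = field(default_factory=list)
    speech_transcript: str = ""
    aesthetics: str = ""
    emotional_cues: str = ""


def build_prompt(original_caption: str, facets: FacetedCaption) -> str:
    """Merge the original caption and the non-empty facets into one prompt.

    Empty facets are skipped so the language model is not invited to
    invent content for a modality that produced nothing.
    """
    lines = [f"Original caption: {original_caption}"]
    if facets.entities:
        lines.append("Entities: " + ", ".join(facets.entities))
    if facets.actions:
        lines.append("Actions: " + ", ".join(facets.actions))
    if facets.speech_transcript:
        lines.append("Speech: " + facets.speech_transcript)
    if facets.aesthetics:
        lines.append("Aesthetics: " + facets.aesthetics)
    if facets.emotional_cues:
        lines.append("Emotional cues: " + facets.emotional_cues)
    # Assumed instruction text; the paper's actual prompt is not given here.
    lines.append("Write one detailed description using only the facts above.")
    return "\n".join(lines)


if __name__ == "__main__":
    facets = FacetedCaption(
        entities=["man", "guitar"],
        actions=["playing"],
        emotional_cues="calm",
    )
    print(build_prompt("a man plays guitar", facets))
```

Skipping empty facets is a design choice made for this sketch: it keeps the prompt limited to what was actually extracted, in line with the abstract's emphasis on factual descriptions.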