Abstract: Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives. To tackle this issue, we propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation as well as cross-modal alignment, facilitating accurate localization of target moments. Specifically, LMR introduces a context enhancement technique with LLMs to generate crucial target-related context semantics. These semantics are integrated with visual features for producing discriminative video representations. Finally, a language-conditioned transformer is designed to decode free-form language queries, on the fly, using aligned video representations for moment retrieval. Extensive experiments demonstrate that LMR achieves state-of-the-art results, outperforming the nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights and Charades-STA benchmarks, respectively. More importantly, the performance gains are significantly higher for localization of complex queries.

What problem does this paper attempt to address?

This paper addresses the problem of localizing complex situations in video moment retrieval (VMR). Current VMR methods struggle to align video and text when a query involves specific environmental details, character descriptions, and action narratives. To overcome this, the authors propose an LLM-guided moment retrieval method (LMR), which uses the extensive knowledge of large language models (LLMs) to enhance video context representation and cross-modal alignment, and thereby localize the target moment accurately.

### Main Contributions

1. **Proposing the LLM-guided moment retrieval method**: The method combines visual content with target-related context information extracted from LLMs for moment retrieval. In addition, the authors construct the C-QVal dataset of complex queries and use it to demonstrate the effectiveness of LMR on complex scenes.
2. **Designing the language-conditioned transformer**: This transformer decodes free-form language queries on the fly and uses aligned video representations for moment retrieval.
3. **Experimental verification**: Extensive experiments show that the method achieves state-of-the-art performance on popular benchmarks, outperforming the nearest competitor by up to 3.28% on QVHighlights and up to 4.06% on Charades-STA.

### Method Overview

1. **Visual and text encoders**: CLIP and SlowFast serve as visual backbones to extract visual features, and CLIP-Text serves as the text backbone to extract language features.
2. **LLM-guided visual context enhancement**: Pre-trained multimodal large language models (MLLMs) generate multi-view textual descriptions of video moments, from which target-related context information is extracted to enrich the contextual semantic representation of the video.
3. **Video context modeling**: Target-related visual features are combined with the LLM-generated context information to form a comprehensive video representation. A randomly initialized saliency token is introduced to predict the correlation score between the video and the text query.
4. **Language-conditioned transformer**: The model adopts the encoder-decoder structure of the DETR architecture, feeds a small number of moment queries to the decoder, and constrains all of them to target only the queried moment. Through the attention mechanism, the decoder learns the relationship between the video representation and the language query.
5. **Localization and loss functions**: The temporal bounding box of the target moment is predicted from language-conditioned visual features. L1 loss, generalized IoU loss, and cross-entropy loss measure the difference between the prediction and the ground truth. A cross-sample supervised contrastive learning loss is also introduced.

### Experimental Results

- **QVHighlights dataset**: LMR outperforms existing methods on multiple evaluation metrics, especially under stricter IoU thresholds. For example, it improves R1@0.7 by 2.23% and mAP@0.75 by 3.28%.
- **Charades-STA dataset**: LMR improves the R1 metrics, gaining 3.14% on R1@0.5 and 4.06% on R1@0.7.

### Conclusion

By leveraging the extensive knowledge of large language models, LMR performs well on complex video moment retrieval tasks and improves the localization accuracy of the target moment. These results support the effectiveness of LLMs in enhancing video context modeling and improving cross-modal alignment.
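The span losses and the R1@IoU metrics mentioned in the summary can be illustrated with a minimal sketch. It assumes moments are plain `(start, end)` pairs in seconds, and the function names and loss weights (`w_l1`, `w_giou`) are hypothetical. The summary does not give the paper's exact formulation or weighting, and the cross-entropy and contrastive terms are omitted here.

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (start, end) in seconds; an assumed representation


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two 1D temporal spans."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def generalized_iou(pred: Span, gt: Span) -> float:
    """1D generalized IoU: IoU minus the uncovered fraction of the enclosing span."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    hull = max(pred[1], gt[1]) - min(pred[0], gt[0])
    if union <= 0 or hull <= 0:
        return 0.0
    return inter / union - (hull - union) / hull


def span_loss(pred: Span, gt: Span, w_l1: float = 1.0, w_giou: float = 1.0) -> float:
    """Weighted sum of an L1 term on the boundaries and a (1 - gIoU) term.

    The weights are illustrative placeholders, not values from the paper.
    """
    l1 = abs(pred[0] - gt[0]) + abs(pred[1] - gt[1])
    return w_l1 * l1 + w_giou * (1.0 - generalized_iou(pred, gt))


def recall_at_1(top1_preds: List[Span], gts: List[Span], iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(
        temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts)
    )
    return 100.0 * hits / len(gts)
```

Under these assumptions, R1@0.7 corresponds to `recall_at_1(preds, gts, 0.7)`. The gIoU term still provides a training signal when the predicted and ground-truth spans do not overlap, which plain IoU cannot do.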

Context-Enhanced Video Moment Retrieval with Large Language Models
