LLM4VG: Large Language Models Evaluation for Video Grounding

Wei Feng, Xin Wang, Hong Chen, Zeyang Zhang, Houlun Chen, Zihan Song, Yuwei Zhou, Yuekui Yang, Haiyang Wu, Wenwu Zhu
2024-09-12
Abstract: Recently, researchers have attempted to investigate the capability of LLMs in handling videos and have proposed several video LLMs. However, the ability of LLMs to handle video grounding (VG), an important time-related video task that requires the model to precisely locate the start and end timestamps of the temporal moments in a video that match a given textual query, remains unclear and unexplored in the literature. To fill this gap, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks. Based on LLM4VG, we design extensive experiments to examine two groups of models on video grounding: (i) video LLMs trained on text-video pairs (denoted as VidLLMs), and (ii) LLMs combined with pretrained visual description models, such as video or image captioning models. We propose prompt methods that integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, and prompt designs. Our experimental evaluations lead to two conclusions: (i) existing VidLLMs are still far from achieving satisfactory video grounding performance, and more time-related video tasks should be included to further fine-tune these models; and (ii) the combination of LLMs and visual models shows preliminary abilities for video grounding, with considerable potential for improvement through more reliable models and further guidance from prompt instructions.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses an open question: how well large language models (LLMs) can handle video grounding (VG). The task requires a model to locate the start and end timestamps of the segment in a video that matches a given text query. Although LLMs have achieved significant success on many tasks, their ability to handle video tasks that demand precise temporal boundary localization has not been systematically studied. To fill this gap, the authors propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding. Their experiments examine two groups of models: video LLMs trained directly on text-video pairs (referred to as VidLLMs), and LLMs combined with pretrained visual description models such as video or image captioning models. For the second group, the authors propose prompting methods that integrate the VG instruction with descriptive information from different types of generators: direct visual descriptions from caption-based generators and enhanced information from visual question answering (VQA)-based generators. The experiments compare various VidLLMs and also explore the impact of different choices of visual models, LLMs, and prompt designs. Overall, the benchmark is intended to evaluate and analyze LLM performance on video grounding and to provide a foundation for further research and development.
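To make the caption-plus-LLM setup more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that a captioning model has already produced timestamped descriptions, and it shows one plausible way to fold them into a grounding prompt. The function names, the prompt wording, and the `(timestamp, caption)` input format are all hypothetical. The sketch also includes temporal intersection-over-union (IoU), a measure commonly used to score predicted segments in video grounding; the summary above does not state which metric LLM4VG uses, so its inclusion here is an assumption.

```python
from typing import List, Tuple


def build_vg_prompt(captions: List[Tuple[float, str]], query: str) -> str:
    """Combine timestamped descriptions with a video grounding instruction.

    Hypothetical prompt layout; the paper's actual prompts may differ.
    """
    lines = [f"{t:.1f}s: {text}" for t, text in captions]
    return (
        "Below are descriptions of a video at different timestamps.\n"
        + "\n".join(lines)
        + f"\nQuery: {query}\n"
        + "Give the start and end time (in seconds) of the moment "
        + "that matches the query."
    )


def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Illustrative, made-up captions; not data from the paper.
    captions = [(0.0, "a man enters the kitchen"), (4.0, "he opens the fridge")]
    print(build_vg_prompt(captions, "the person opens the fridge"))
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

In this sketch the LLM's reply would still need to be parsed into a `(start, end)` pair before scoring; that step depends on the model's output format and is left out.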