Number it: Temporal Grounding Videos like Flipping Manga

Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, Xu Yang

2024-11-16

Abstract: Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
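The abstract reports moment-retrieval gains in mIoU. As a rough illustration of that metric, the sketch below computes the temporal intersection-over-union between a predicted and a ground-truth interval and averages it over a set of queries. The function names and the `(start, end)` tuple representation are assumptions made for this example; the paper's own evaluation code may differ in detail.

```python
from typing import List, Tuple

# An interval is (start, end) in seconds or frames; the unit only needs
# to be the same for predictions and ground truth.
Interval = Tuple[float, float]


def temporal_iou(pred: Interval, gt: Interval) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    # Overlap length, clamped at zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds: List[Interval], gts: List[Interval]) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)


if __name__ == "__main__":
    # Hypothetical predictions and annotations, for illustration only.
    predictions = [(0.0, 10.0), (0.0, 10.0)]
    ground_truth = [(0.0, 10.0), (5.0, 15.0)]
    print(f"mIoU = {mean_iou(predictions, ground_truth):.3f}")
```

A prediction that matches the annotation exactly scores 1, a disjoint one scores 0, and partial overlaps fall in between, so mIoU rewards accurate boundaries and not only a roughly correct location.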

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of identifying precise timestamps in Video Temporal Grounding (VTG). Although existing Video Large Language Models (Vid-LLMs) have made remarkable progress in understanding video content, they have difficulty turning that visual understanding into specific temporal information, i.e., the time or frame number at which an event occurs. As a result, they are prone to errors when answering time-related questions. For example, when asked "When does the woman start eating?", a model may wrongly answer "From frame 000 to frame 580". This limitation arises because these models are mainly trained to align visual content with language descriptions (what happened) and lack a mechanism for directly interpreting temporal boundaries (when it happened).

To overcome this challenge, the paper proposes a method named Number-Prompt (NumPro). By adding a unique numerical identifier to each video frame, it enables Vid-LLMs to combine visual understanding with temporal localization. This turns the VTG task into a process similar to flipping through manga panels: the model "reads" the timeline of events and associates visual content with the corresponding frame numbers. According to the paper, NumPro improves VTG performance without additional computational cost, and fine-tuning on a NumPro-enhanced dataset further advances the state of the art for VTG.
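Once frames carry visible numbers, a model's answer can be expressed as a frame range and mapped back to time. The sketch below illustrates that post-processing step under stated assumptions: the answer format ("From frame A to frame B"), the helper names, and the sampling rate `sample_fps` are hypothetical choices for this example and are not taken from the paper.

```python
import re
from typing import Optional, Tuple


def parse_frame_span(answer: str) -> Optional[Tuple[int, int]]:
    """Extract the first two integers in a model answer as (start, end) frames.

    Assumes an answer such as "From frame 12 to frame 58"; returns None
    when fewer than two numbers are present.
    """
    numbers = [int(n) for n in re.findall(r"\d+", answer)]
    if len(numbers) < 2:
        return None
    start, end = numbers[0], numbers[1]
    # Swap if the model reports the boundaries in reverse order.
    if start > end:
        start, end = end, start
    return start, end


def frames_to_seconds(span: Tuple[int, int], sample_fps: float) -> Tuple[float, float]:
    """Convert sampled-frame indices to seconds.

    sample_fps is the rate at which frames were sampled from the video
    (an assumed parameter), not necessarily the native frame rate.
    """
    if sample_fps <= 0:
        raise ValueError("sample_fps must be positive")
    return span[0] / sample_fps, span[1] / sample_fps


if __name__ == "__main__":
    span = parse_frame_span("From frame 12 to frame 58")
    if span is not None:
        print(span, frames_to_seconds(span, sample_fps=2.0))
```

Keeping the parsing separate from the unit conversion means the same frame range can be evaluated either in frames or in seconds, depending on how a benchmark stores its annotations.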
