Number it: Temporal Grounding Videos like Flipping Manga

Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, Xu Yang
2024-11-16
Abstract: Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses precise timestamp identification in Video Temporal Grounding (VTG). Although existing Video Large Language Models (Vid-LLMs) have made remarkable progress in understanding video content, they have difficulty turning that visual understanding into specific temporal information, i.e., the time or frame number at which an event occurs. As a result, they are prone to errors when answering time-related questions. For example, when asked "When does the woman start eating?", the model may wrongly answer "From frame 000 to frame 580". This limitation arises because these models are mainly trained to align visual content with language descriptions (what happened) and lack a mechanism for directly interpreting temporal boundaries (when it happened). To overcome this challenge, the paper proposes a method named Number-Prompt (NumPro). By adding a unique numerical identifier to each video frame, it enables Vid-LLMs to combine visual understanding with temporal localization. The method turns the VTG task into a process similar to flipping through manga panels: the model "reads" the timeline of events and associates visual content with the corresponding time information. NumPro thus improves VTG performance without additional computational cost. Moreover, fine-tuning on a NumPro-enhanced dataset further advances the state of the art in VTG.
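To make the link between frame numbers and time concrete, the following is a minimal sketch of how a frame-numbered answer could be turned into a time span and scored with temporal IoU, the quantity that mIoU averages over queries. It is not the authors' code. The function names, the answer format matched by the regular expression, and the assumption that frame n corresponds to n / sample_fps seconds are all illustrative assumptions.

```python
import re


def parse_frame_span(answer: str) -> tuple[int, int]:
    """Extract (start, end) frame numbers from an answer such as
    'From frame 12 to frame 30'. The answer format is an assumption."""
    match = re.search(r"frame\s+(\d+)\s+to\s+frame\s+(\d+)", answer, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"no frame span found in: {answer!r}")
    start, end = int(match.group(1)), int(match.group(2))
    # Return the span in ascending order even if the model swapped the ends.
    return (start, end) if start <= end else (end, start)


def frames_to_seconds(span: tuple[int, int], sample_fps: float) -> tuple[float, float]:
    """Map frame numbers to seconds, assuming frame n was sampled at n / sample_fps."""
    start, end = span
    return start / sample_fps, end / sample_fps


def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection over union of two time intervals given as (start, end)."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical example: frames sampled every 2 seconds (0.5 frames per second).
    predicted = frames_to_seconds(parse_frame_span("From frame 12 to frame 30"), sample_fps=0.5)
    ground_truth = (20.0, 50.0)
    print(predicted, temporal_iou(predicted, ground_truth))
```

Under these assumptions, a prediction that spans nearly the whole video, like the erroneous "From frame 000 to frame 580" answer quoted above, would receive a low IoU against a short ground-truth event, which is how imprecise boundaries show up in the metric.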