Abstract: Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. To handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend of employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation and cannot model the clear structure inherent in videos, which restricts their effectiveness on VTG tasks. To address this issue, this paper first formally introduces a causal event modeling framework, which represents videos as sequences of events and predicts the current event using previous events, video inputs, and textual instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM called TRACE to implement the causal event modeling framework in practice. TRACE processes visual frames, timestamps, salient scores, and text as distinct tasks, employing separate encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework's formulation. Extensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are available at https://github.com/gyxxyg/TRACE.
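
The causal formulation described in the abstract can be written out as a worked factorization. This is a sketch under assumed notation, not a quotation of the paper's equations: \(F\) denotes the video frames, \(I\) the textual instruction, and \(e_k = (t_k, s_k, c_k)\) the \(k\)-th event with timestamps \(t_k\), salient score \(s_k\), and caption \(c_k\). The order of the three factors in the second line is an assumption that follows the order in which the abstract lists the components; both lines are plain applications of the chain rule of probability.

\[ P(e_1, \dots, e_K \mid I, F) = \prod_{k=1}^{K} P(e_k \mid e_{1:k-1}, I, F) \]

\[ P(e_k \mid e_{1:k-1}, I, F) = P(t_k \mid e_{1:k-1}, I, F)\, P(s_k \mid t_k, e_{1:k-1}, I, F)\, P(c_k \mid t_k, s_k, e_{1:k-1}, I, F) \]

Read this way, each event is predicted only from earlier events, the frames, and the instruction, which is what makes the modeling causal.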

What problem does this paper attempt to address?

This paper addresses a key problem in the Video Temporal Grounding (VTG) task: how to effectively model the inherent structure of videos. Current methods based on video large language models (video LLMs) rely mainly on natural language generation and cannot model the clear structure of videos, which limits their performance on VTG tasks.

### Main Problems

1. **Mismatch between Video Structure and Language Models**: Existing video LLM methods do not fully account for the time-series characteristics of videos, which leads to poor performance on VTG tasks.
2. **Insufficient Multitasking Ability**: Existing methods struggle to handle multiple related tasks simultaneously and perform poorly in zero-shot prediction.

### Solutions

To solve these problems, the paper proposes a causal event modeling framework, which represents a video as a series of events, each containing a timestamp, a saliency score, and a text description. On this basis, it designs a new task-interleaved video LLM, TRACE, to implement the causal event modeling framework.

### Specific Improvements

- **Causal Event Modeling Framework**: The video is decomposed into a series of ordered events, each consisting of a timestamp, a saliency score, and a text description. This structured representation lets the model better capture the temporal order and content changes of the video.

  \[ V = \{e_1, e_2, \cdots, e_K\} = \{(t_k, s_k, c_k) \mid 1 \leq k \leq K\} \]

  where \(t_k\) is the timestamp, \(s_k\) is the saliency score, and \(c_k\) is the text description.
- **TRACE Model**: TRACE implements the causal event modeling framework through separate multitasking, task-interleaved sequence modeling, and an adaptive head-switching mechanism. Specifically:
  - **Separate Multitasking**: TRACE uses different encoders and decoder heads to process visual frames, timestamps, scores, and text respectively.
  - **Task-Interleaved Sequence Modeling**: Task tokens are arranged in an interleaved order to reflect the temporal order of the video.
  - **Adaptive Head-Switching Mechanism**: During generation, TRACE selects the appropriate decoder head according to the previously decoded tokens.

### Experimental Results

The experimental results show that TRACE outperforms existing video LLM methods on multiple VTG tasks, especially in zero-shot prediction. For example, on the Youcook2 dataset, the CIDEr and F1 Score of TRACE increase by 3.1% and 4.9% respectively; on the Charades-STA dataset, Recall (IOU = 0.5) increases by 6.5% and Recall (IOU = 0.7) increases by 3.7%; on the QVHighlights dataset, mAP and HIT@1 increase by 10.3% and 9.2% respectively.

### Summary

By introducing the causal event modeling framework and the TRACE model, this paper addresses the inability of video LLMs to handle the inherent structure of videos in VTG tasks, and significantly improves model performance on multiple tasks.
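
The event representation, the task-interleaved token order, and the head-switching rule summarized above can be illustrated with a small sketch. It is a toy under stated assumptions, not TRACE's actual implementation: the `Event` class, the task labels, the `<sync>` marker, and the functions `interleave` and `next_head` are hypothetical names introduced here, and real tokens would come from learned tokenizers and decoding heads rather than formatted strings.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical task labels and end-of-segment marker (assumptions for illustration).
TIME, SCORE, TEXT = "time", "score", "text"
SYNC = "<sync>"

@dataclass
class Event:
    start: float   # event start time in seconds
    end: float     # event end time in seconds
    score: float   # saliency score
    caption: str   # text description

def interleave(events: List[Event]) -> List[Tuple[str, str]]:
    """Flatten events into (task, token) pairs, ordered time -> score -> text."""
    seq: List[Tuple[str, str]] = []
    for ev in sorted(events, key=lambda e: e.start):  # keep temporal order
        seq += [(TIME, f"{ev.start:.1f}"), (TIME, f"{ev.end:.1f}"), (TIME, SYNC)]
        seq += [(SCORE, f"{ev.score:.1f}"), (SCORE, SYNC)]
        seq += [(TEXT, word) for word in ev.caption.split()]
        seq.append((TEXT, SYNC))
    return seq

# Which head takes over once the current head emits the sync marker.
NEXT_HEAD = {TIME: SCORE, SCORE: TEXT, TEXT: TIME}

def next_head(current_head: str, last_token: str) -> str:
    """Choose the decoding head for the next step from the last decoded token."""
    return NEXT_HEAD[current_head] if last_token == SYNC else current_head

events = [
    Event(12.0, 20.5, 3.5, "chop the onions"),
    Event(0.0, 10.0, 4.0, "heat the pan"),
]
sequence = interleave(events)

# Replay the sequence to show which head would decode each token.
head = TIME
for task, token in sequence:
    print(f"{head:>5} head -> {token}")
    head = next_head(head, token)
```

In this sketch the switch depends only on the previously decoded token, mirroring the description that the decoder head is chosen from what has already been generated; how TRACE actually signals the switch is not specified in the text above.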

TRACE: Temporal Grounding Video LLM via Causal Event Modeling
