Abstract: This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experimental results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, compared to state-of-the-art video large language models, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA. TimeChat therefore has the potential to serve as a versatile video assistant for long-form video comprehension tasks and to satisfy realistic user requirements.
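The R@1 (IoU=0.5) figure quoted for Charades-STA is a temporal grounding metric. The sketch below shows how such a score is commonly computed, assuming the usual definition: a top-1 predicted segment counts as a hit when its temporal intersection-over-union with the ground-truth segment reaches the threshold. The function names and the sample segments are illustrative and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two (start, end) segments given in seconds."""
    # Overlap is clamped at zero for disjoint segments.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction has IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Hypothetical predictions and ground truths for two queries.
preds = [(10.0, 20.0), (0.0, 10.0)]
gts = [(12.0, 20.0), (20.0, 30.0)]
score = recall_at_1(preds, gts)  # first query hits (IoU 0.8), second misses
```

Under this definition, a reported gain such as +27.5 would mean a larger share of queries whose predicted segment overlaps the annotated one by at least half.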

What problem does this paper attempt to address?

The paper addresses the limitations of existing Video Large Language Models (VidLLMs) on time-sensitive tasks in long video understanding. Specifically, existing models fall short in the following areas:

1. **Inaccurate Timestamp Association**: Existing models fail to associate important events in the video with accurate timestamps, resulting in low accuracy when locating and describing meaningful events in unedited long videos.
2. **Fixed Compression Rate Leading to Semantic Degradation**: Existing models typically compress video frames into a fixed number of visual tokens, which leads to severe spatiotemporal semantic degradation when processing long videos.
3. **Separate Processing of Visual and Temporal Information**: Existing models process visual information and timestamp information separately. Without an explicit temporal-visual association, they cannot accurately locate timestamps.

To address these issues, the paper proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. TimeChat strengthens the association between temporal and visual information through two key modules:

1. **Timestamp-Aware Frame Encoder**: Explicitly binds the visual content of each frame with its timestamp description, thereby improving the accuracy of temporal localization.
2. **Sliding Video Q-Former**: Generates video token sequences of different lengths through a sliding window, adapting to video inputs of varying durations and preserving important visual semantics in long videos.

Additionally, the paper constructs a time-sensitive instruction-tuning dataset, TimeIT, containing 6 tasks and 125K instances, to further enhance TimeChat's instruction-following performance. Experimental results show that TimeChat outperforms existing VidLLMs on multiple video understanding tasks, including dense captioning, temporal grounding, and highlight detection, demonstrating its capability and versatility in long video understanding.
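To make the sliding-window idea concrete, the sketch below shows why a sliding window yields a token sequence whose length grows with video duration, in contrast to a fixed token budget. It is only an illustration under stated assumptions: the parameter names (`window`, `stride`, `tokens_per_window`) and the example values are hypothetical, and the actual Q-Former computation is replaced by simple index bookkeeping.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame index ranges, with end exclusive.

    Assumption for this sketch: frames that do not fit a full stride
    step at the tail are simply not covered by any window.
    """
    if num_frames <= 0:
        return []
    starts = range(0, max(num_frames - window, 0) + 1, stride)
    return [(s, min(s + window, num_frames)) for s in starts]


def video_token_count(num_frames, window, stride, tokens_per_window):
    """Total tokens if each window is compressed to a fixed number of tokens."""
    return len(sliding_windows(num_frames, window, stride)) * tokens_per_window


# Hypothetical settings: 4-frame windows, stride 4, 32 tokens per window.
short_video = video_token_count(8, window=4, stride=4, tokens_per_window=32)
long_video = video_token_count(16, window=4, stride=4, tokens_per_window=32)
```

With these illustrative settings, doubling the number of sampled frames doubles the number of video tokens, so the compression rate per frame stays constant instead of increasing for longer videos.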

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning

TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

MovieChat: From Dense Token to Sparse Memory for Long Video Understanding

MovieChat+: Question-aware Sparse Memory for Long Video Question Answering

OmChat: A Recipe to Train Multimodal Language Models with Strong Long Context and Video Understanding

Streaming Long Video Understanding with Large Language Models

Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

VTimeLLM: Empower LLM to Grasp Video Moments

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models

ST-LLM: Large Language Models Are Effective Temporal Learners

VideoChat: Chat-Centric Video Understanding

LongVLM: Efficient Long Video Understanding via Large Language Models

ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning

VideoAgent: Long-form Video Understanding with Large Language Model as Agent