TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou
2024-03-28
Abstract: This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally, we construct an instruction-tuning dataset, encompassing 6 tasks and a total of 125K instances, to further enhance TimeChat's instruction-following performance. Experiment results across various video understanding tasks, such as dense captioning, temporal grounding, and highlight detection, demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example, it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2, +5.8 HIT@1 on QVHighlights, and +27.5 R@1 (IoU=0.5) on Charades-STA, compared to state-of-the-art video large language models, holding the potential to serve as a versatile video assistant for long-form video comprehension tasks and satisfy realistic user requirements.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
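The abstract reports temporal grounding results as R@1 (IoU=0.5). As a rough illustration of how such a figure is commonly computed, the sketch below scores one top-ranked predicted segment per query against its ground-truth segment. The function names, the `(start, end)` tuple layout, and the percentage scaling are assumptions made for this example, not the paper's evaluation code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical layout


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Intersection over union of two time intervals, in [0, 1]."""
    # Overlap is clipped at zero when the intervals do not intersect.
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds: List[Segment], golds: List[Segment],
                iou_threshold: float = 0.5) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if not golds:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(preds, golds))
    return 100.0 * hits / len(golds)
```

For instance, a prediction of (0, 10) seconds against a ground truth of (2, 10) seconds overlaps for 8 seconds out of a 10-second union, so its IoU of 0.8 counts as a hit at the 0.5 threshold.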
What problem does this paper attempt to address?
The paper attempts to address the limitations of existing Video Large Language Models (VidLLMs) in handling time-sensitive tasks in long video understanding. Specifically, existing models fall short in the following areas:

1. **Inaccurate Timestamp Association**: Existing models fail to associate important events in the video with accurate timestamps, resulting in low accuracy when locating and describing meaningful events in unedited long videos.
2. **Fixed Compression Rate Leading to Semantic Degradation**: Existing models typically compress video frames into a fixed number of visual tokens, which leads to severe spatiotemporal semantic degradation when processing long videos.
3. **Separate Processing of Visual and Temporal Information**: Existing models process visual information and timestamp information separately and lack an explicit temporal-visual association, so they cannot localize timestamps accurately.

To address these issues, the paper proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding. TimeChat strengthens the association between temporal and visual information through two key modules:

1. **Timestamp-Aware Frame Encoder**: Explicitly binds the visual content of each frame with its timestamp description, thereby improving the accuracy of temporal localization.
2. **Sliding Video Q-Former**: Uses a sliding window to generate video token sequences whose length varies with the input, adapting to videos of different durations and preserving important visual semantics in long videos.

Additionally, the paper constructs a time-sensitive instruction-tuning dataset, TimeIT, containing 6 tasks and 125K instances, to further enhance TimeChat's instruction-following performance. Experimental results show that TimeChat outperforms existing VidLLMs on multiple video understanding tasks in the zero-shot setting (for example, +27.5 R@1 at IoU=0.5 on Charades-STA and +5.8 HIT@1 on QVHighlights), demonstrating its capability and versatility in long video understanding.
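To make the contrast between a fixed token budget and a sliding window concrete, here is a minimal sketch of how the number of video tokens could grow with video length. It only counts windows and tokens and does not implement a Q-Former. The window length, stride, tokens-per-window value, and all function names are illustrative assumptions that do not come from the source.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int,
                    stride: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame index ranges; trailing windows may be shorter."""
    if num_frames <= 0 or window <= 0 or stride <= 0:
        raise ValueError("num_frames, window and stride must be positive")
    return [(start, min(start + window, num_frames))
            for start in range(0, num_frames, stride)]


def video_token_count(num_frames: int, window: int, stride: int,
                      tokens_per_window: int) -> int:
    """Total tokens if each window is compressed to a fixed number of tokens."""
    return len(sliding_windows(num_frames, window, stride)) * tokens_per_window


# Hypothetical settings: 32-frame windows, stride 32, 32 tokens per window.
short_video = video_token_count(96, window=32, stride=32, tokens_per_window=32)
long_video = video_token_count(192, window=32, stride=32, tokens_per_window=32)
```

Under these assumed settings the 96-frame input yields 3 windows and the 192-frame input yields 6, so the token sequence doubles with the video length, whereas a fixed-budget compressor would emit the same number of tokens for both.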