Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu
2024-07-01
Abstract: Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the difficulty that large multimodal models (LMMs) have in understanding extremely long videos. Current LMMs perform well on single images and short videos, but the visual encoder generates so many visual tokens that existing models cannot effectively handle a large number of frames. The paper makes the following contributions:

1. **Long Context Transfer**: The study finds that the context length of the language model transfers directly to the multimodal model. Extending the context length of the language backbone improves the LMM's ability to handle long videos without any additional training on long videos.
2. **Visual Needle-in-a-Haystack Benchmark (V-NIAH)**: To evaluate how well LMMs locate and retrieve visual information in extremely long contexts, the paper proposes V-NIAH, a synthetic benchmark that extends the "needle-in-a-haystack" test used for language models to the vision modality.
3. **Long Video Assistant (LongVA)**: Using long context transfer and a unified encoding scheme (UniRes), the paper develops LongVA, a model capable of perceiving over 200,000 visual tokens. It achieves state-of-the-art performance on the Video-MME (among 7B-scale models) and MLVU datasets.

The paper shows that LongVA performs well on long video question-answering benchmarks and can handle video inputs with a large number of frames in practice, with performance improving as the number of sampled frames increases. Together, these contributions provide new methods and tools for understanding and processing long videos.
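To make the needle-in-a-haystack idea and the visual token budget concrete, here is a minimal Python sketch of how a V-NIAH-style sample might be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function `build_niah_sample`, the `NiahSample` type, the string frame placeholders, and the value of 144 visual tokens per frame are all hypothetical choices made for this example. The text above only states that LongVA handles 2000 frames, or over 200K visual tokens.

```python
from dataclasses import dataclass
from typing import List

# Assumed tokens per frame, chosen for illustration only; the text above
# states just "2000 frames or over 200K visual tokens".
TOKENS_PER_FRAME = 144


@dataclass
class NiahSample:
    frames: List[str]     # placeholder frame identifiers (haystack plus one needle)
    needle_index: int     # position of the needle frame in `frames`
    visual_tokens: int    # approximate number of visual tokens for the sample


def build_niah_sample(num_haystack_frames: int, depth: float,
                      tokens_per_frame: int = TOKENS_PER_FRAME) -> NiahSample:
    """Insert one needle frame into a haystack at a relative depth in [0, 1]."""
    if num_haystack_frames < 0:
        raise ValueError("num_haystack_frames must be non-negative")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    # depth=0.0 places the needle first, depth=1.0 places it last.
    needle_index = int(depth * num_haystack_frames)
    frames = [f"haystack_{i}" for i in range(num_haystack_frames)]
    frames.insert(needle_index, "needle")
    return NiahSample(frames, needle_index, len(frames) * tokens_per_frame)


# 1999 haystack frames plus one needle gives a 2000-frame input.
sample = build_niah_sample(num_haystack_frames=1999, depth=0.5)
print(len(sample.frames), sample.needle_index, sample.visual_tokens)
```

Sweeping `num_haystack_frames` and `depth` over a grid, and checking whether the model answers a question about the needle frame correctly at each point, yields the kind of context-length versus depth picture that needle-in-a-haystack tests are typically used to produce.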