Long Context Transfer from Language to Vision

Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, Ziwei Liu

2024-07-01

Abstract: Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the difficulty that large multimodal models (LMMs) have in understanding extremely long videos. Current LMMs perform well on single images and short videos, but visual encoders produce so many visual tokens per frame that existing models cannot handle a large number of frames effectively.

The paper makes the following contributions:

1. **Long Context Transfer**: The study finds that the context length of a language model transfers directly to the multimodal model built on it. Extending the context length of the language backbone improves the LMM's ability to handle long videos without any additional training on long videos.
2. **Visual Needle-in-a-Haystack Benchmark (V-NIAH)**: To evaluate how well LMMs locate and retrieve visual information in extremely long contexts, the paper proposes V-NIAH, a synthetic benchmark that extends the "needle-in-a-haystack" test used for language models.
3. **Long Video Assistant (LongVA)**: Combining long context transfer with a unified encoding scheme (UniRes), the LongVA model can perceive over 200,000 visual tokens. It achieves state-of-the-art performance on the Video-MME and MLVU datasets.

The paper shows that LongVA performs well on long video question-answering benchmarks and handles video inputs with a large number of frames in practice, with performance improving as the number of sampled frames increases. V-NIAH, in turn, provides a way to measure the visual context length of video LMMs.
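The needle-in-a-haystack idea behind V-NIAH can be illustrated with a minimal sketch. This is not the authors' implementation: the names `NeedleSample`, `build_needle_sample`, and `needle_grid` are hypothetical, frames are represented by string identifiers rather than images, and the rule for mapping a depth to an insertion index is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple


@dataclass
class NeedleSample:
    """One synthetic test case (hypothetical structure, not from the paper)."""
    frames: List[str]   # frame identifiers; a real benchmark would hold images
    needle_index: int   # position at which the needle frame was inserted
    question: str       # question answerable only from the needle frame
    answer: str


def build_needle_sample(haystack: List[str], needle: str, depth: float,
                        question: str, answer: str) -> NeedleSample:
    """Insert `needle` into `haystack` at a relative depth in [0, 1].

    depth=0.0 places the needle before the first frame and depth=1.0
    after the last one (an assumed convention).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be in [0, 1]")
    index = round(depth * len(haystack))
    frames = haystack[:index] + [needle] + haystack[index:]
    return NeedleSample(frames, index, question, answer)


def needle_grid(frame_counts: List[int],
                depths: List[float]) -> Iterator[Tuple[int, float]]:
    """Enumerate the (haystack length, needle depth) pairs to evaluate."""
    for num_frames in frame_counts:
        for depth in depths:
            yield num_frames, depth


if __name__ == "__main__":
    haystack = [f"frame_{i}" for i in range(10)]
    sample = build_needle_sample(haystack, "needle", 0.5,
                                 "What is shown in the inserted image?",
                                 "placeholder answer")
    print(sample.needle_index, len(sample.frames))
```

Sweeping `needle_grid` over increasing haystack lengths and several depths, then scoring the model's answer for each resulting sample, gives the kind of length-by-depth retrieval picture that language-model NIAH tests typically report.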
