Abstract: Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of the long-context capabilities of VLMs using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and VLM performance degrades sharply when the position encoding exceeds the model's context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate the effectiveness of V2PE in enhancing VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM, InternVL2. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences up to 1M tokens, highlighting its potential for real-world long-context applications.

What problem does this paper attempt to address?

This paper addresses the performance degradation of Vision-Language Models (VLMs) on long-context multimodal tasks. Specifically, VLMs perform poorly on long-sequence inputs such as long videos, high-resolution images, or long interleaved image-text documents. The main problems are:

1. **Poorly suited position encoding mechanism**: Directly applying the text position encoding mechanism to visual tokens is suboptimal, and VLM performance declines sharply once the position encodings exceed the length of the training window.
2. **Limitations of existing methods**: Current methods either handle only a small number of images (usually fewer than 5) or mainly target video data. Most are limited to specific application scenarios and cannot effectively handle complex long-sequence multimodal data.

To solve these problems, the authors propose **Variable Visual Position Encoding (V2PE)**, which manages visual tokens in long multimodal sequences more efficiently by using smaller and variable position increments. This improves the ability of VLMs to understand and reason over long-context multimodal inputs.

### Specific improvement measures

1. **Construct a large-scale long-context multimodal dataset**: Expand existing instruction-tuning datasets (such as DocVQA, ChartQA, SQA) to include long sequences of 32K to 256K tokens, and create a validation set to evaluate the model's performance in longer contexts.
2. **Propose the V2PE method**: Adopt smaller and variable position increments for visual tokens, so that the model adapts better to image inputs of different lengths and complexities and remains stable in long-context processing.
3. **Experimental verification**: Apply V2PE to the open-source VLM InternVL2-2B and fine-tune it on the extended dataset. The fine-tuned model performs well on standard short-context multimodal benchmarks, achieves strong results on tasks requiring long-context processing, and can handle multimodal sequences of up to 1M tokens.

Through these improvements, the paper demonstrates the effectiveness of V2PE and suggests a direction for future research.
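
The position-increment idea behind V2PE can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `assign_positions`, the parameter `visual_step`, the choice to place the first token at position 0, and the example step value 0.25 are all hypothetical. The only thing taken from the summary is that text tokens advance the position by a full step while visual tokens advance it by a smaller, configurable one.

```python
from typing import List, Sequence


def assign_positions(is_visual: Sequence[bool], visual_step: float = 0.25) -> List[float]:
    """Assign one position index per token (illustrative sketch only).

    Assumptions (not from the source): the first token sits at position 0,
    each later text token adds 1 to the previous position, and each later
    visual token adds `visual_step`, with 0 < visual_step <= 1.
    """
    if not 0 < visual_step <= 1:
        raise ValueError("visual_step must be in (0, 1]")

    positions: List[float] = []
    current = 0.0
    for index, visual in enumerate(is_visual):
        if index > 0:
            # The increment depends on the type of the token being placed.
            current += visual_step if visual else 1.0
        positions.append(current)
    return positions


# Two text tokens, four visual tokens, then one text token.
tokens = [False, False, True, True, True, True, False]

# Hypothetical smaller step for visual tokens.
compact = assign_positions(tokens, visual_step=0.25)

# visual_step=1.0 treats visual tokens exactly like text tokens.
standard = assign_positions(tokens, visual_step=1.0)

print(compact)
print(standard)
```

With a step below 1, the four visual tokens together consume less of the position range than they would under the standard scheme, so a sequence dominated by visual tokens reaches a smaller maximum position index. That is the property the summary credits with keeping long multimodal sequences inside the range of positions the model saw during training.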

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

Long Context Transfer from Language to Vision

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Visual Context Window Extension: A New Perspective for Long Video Understanding

LongVLM: Efficient Long Video Understanding via Large Language Models

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Visual Oriented Encoder: Integrating Multimodal and Multi-Scale Contexts for Video Captioning

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

EVLM: An Efficient Vision-Language Model for Visual Understanding

Efficient Large Multi-modal Models via Visual Context Compression

VoCo-LLaMA: Towards Vision Compression with Large Language Models

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

Exploring the Design Space of Visual Context Representation in Video MLLMs

VLP2MSA: Expanding Vision-Language Pre-Training to Multimodal Sentiment Analysis

Exploring Context Window of Large Language Models via Decomposed Positional Vectors

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings