AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

Yonghui Wang,Wengang Zhou,Hao Feng,Houqiang Li

2024-08-30

Abstract: Over the past few years, the advancement of Multimodal Large Language Models (MLLMs) has captured the wide interest of researchers, leading to numerous innovations to enhance MLLMs' comprehension. In this paper, we present AdaptVision, a multimodal large language model specifically designed to dynamically process input images at varying resolutions. We hypothesize that the requisite number of visual tokens for the model is contingent upon both the resolution and content of the input image. Generally, natural images with a lower information density can be effectively interpreted by the model using fewer visual tokens at reduced resolutions. In contrast, images containing textual content, such as documents with rich text, necessitate a higher number of visual tokens for accurate text interpretation due to their higher information density. Building on this insight, we devise a dynamic image partitioning module that adjusts the number of visual tokens according to the size and aspect ratio of images. This method mitigates the distortion that arises from resizing images to a uniform resolution and dynamically optimizes the visual tokens input to the LLMs. Our model is capable of processing images with resolutions up to $1008\times 1008$. Extensive experiments across various datasets demonstrate that our method achieves impressive performance in handling vision-language tasks in both natural and text-related scenes. The source code and dataset are now publicly available at https://github.com/harrytea/AdaptVision.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper attempts to address the performance and efficiency issues of Multimodal Large Language Models (MLLMs) when handling images of different resolutions and types. Specifically, existing MLLMs face the following challenges when dealing with natural scenes and text-dense scenes:

1. **Fixed Resolution Processing**: Most existing methods rely on a static resolution, resulting in a fixed number of visual tokens for the input, which may not be suitable for images of different sizes and types.
2. **Insufficient Text-Dense Image Processing**: Existing models perform poorly when handling high-resolution text-dense images because they are primarily trained on natural images and lack the ability to parse text details.
3. **Image Distortion**: Fixed cropping strategies may lead to image distortion, especially when processing high-resolution images.

To address these challenges, the paper proposes the **AdaptVision** method, which dynamically adjusts the resolution of input images and the number of visual tokens to adapt to different types of images, thereby improving the model's understanding and processing capabilities across various scenarios. Specifically, the main contributions of AdaptVision include:

- **Dynamic Resolution Adjustment**: Dynamically adjusts the resolution based on the size and aspect ratio of the input image, ensuring the use of an appropriate number of visual tokens and reducing distortion.
- **Enhanced Text Parsing Capability**: Improves the model's performance on text-related tasks by expanding the text alignment instruction-following dataset to 100K samples.
- **Extensive Experimental Validation**: Conducts extensive experiments on multiple datasets and tasks to demonstrate the effectiveness of the method.

In summary, AdaptVision aims to improve the performance and efficiency of multimodal large language models in handling different types of images by dynamically adjusting the input resolution and the number of visual tokens.
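The idea of choosing a token budget from image size and aspect ratio can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the image is covered by square tiles of 336 pixels (a guess based on 1008 = 3 × 336; the text above does not state the tile size), with at most 3 tiles per side, and that each tile contributes a fixed number of visual tokens. The names `choose_grid` and `visual_token_budget` and the default `tokens_per_tile` value are hypothetical.

```python
import math
from typing import Tuple

TILE = 336      # assumed tile side in pixels (1008 = 3 * 336); not stated in the summary
MAX_GRID = 3    # assumed maximum tiles per side, giving the 1008 x 1008 cap


def choose_grid(width: int, height: int,
                tile: int = TILE, max_grid: int = MAX_GRID) -> Tuple[int, int]:
    """Return (cols, rows): tiles needed to cover the image, capped per side."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    cols = min(max_grid, math.ceil(width / tile))
    rows = min(max_grid, math.ceil(height / tile))
    return cols, rows


def visual_token_budget(width: int, height: int,
                        tokens_per_tile: int = 256) -> int:
    """Total visual tokens if every tile yields `tokens_per_tile` tokens.

    The default of 256 is a placeholder, not a value from the paper.
    """
    cols, rows = choose_grid(width, height)
    return cols * rows * tokens_per_tile


if __name__ == "__main__":
    # A small natural image needs one tile; a tall document page needs more.
    for size in [(336, 336), (500, 900), (2000, 300), (1008, 1008)]:
        print(size, choose_grid(*size), visual_token_budget(*size))
```

Under these assumptions, a small image occupies a single tile, while a tall or wide page receives a grid that follows its aspect ratio, so the image is not squashed into a square and the token count grows only when the image is larger.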

InfMLLM: A Unified Framework for Visual-Language Tasks

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks

A-VL: Adaptive Attention for Large Vision-Language Models

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

EVLM: An Efficient Vision-Language Model for Visual Understanding

Efficient Multi-modal Large Language Models via Visual Token Grouping

Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

VoCo-LLaMA: Towards Vision Compression with Large Language Models

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

LM4LV: A Frozen Large Language Model for Low-level Vision Tasks

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation