InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

2024-04-10

Abstract: The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution. Recent efforts have aimed to enhance the high-resolution understanding capabilities of LVLMs, yet they remain capped at approximately 1500 x 1500 pixels and constrained to a relatively narrow resolution range. This paper presents InternLM-XComposer2-4KHD, a groundbreaking exploration into elevating LVLM resolution capabilities up to 4K HD (3840 x 1600) and beyond. Because ultra-high resolution may not be necessary in all scenarios, the model also supports a wide range of resolutions from 336 pixels to the 4K standard, significantly broadening its scope of applicability. Specifically, this research advances the patch division paradigm by introducing a novel extension: dynamic resolution with automatic patch configuration. It maintains the aspect ratios of the training images while automatically varying patch counts and configuring layouts based on a pre-trained Vision Transformer (ViT) (336 x 336), leading to dynamic training resolution from 336 pixels to the 4K standard. Our research demonstrates that scaling training resolution up to 4K HD leads to consistent performance enhancements without hitting the ceiling of potential improvements. InternLM-XComposer2-4KHD shows superb capability that matches or even surpasses GPT-4V and Gemini Pro in 10 of the 16 benchmarks. The InternLM-XComposer2-4KHD model series with 7B parameters are publicly available at

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

The paper addresses the limited ability of existing large vision-language models (LVLMs) to handle high-resolution images. It identifies two limitations:

1. **High-resolution understanding ability**: Existing LVLMs can usually handle images only up to roughly 1500×1500 pixels. This limits their performance on tasks that depend on fine-grained visual content, such as diagrams, tables, documents, and infographics, where the model must read detailed information from a high-resolution image.
2. **Limitations of the resolution range**: Existing methods support either a few predefined high-resolution settings or a narrow resolution range, which restricts their practicality across application scenarios.

To overcome both limitations, the paper introduces **dynamic resolution with automatic patch configuration**. The method adapts the number and layout of patches while preserving the original aspect ratio of the image, so it supports resolutions from 336 pixels up to the 4K standard.

Building on this method, the paper proposes the InternLM-XComposer2-4KHD model, which handles high-resolution images well and matches or exceeds closed-source APIs on multiple benchmarks.
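The summary above does not spell out how a patch layout is chosen, so the following is only a minimal sketch of one plausible rule, not the authors' implementation. It assumes a patch budget (`max_patches`, a hypothetical parameter), picks the widest grid of 336 x 336 patches whose row count follows the image aspect ratio (rounded up), and avoids using more columns than the image needs at its native width. The function names `patch_layout` and `target_size` are illustrative.

```python
PATCH = 336  # ViT input side length, as stated in the abstract


def patch_layout(width, height, max_patches, patch=PATCH):
    """Return (cols, rows) for a grid of patch x patch tiles.

    Assumed rule (not taken from the paper): rows = ceil(cols * height / width),
    cols * rows <= max_patches, and cols no larger than needed to cover the
    native image width.
    """
    if width <= 0 or height <= 0 or max_patches < 1:
        raise ValueError("width, height and max_patches must be positive")

    # Integer ceiling division: ceil(a / b) == -(-a // b)
    max_cols = max(1, -(-width // patch))

    # Fallback for very tall images: a single column, clamped to the budget.
    best = (1, min(max_patches, -(-height // width)))

    for cols in range(1, max_cols + 1):
        rows = -(-cols * height // width)
        if cols * rows > max_patches:
            break  # cols * rows only grows with cols, so stop here
        best = (cols, rows)
    return best


def target_size(width, height, max_patches, patch=PATCH):
    """Pixel size (w, h) the image would be resized and padded to."""
    cols, rows = patch_layout(width, height, max_patches, patch)
    return cols * patch, rows * patch


if __name__ == "__main__":
    # A 3840 x 1600 input with a hypothetical budget of 55 patches.
    print(patch_layout(3840, 1600, 55))
    print(target_size(3840, 1600, 55))
    # A 336 x 336 input needs only a single patch.
    print(patch_layout(336, 336, 55))
```

Under these assumptions a small image keeps a small grid while a 4K-wide image fills most of the budget, which is the behaviour the summary describes: one model covering inputs from a single 336-pixel patch up to the 4K standard.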

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

LM4LV: A Frozen Large Language Model for Low-level Vision Tasks

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Dragonfly: Multi-Resolution Zoom-In Encoding Enhances Vision-Language Models

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

HumanVLM: Foundation for Human-Scene Vision-Language Model

TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model