ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng

2024-05-25

Abstract: High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution generating only 576 visual tokens, capable of handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper examines two challenges that Large Multimodal Models (LMMs) face when processing high-resolution images: an excessive number of visual tokens and quadratic visual complexity. Current high-resolution methods address the quadratic complexity but still generate excessive visual tokens, which increases the computational burden. To address this, the paper proposes ConvLLaVA, which uses ConvNeXt, a hierarchical backbone, as the visual encoder in place of the Vision Transformer (ViT). By compressing high-resolution images into information-rich visual features, ConvLLaVA avoids generating excessive visual tokens. The paper notes that ConvNeXt has linear spatial complexity and produces fewer visual tokens than ViT at the same resolution, which reduces the computational load on the Large Language Model (LLM).

To enhance ConvLLaVA, the paper proposes two key optimizations. First, because a ConvNeXt pretrained at low resolution underperforms when applied directly to high-resolution inputs, the encoder is updated to close that gap. Second, because ConvNeXt's original compression ratio is inadequate for much higher resolutions, an additional ConvNeXt stage is trained to compress the visual tokens further and reduce redundancy. Together, these optimizations let ConvLLaVA support input resolutions up to 1536×1536 while generating only 576 visual tokens.

Experimental results show that ConvLLaVA performs comparably to state-of-the-art models on mainstream benchmarks and has advantages in handling images with arbitrary aspect ratios. The paper also compares alternative approaches, such as adding extra visual encoders or cropping the image, and points out their limitations and efficiency issues. In summary, the paper addresses how to process high-resolution images effectively in Large Multimodal Models while reducing the computational cost and redundancy of visual tokens without sacrificing performance.
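The token counts quoted above follow from simple arithmetic on the encoder's total downsampling factor. The sketch below is illustrative only and is not the authors' code: the function name `visual_tokens` is hypothetical, the stride of 32 is the standard four-stage ConvNeXt downsampling factor, the stride of 64 assumes the additional stage halves each side once more, and the patch size of 14 assumes a ViT-L/14-style encoder for comparison. Input sizes are assumed to be exact multiples of the stride.

```python
def visual_tokens(height: int, width: int, stride: int) -> int:
    """Count the visual tokens produced for an image of the given size.

    `stride` is the encoder's total downsampling factor per side (the
    patch size for a ViT). Assumes height and width are multiples of it.
    """
    return (height // stride) * (width // stride)


# Assumed factors (not stated in the summary above):
CONVNEXT_STRIDE = 32      # standard 4-stage ConvNeXt
FIVE_STAGE_STRIDE = 64    # assumption: one extra stage halves each side again
VIT_PATCH = 14            # assumption: ViT-L/14-style patch size

# 1536x1536 with the extra stage: 24 x 24 = 576 tokens, matching the summary.
print(visual_tokens(1536, 1536, FIVE_STAGE_STRIDE))

# The same image with only the original four stages: 48 x 48 tokens.
print(visual_tokens(1536, 1536, CONVNEXT_STRIDE))

# A ViT with patch size 14 reaches the same 576 tokens at only 336x336.
print(visual_tokens(336, 336, VIT_PATCH))

# Non-square inputs work the same way: 12 x 24 tokens for 768x1536.
print(visual_tokens(768, 1536, FIVE_STAGE_STRIDE))
```

Under these assumptions, quadrupling the number of tokens per doubling of side length is what makes the extra compression stage matter: at 1536×1536 the four-stage encoder would hand the LLM four times as many tokens as the five-stage one.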

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

AVG-LLaVA: A Large Multimodal Model with Adaptive Visual Granularity

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Efficient Large Multi-modal Models via Visual Context Compression

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

Efficient Multi-modal Large Language Models via Visual Token Grouping

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

High Efficiency Image Compression for Large Visual-Language Models

Visual Perception by Large Language Model's Weights