ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models

Chunjiang Ge, Sijie Cheng, Ziming Wang, Jiale Yuan, Yuan Gao, Jun Song, Shiji Song, Gao Huang, Bo Zheng
2024-05-25
Abstract: High-resolution Large Multimodal Models (LMMs) encounter the challenges of excessive visual tokens and quadratic visual complexity. Current high-resolution LMMs address the quadratic complexity while still generating excessive visual tokens. However, the redundancy in visual tokens is the key problem as it leads to more substantial compute. To mitigate this issue, we propose ConvLLaVA, which employs ConvNeXt, a hierarchical backbone, as the visual encoder of LMM to replace Vision Transformer (ViT). ConvLLaVA compresses high-resolution images into information-rich visual features, effectively preventing the generation of excessive visual tokens. To enhance the capabilities of ConvLLaVA, we propose two critical optimizations. Since the low-resolution pretrained ConvNeXt underperforms when directly applied on high resolution, we update it to bridge the gap. Moreover, since ConvNeXt's original compression ratio is inadequate for much higher resolution inputs, we train a successive stage to further compress the visual tokens, thereby reducing redundancy. These optimizations enable ConvLLaVA to support inputs of 1536x1536 resolution generating only 576 visual tokens, capable of handling images of arbitrary aspect ratios. Experimental results demonstrate that our method achieves competitive performance with state-of-the-art models on mainstream benchmarks. The ConvLLaVA model series are publicly available at
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two challenges that Large Multimodal Models (LMMs) face when processing high-resolution images: an excessive number of visual tokens and the quadratic computational complexity of visual encoding. Current high-resolution methods deal with the quadratic complexity but still generate too many visual tokens, and the paper argues that this token redundancy is the key problem because it drives up compute. To address this, the paper proposes ConvLLaVA, which uses ConvNeXt, a hierarchical backbone, as the visual encoder in place of the Vision Transformer (ViT). By compressing high-resolution images into information-rich visual features, ConvLLaVA avoids generating excessive visual tokens. According to the paper, ConvNeXt has linear spatial complexity and produces fewer visual tokens than ViT at the same resolution, which reduces the computational burden on the Large Language Model (LLM). The paper proposes two key optimizations to strengthen ConvLLaVA. First, because ConvNeXt pretrained at low resolution underperforms when applied directly to high-resolution inputs, the encoder is updated to close that gap. Second, because ConvNeXt's original compression ratio is inadequate at much higher resolutions, an additional stage is trained to compress the visual tokens further and reduce redundancy. With these optimizations, ConvLLaVA supports input resolutions up to 1536×1536 while generating only 576 visual tokens, and it can handle images of arbitrary aspect ratios. Experimental results show that ConvLLaVA performs competitively with state-of-the-art models on mainstream benchmarks. The paper also discusses alternative approaches, such as adding extra visual encoders or cropping the image, and points out their limitations and efficiency costs. In summary, the paper tackles the problem of processing high-resolution images in LMMs efficiently, reducing visual-token redundancy and computational cost without sacrificing performance.
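
The token figures quoted above can be checked with simple arithmetic. The sketch below is illustrative only and is not code from the paper: the helper names `visual_tokens` and `implied_stride` are hypothetical, the per-side downsampling factor of 64 is inferred from the 1536×1536 and 576-token figures rather than quoted from the text, and the factor of 32 used for comparison assumes the standard four-stage ConvNeXt design. The squared token ratio is only a rough proxy for how self-attention cost in the LLM scales with the number of visual tokens.

```python
import math


def visual_tokens(height: int, width: int, stride: int) -> int:
    """Tokens produced when each image side is downsampled by `stride`."""
    return (height // stride) * (width // stride)


def implied_stride(side: int, tokens: int) -> int:
    """Per-side downsampling factor implied by a square image and its token count."""
    grid = math.isqrt(tokens)
    if grid * grid != tokens or side % grid != 0:
        raise ValueError("token count does not form an even square grid")
    return side // grid


# Inferred from the summary's figures: 1536x1536 input, 576 visual tokens.
stride = implied_stride(1536, 576)

# With the extra compression stage (inferred stride) versus an assumed
# four-stage ConvNeXt with an overall stride of 32.
tokens_extra_stage = visual_tokens(1536, 1536, stride)
tokens_four_stage = visual_tokens(1536, 1536, 32)

# Rough proxy: self-attention cost grows with the square of the token count.
attention_ratio = (tokens_four_stage / tokens_extra_stage) ** 2

# A non-square input simply yields a non-square token grid.
tokens_wide = visual_tokens(768, 1536, stride)

print(stride, tokens_extra_stage, tokens_four_stage, attention_ratio, tokens_wide)
# prints: 64 576 2304 16.0 288
```

Under these assumptions, halving the feature-map side length once more cuts the visual token count by 4x and the pairwise attention term by roughly 16x, which is the kind of saving the summary attributes to the additional compression stage.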