Visual Perception by Large Language Model's Weights

Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun

2024-05-31

Abstract: Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM's weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to perceptual weights with low-rank property, exhibiting a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference. The code and models will be made open-source.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the computational cost of giving large language models (LLMs) visual perception, that is, of building multimodal large language models (MLLMs). The prevailing approach aligns visual features with the input space of the LLM and concatenates the resulting visual tokens with the text tokens into a single input sequence. This works well on a range of vision-language tasks, but the visual tokens lengthen the input sequence and make both training and inference expensive. The paper proposes a different paradigm, parameter space alignment, in which visual information is represented as model weights rather than as input tokens. For each input image, a vision encoder extracts visual features, the features are converted into perceptual weights, and the perceptual weights are merged with the weights of the LLM. The LLM input then contains no visual tokens, so the input sequence is shorter and efficiency improves substantially. To realize this paradigm, the paper introduces VLoRA, whose perceptual weights generator converts visual features into low-rank perceptual weights in a form similar to LoRA. The experiments reported in the paper show that VLoRA achieves comparable performance on various MLLM benchmarks while significantly reducing the computational cost of training and inference. In short, the paper trades visual tokens for image-dependent weight updates in order to reduce the computational burden of MLLMs while keeping performance comparable.
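The idea of merging image-dependent low-rank weights into an LLM weight matrix can be illustrated with a small sketch. This is not the paper's implementation: the function `perceptual_weights`, the matrices `proj` and `learned_a`, and all dimensions are hypothetical, chosen only to show that a rank-limited update of the form W + A·B lets a text-only input carry visual information through the weights.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def random_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1], standing in for real weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def perceptual_weights(visual_features, proj, learned_a):
    """Hypothetical generator (an assumption, not the paper's design).

    Maps r visual feature vectors (r x c) to an h x h update of rank <= r.
    """
    b = matmul(visual_features, proj)  # (r x c) @ (c x h) -> r x h, image-dependent
    return matmul(learned_a, b)        # (h x r) @ (r x h) -> h x h


rng = random.Random(0)
h, c, r = 8, 6, 2  # hidden size, visual feature size, rank (toy values)

llm_weight = random_matrix(h, h, rng)       # one frozen LLM weight matrix
visual_features = random_matrix(r, c, rng)  # stand-in for vision encoder output
proj = random_matrix(c, h, rng)             # hypothetical learned projection
learned_a = random_matrix(h, r, rng)        # hypothetical learned factor

delta_w = perceptual_weights(visual_features, proj, learned_a)
merged_weight = add(llm_weight, delta_w)    # W' = W + A @ B

# The input is a single text-token hidden state; no visual tokens are appended.
text_hidden = random_matrix(h, 1, rng)
output = matmul(merged_weight, text_hidden)
```

Because the update is the product of an h x r and an r x h matrix, its rank is at most r, which is the sense in which it resembles a LoRA update. The input sequence length is unchanged by the image, since the visual information enters only through `merged_weight`.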

PerceptionGPT: Effectively Fusing Visual Perception into LLM

Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

Large Language Models Are Strong Audio-Visual Speech Recognition Learners

Multi-modal Auto-regressive Modeling via Visual Words

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

B-VLLM: A Vision Large Language Model with Balanced Spatio-Temporal Tokens

Demonstrative Instruction Following in Multimodal LLMs Via Integrating Low-Rank Adaptation with Ensemble Learning

InfMLLM: A Unified Framework for Visual-Language Tasks

Visually-Augmented Language Modeling

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

VoCo-LLaMA: Towards Vision Compression with Large Language Models

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

FoPru: Focal Pruning for Efficient Large Vision-Language Models

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference