Q-VLM: Post-training Quantization for Large Vision-Language Models

Changyuan Wang, Ziwei Wang, Xiuwei Xu, Yansong Tang, Jie Zhou, Jiwen Lu

2024-10-11

Abstract: In this paper, we propose a post-training quantization framework for large vision-language models (LVLMs) that enables efficient multi-modal inference. Conventional quantization methods sequentially search the layer-wise rounding functions by minimizing activation discretization errors, which fails to find the optimal quantization strategy because cross-layer dependency is not considered. In contrast, we mine the cross-layer dependency that significantly influences the discretization errors of the entire vision-language model, and embed this dependency into the search for the optimal quantization strategy at low search cost. Specifically, we observe a strong correlation between the activation entropy and the cross-layer dependency concerning output discretization errors. Therefore, we employ the entropy as a proxy to partition blocks optimally, aiming for a satisfying trade-off between discretization errors and search cost. Moreover, we optimize the visual encoder to disentangle the cross-layer dependency for fine-grained decomposition of the search space, so that the search cost is further reduced without harming quantization accuracy. Experimental results demonstrate that our method compresses the memory by 2.78x and increases the generation speed by 1.44x on the 13B LLaVA model without performance degradation on diverse multi-modal reasoning tasks. Code is available at https://github.com/ChangyuanWang17/QVLM.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses the efficient deployment of large vision-language models (LVLMs) for multimodal reasoning. Although these models perform very well on tasks such as visual question answering, embodied instruction following, and robot navigation, their computational cost and storage requirements limit their use on resource-limited mobile devices. The paper therefore proposes a post-training quantization framework called Q-VLM, which aims to accelerate the multimodal reasoning of LVLMs by reducing model complexity while preserving model performance.

Specifically, the paper points out that traditional quantization methods search for the rounding function of each layer sequentially by minimizing the activation discretization error. This approach ignores cross-layer dependencies, so it cannot obtain the optimal quantization strategy and performance drops significantly. To overcome this problem, Q-VLM mines cross-layer dependencies and embeds them into the quantization strategy search, minimizing the discretization error of the entire vision-language model at a lower search cost. Q-VLM also optimizes the visual encoder to decouple cross-layer dependencies, which further reduces the search space and search cost without compromising quantization accuracy. The experimental results show that, on the 13B LLaVA model, Q-VLM compresses memory by 2.78 times and increases generation speed by 1.44 times, with no performance degradation on diverse multimodal reasoning tasks.
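The ideas summarized above (rounding activations onto a low-bit grid, measuring the resulting discretization error, and using activation entropy as a proxy for grouping layers into blocks) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the min-max calibration, the histogram-based entropy estimate, and in particular the greedy threshold rule in `partition_by_entropy` are assumptions made for illustration only.

```python
import math
from collections import Counter


def quantize_uniform(values, num_bits=4):
    """Round each value to the nearest point of a uniform min-max grid."""
    lo, hi = min(values), max(values)
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def discretization_error(values, num_bits=4):
    """Mean squared error between the values and their quantized versions."""
    quantized = quantize_uniform(values, num_bits)
    return sum((a - b) ** 2 for a, b in zip(values, quantized)) / len(values)


def activation_entropy(values, num_bins=16):
    """Shannon entropy (in bits) of a histogram over the activations."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins if hi > lo else 1.0
    # The maximum value is clamped into the last bin.
    counts = Counter(min(int((v - lo) / width), num_bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def partition_by_entropy(layer_entropies, threshold):
    """Group consecutive layers into blocks (hypothetical greedy rule).

    Assumption: a layer whose entropy exceeds `threshold` is treated as
    strongly dependent on the next layer and kept in the same block; a
    layer at or below the threshold closes the current block.
    """
    blocks, current = [], []
    for index, entropy in enumerate(layer_entropies):
        current.append(index)
        if entropy <= threshold:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    activations = [0.0, 0.1, 0.37, 0.5, 0.93, 1.0]
    print(discretization_error(activations, num_bits=2))
    print(discretization_error(activations, num_bits=8))
    print(partition_by_entropy([1.0, 3.5, 3.8, 0.9, 3.2], threshold=3.0))
```

In this sketch, larger blocks mean that more layers would be searched jointly, which captures more cross-layer dependency at a higher search cost; single-layer blocks correspond to the conventional layer-wise search.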

MBQ: Modality-Balanced Quantization for Large Vision-Language Models

VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models

Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation

High Efficiency Image Compression for Large Visual-Language Models

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

RPTQ: Reorder-based Post-training Quantization for Large Language Models

Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference

Efficient Large Multi-modal Models via Visual Context Compression

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models

ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Post Training Quantization of Large Language Models with Microscaling Formats

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

VoCo-LLaMA: Towards Vision Compression with Large Language Models

WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More

P4Q: Learning to Prompt for Quantization in Visual-language Models

LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

LQER: Low-Rank Quantization Error Reconstruction for LLMs

GPTVQ: The Blessing of Dimensionality for LLM Quantization

Resizing codebook of vector quantization without retraining