VoCo-LLaMA: Towards Vision Compression with Large Language Models

Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

2024-06-18

Abstract: Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs' understanding paradigm of vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression (VoCo) tokens during the vision instruction tuning phase and leveraging attention distillation, our method distills how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss with a compression ratio of 576×, resulting in up to 94.8% fewer FLOPs and 69.6% acceleration in inference time. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications. The project page, along with the associated code, can be accessed via https://yxxxb.github.io/VoCo-LLaMA-page/.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses how to compress visual information effectively, easing the context-window limits and high computational cost that vision-language models (VLMs) face when processing high-resolution images and videos. Existing methods compress visual tokens with external modules, which can lose visual information. The paper proposes VoCo-LLaMA, which it describes as the first method to compress visual tokens using the large language model itself. By introducing Vision Compression (VoCo) tokens and applying attention distillation, VoCo-LLaMA transfers the way the language model understands the original visual tokens to its processing of the compressed tokens, achieving effective visual compression and better computational efficiency during inference. At a compression ratio of 576×, VoCo-LLaMA loses only about 16.3% of performance while cutting FLOPs by up to 94.8% and inference time by up to 69.6%. Furthermore, with continued training on time-ordered sequences of compressed tokens from video frames, VoCo-LLaMA learns temporal correlations and outperforms previous methods on popular video question-answering benchmarks. In summary, the paper aims to compress visual data efficiently so that VLMs can handle high-resolution image and video tasks without sacrificing too much performance.
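The summary above does not spell out how the compression tokens are wired into the model. The following is a minimal sketch of one way such a scheme could be expressed as an attention mask, not the authors' implementation. It assumes a sequence layout of vision tokens, then VoCo tokens, then text tokens, and assumes that text tokens are blocked from attending to the raw vision tokens, so that visual information can only reach the text through the VoCo tokens. The function names `voco_attention_mask` and `compression_ratio` are hypothetical, and the figure of 576 vision tokens per image is an assumption chosen to match the stated 576× ratio with a single VoCo token.

```python
def voco_attention_mask(n_vision: int, n_voco: int, n_text: int) -> list[list[bool]]:
    """Return mask[q][k] = True when query position q may attend to key position k.

    Assumed sequence layout: [vision tokens][VoCo tokens][text tokens].
    """
    total = n_vision + n_voco + n_text
    text_start = n_vision + n_voco
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            causal = k <= q  # standard causal (lower-triangular) attention
            # Assumed rule: text queries cannot see raw vision keys,
            # so they must rely on the VoCo tokens instead.
            blocked = q >= text_start and k < n_vision
            row.append(causal and not blocked)
        mask.append(row)
    return mask


def compression_ratio(n_vision: int, n_voco: int) -> float:
    """Ratio of original vision tokens to compressed VoCo tokens."""
    return n_vision / n_voco


# Toy example: 4 vision tokens, 1 VoCo token, 2 text tokens.
mask = voco_attention_mask(n_vision=4, n_voco=1, n_text=2)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))

# Assumed setting: 576 vision tokens compressed into a single VoCo token.
print(compression_ratio(576, 1))
```

In the printed toy mask, the VoCo row sees all four vision tokens, while the two text rows see only the VoCo token and earlier text. Under this assumption, only the VoCo token's cached states would need to be kept at inference time, which is one plausible source of the reported FLOP and latency savings.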

Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM

Efficient Large Multi-modal Models via Visual Context Compression

VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

High Efficiency Image Compression for Large Visual-Language Models

LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification

ZipVL: Efficient Large Vision-Language Models with Dynamic Token Sparsification and KV Cache Compression

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

LongVLM: Efficient Long Video Understanding via Large Language Models

DyCoke: Dynamic Compression of Tokens for Fast Video Large Language Models

Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Efficient Multi-modal Large Language Models via Visual Token Grouping

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks