High Efficiency Image Compression for Large Visual-Language Models

Binzhe Li, Shurun Wang, Shiqi Wang, Yan Ye
2024-07-24
Abstract: In recent years, large visual language models (LVLMs) have shown impressive performance and promising generalization capability in multi-modal tasks, thus replacing humans as receivers of visual information in various application scenarios. In this paper, we pioneer a variable bitrate image compression framework consisting of a pre-editing module and an end-to-end codec to achieve promising rate-accuracy performance for different LVLMs. In particular, instead of optimizing an adaptive pre-editing network towards a particular task or several representative tasks, we propose a new optimization strategy tailored for LVLMs, which is designed around their representation and discrimination capability using token-level distortion and rank. The pre-editing module and the variable bitrate end-to-end image codec are jointly trained with losses based on the semantic tokens of the large model, which improves generalization across data and tasks. Experimental results demonstrate that the proposed framework achieves much better rate-accuracy performance than the state-of-the-art coding standard, Versatile Video Coding. Meanwhile, experiments with multi-modal tasks reveal the robustness and generalization capability of the proposed framework.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language, Image and Video Processing
What problem does this paper attempt to address?
The paper aims to address the issue of image compression requirements for Large Vision-Language Models (LVLMs) in multimodal tasks. Specifically, since LVLMs typically need to handle various tasks and receive a large amount of visual information, existing image compression standards and techniques cannot meet these models' semantic feature needs. The paper proposes a new variable bit-rate image compression framework, which includes a pre-editing module and an end-to-end codec to achieve efficient and adaptive compression performance.

The main contributions of the paper are as follows:

1. Proposes an image compression scheme for LVLMs, including a semantic-driven pre-editor and codec. This scheme optimizes the entire compression process through semantic information rather than optimizing for specific tasks.
2. Develops a pre-editing module guided by large model annotations, designed to retain key semantic information while discarding irrelevant semantic information, thereby minimizing bit-rate consumption.
3. Enhances semantic consistency by imposing supervision on the ordering of large model annotations during the compression process. This method assumes that the ordering of annotations reflects semantic richness and further optimizes the performance of machine tasks by maintaining the ordering of annotations in the reconstructed visual signals.

Experimental results show that compared to current state-of-the-art video coding standards (such as Versatile Video Coding, VVC), this framework can significantly improve compression efficiency and accuracy under different multimodal tasks. Additionally, the method demonstrates stronger generalization capabilities in various applications.
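
The joint training objective described above trades a bitrate term against a distortion measured on the large model's semantic tokens rather than on pixels. The following is a minimal sketch of that trade-off, not the authors' implementation. The function names, the use of mean squared error as the token-level distortion, and the weight `lam` are all assumptions made for illustration. The rank or ordering term is left out because the text here does not define it.

```python
from typing import Sequence

TokenList = Sequence[Sequence[float]]


def token_distortion(ref_tokens: TokenList, rec_tokens: TokenList) -> float:
    """Mean squared error between matching token vectors.

    Assumption: ref_tokens come from the original image and rec_tokens
    from the decoded image, both produced by the same (hypothetical)
    LVLM visual encoder. MSE is an illustrative choice of distortion.
    """
    if len(ref_tokens) != len(rec_tokens):
        raise ValueError("token lists must have the same length")
    total = 0.0
    count = 0
    for ref, rec in zip(ref_tokens, rec_tokens):
        if len(ref) != len(rec):
            raise ValueError("token vectors must have the same dimension")
        for a, b in zip(ref, rec):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("token lists must not be empty")
    return total / count


def rate_accuracy_loss(bits_per_pixel: float,
                       ref_tokens: TokenList,
                       rec_tokens: TokenList,
                       lam: float) -> float:
    """Rate plus lam times token-level distortion (assumed Lagrangian form).

    A larger lam favors token fidelity at the cost of more bits; sweeping
    lam is one common way to obtain several operating points.
    """
    return bits_per_pixel + lam * token_distortion(ref_tokens, rec_tokens)


if __name__ == "__main__":
    # Toy tokens: two 2-dimensional vectors per image (made-up values).
    ref = [[1.0, 0.0], [0.0, 1.0]]
    rec = [[1.0, 0.0], [0.0, 0.0]]
    print(rate_accuracy_loss(0.5, ref, rec, lam=2.0))
```

In a real training setup the token vectors would be tensors from the LVLM's visual encoder and the rate would come from the codec's entropy model; both are replaced here by plain Python numbers to keep the sketch self-contained.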