Abstract: Vision Language Models (VLMs) have demonstrated strong capabilities across a variety of visual understanding and reasoning tasks. However, their real-world deployment is often constrained by high inference latency, because the LLM must spend substantial compute processing a large number of input tokens, most of which come from the image. To reduce inference costs, one can either downsize the LLM or reduce the number of input image tokens; the latter has been the focus of many recent works on token compression. However, it is unclear what the optimal trade-off is, as both factors directly affect VLM performance. We first characterize this optimal trade-off between the number of visual tokens and LLM parameters by establishing scaling laws that capture how performance varies with these two factors. Our results reveal a surprising trend: for visual reasoning tasks, the inference-optimal behavior in VLMs, i.e., minimum downstream error at any given fixed inference compute, is achieved by using the largest LLM that fits within the inference budget while minimizing the visual token count, often to a single token. While the token reduction literature has mainly focused on maintaining base model performance by modestly reducing the token count (e.g., $5-10\times$), our results indicate that the compute-optimal inference regime requires operating under even higher token compression ratios. Based on these insights, we take some initial steps towards building approaches tailored for high token compression settings. Code is available at https://github.com/locuslab/llava-token-compression.
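The trade-off described in the abstract can be made concrete with a rough back-of-the-envelope sketch. It assumes the common approximation that a forward pass costs about 2 FLOPs per parameter per input token; this cost model, the function names, and the example sizes (a 7B LLM, 576 visual tokens, 50 text tokens) are illustrative assumptions and are not taken from the abstract.

```python
# Minimal sketch of the parameters-versus-visual-tokens trade-off at a fixed
# inference budget. ASSUMPTION: forward-pass cost is roughly
# 2 * parameters * input_tokens; this is a rule of thumb, not the paper's
# stated cost model.

def inference_flops(n_params: float, n_visual_tokens: int, n_text_tokens: int) -> float:
    """Approximate FLOPs for the LLM to process all input tokens once."""
    return 2.0 * n_params * (n_visual_tokens + n_text_tokens)


def max_visual_tokens(budget_flops: float, n_params: float, n_text_tokens: int) -> int:
    """Visual tokens affordable within the budget (negative if the text alone does not fit)."""
    total_tokens = int(budget_flops // (2.0 * n_params))
    return total_tokens - n_text_tokens


# Hypothetical baseline: a 7B-parameter LLM with 576 visual and 50 text tokens.
budget = inference_flops(7e9, 576, 50)

# At the same budget, a larger LLM must run on fewer visual tokens.
for n_params in (0.5e9, 7e9, 14e9):
    print(f"{n_params / 1e9:>5.1f}B params -> {max_visual_tokens(budget, n_params, 50)} visual tokens")
```

Under this simplified model, doubling the LLM size at a fixed budget roughly halves the total token count the model can process, which is why heavy visual token compression is what makes a larger LLM affordable.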

TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation

Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Leveraging Zero-Shot Prompting for Efficient Language Model Distillation

Rational Metareasoning for Large Language Models

An Efficient Multilingual Language Model Compression through Vocabulary Trimming

Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics

Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

Small Language Models Improve Giants by Rewriting Their Outputs

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference

Not All Layers of LLMs Are Necessary During Inference

Mixed Distillation Helps Smaller Language Model Better Reasoning

Inference Optimal VLMs Need Only One Visual Token but Larger Models

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Inference acceleration for large language models using "stairs" assisted greedy generation

Learning to Reduce: Optimal Representations of Structured Data in Prompting Large Language Models

Tiny Titans: Can Smaller Large Language Models Punch Above Their Weight in the Real World for Meeting Summarization?