Abstract: Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.

What problem does this paper attempt to address?

The main problem this paper addresses is the inefficiency of visual-language models (VLMs). Although VLMs have made remarkable progress in accuracy in recent years, their training, fine-tuning, and deployment costs remain very high, and processing high-resolution images and long videos consumes excessive computational resources. Specifically:

1. **High training cost**: Training an advanced 7-billion-parameter VLM may require as many as 400 GPU days, which is a huge entry barrier for researchers.
2. **Large fine-tuning memory footprint**: Fully fine-tuning a 7-billion-parameter VLM may require more than 64GB of GPU memory, which is beyond the capabilities of most consumer-grade GPUs.
3. **Limited deployment resources**: Deploying VLMs on edge devices (such as laptops and robots) is constrained by the available computational budget.

To solve these problems, the paper introduces NVILA, an open-source VLM family aimed at optimizing both efficiency and accuracy. NVILA improves the efficiency of VLMs through the following methods:

- **"Scale-then-compress" strategy**: First, scale up the spatial and temporal resolutions to improve accuracy, and then compress the visual tokens to improve computational efficiency. This strategy enables NVILA to efficiently process high-resolution images and long videos.
- **Systematic optimization of the life cycle**: NVILA optimizes efficiency at every stage, from training and fine-tuning to deployment. For example:
  - The training cost is reduced by 4.5 times.
  - The fine-tuning memory usage is reduced by 3.4 times.
  - The pre-filling latency is reduced by 1.6-2.2 times.
  - The decoding latency is reduced by 1.2-2.8 times.

In addition, NVILA matches or surpasses the performance of many leading open-source and proprietary VLMs on multiple image and video benchmarks while achieving these efficiency improvements.
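The "scale-then-compress" idea described above can be illustrated with a small token-counting sketch. This is not NVILA's implementation: the summary does not say how tokens are compressed, so the sketch assumes simple average pooling over 2x2 neighbourhoods of a square patch grid. The image sizes (448 and 896 pixels), the patch size (14 pixels), and the function names `num_patch_tokens` and `compress_tokens` are hypothetical values and names chosen only for illustration.

```python
from typing import List

Token = List[float]  # one visual token = one feature vector


def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Number of visual tokens for a square image cut into square patches."""
    side = image_size // patch_size
    return side * side


def compress_tokens(tokens: List[Token], grid: int, window: int) -> List[Token]:
    """Average-pool a row-major (grid x grid) token grid over window x window blocks.

    Assumption for illustration only: pooling stands in for whatever
    compression the real model uses.
    """
    if len(tokens) != grid * grid or grid % window != 0:
        raise ValueError("token count must equal grid*grid and grid must be divisible by window")
    dim = len(tokens[0])
    out_side = grid // window
    pooled: List[Token] = []
    for out_r in range(out_side):
        for out_c in range(out_side):
            acc = [0.0] * dim
            for dr in range(window):
                for dc in range(window):
                    r = out_r * window + dr
                    c = out_c * window + dc
                    tok = tokens[r * grid + c]
                    for k in range(dim):
                        acc[k] += tok[k]
            pooled.append([v / (window * window) for v in acc])
    return pooled


# Hypothetical numbers: a 448-pixel baseline, scaled up to 896 pixels.
base_count = num_patch_tokens(448, 14)      # 32 x 32 grid
scaled_count = num_patch_tokens(896, 14)    # 64 x 64 grid ("scale")
scaled_tokens = [[float(i)] * 4 for i in range(scaled_count)]
compressed = compress_tokens(scaled_tokens, grid=64, window=2)  # "compress"

print(base_count, scaled_count, len(compressed))  # prints: 1024 4096 1024
```

Under these assumed numbers, doubling the resolution quadruples the token count, and 2x2 compression brings it back to the baseline count, so the language model sees the same number of tokens while each token summarizes a higher-resolution region. How much accuracy such compression retains is an empirical question that this sketch does not address.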
The paper also demonstrates NVILA's potential in specific fields, such as temporal localization, robot navigation, and medical imaging. In conclusion, the goal of this paper is to lower the barrier to using VLMs and broaden their application scenarios by designing an efficient VLM architecture and optimizing efficiency throughout the model life cycle.

NVILA: Efficient Frontier Visual Language Models
