NVILA: Efficient Frontier Visual Language Models

Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu
2024-12-06
Abstract: Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The main problem this paper addresses is the inefficiency of visual language models (VLMs). Although VLMs have made remarkable progress in accuracy in recent years, their training, fine-tuning, and deployment costs remain very high, and they consume excessive computational resources when processing high-resolution images and long videos. Specifically:

1. **High training cost**: Training an advanced 7-billion-parameter VLM may require as many as 400 GPU days, which is a huge entry barrier for researchers.
2. **Large fine-tuning memory footprint**: Fully fine-tuning a 7-billion-parameter VLM may require more than 64 GB of GPU memory, which is beyond the capabilities of most consumer-grade GPUs.
3. **Limited deployment resources**: Deploying VLMs on edge devices (such as laptops and robots) is limited by the computational budget.

To solve these problems, the paper introduces NVILA, an open-source VLM family that aims to optimize both efficiency and accuracy. NVILA improves the efficiency of VLMs in two ways:

- **"Scale-then-compress" strategy**: First, scale up the spatial and temporal resolutions to improve accuracy, and then compress the visual tokens to improve computational efficiency. This strategy enables NVILA to efficiently process high-resolution images and long videos.
- **Systematic optimization across the lifecycle**: NVILA applies efficiency optimizations from training and fine-tuning through to deployment. For example:
  - Training cost is reduced by 4.5 times.
  - Fine-tuning memory usage is reduced by 3.4 times.
  - Pre-filling latency is reduced by 1.6-2.2 times.
  - Decoding latency is reduced by 1.2-2.8 times.

In addition, NVILA matches or surpasses the accuracy of many leading open-source and proprietary VLMs on multiple image and video benchmarks while achieving these efficiency improvements.
The paper also shows the application potential of NVILA in specific fields, such as temporal localization, robot navigation, and medical imaging. In conclusion, the goal of this paper is to lower the barrier to using VLMs and expand their application scenarios by designing an efficient VLM architecture and optimizing efficiency throughout the model lifecycle.
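
To make the "scale-then-compress" idea more concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the image sizes, the patch size of 14, the 2x2 window, and the use of simple averaging as the compression step are all assumptions chosen for illustration, and the function names `token_grid_size` and `average_pool_tokens` are hypothetical. The sketch only shows how raising the input resolution multiplies the number of visual tokens, and how merging neighboring tokens brings the count back down.

```python
def token_grid_size(image_size: int, patch_size: int) -> int:
    """Number of patch tokens along one side of a square image.

    Assumes a ViT-style encoder that cuts the image into
    non-overlapping patch_size x patch_size patches.
    """
    return image_size // patch_size


def average_pool_tokens(grid, window):
    """Merge each window x window block of token vectors into one token.

    `grid` is a list of rows, each row a list of token vectors (lists of
    floats). Averaging is an illustrative stand-in for whatever
    compression operator a real model uses.
    """
    rows, cols = len(grid), len(grid[0])
    dim = len(grid[0][0])
    pooled = []
    # Step over the grid one window at a time, dropping any ragged edge.
    for r in range(0, rows - rows % window, window):
        pooled_row = []
        for c in range(0, cols - cols % window, window):
            block = [
                grid[r + dr][c + dc]
                for dr in range(window)
                for dc in range(window)
            ]
            merged = [sum(vec[i] for vec in block) / len(block) for i in range(dim)]
            pooled_row.append(merged)
        pooled.append(pooled_row)
    return pooled


if __name__ == "__main__":
    patch = 14  # hypothetical patch size
    base_side = token_grid_size(448, patch)    # hypothetical baseline resolution
    scaled_side = token_grid_size(896, patch)  # "scale": double the resolution
    print("baseline tokens:", base_side * base_side)
    print("scaled tokens:", scaled_side * scaled_side)

    # "compress": merge 2x2 neighborhoods of the scaled token grid.
    grid = [[[float(r), float(c)] for c in range(scaled_side)] for r in range(scaled_side)]
    pooled = average_pool_tokens(grid, window=2)
    print("compressed tokens:", len(pooled) * len(pooled[0]))
```

Under these assumed numbers, doubling the resolution quadruples the token count, and 2x2 merging returns it to the baseline count while each remaining token summarizes a higher-resolution region. Whether accuracy is preserved depends on the compression operator and training recipe, which this sketch does not model.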