Abstract: VILA-U is a Unified foundation model that integrates Video, Image, and Language understanding and generation. Traditional visual language models (VLMs) use separate modules for understanding and generating visual content, which can lead to misalignment and increased complexity. In contrast, VILA-U employs a single autoregressive next-token prediction framework for both tasks, eliminating the need for additional components such as diffusion models. This approach not only simplifies the model but also achieves near state-of-the-art performance in visual language understanding and generation. The success of VILA-U is attributed to two main factors: first, a unified vision tower that aligns discrete visual tokens with textual inputs during pretraining, which enhances visual perception; second, the observation that autoregressive image generation can achieve quality similar to that of diffusion models when trained on a high-quality dataset. This allows VILA-U to perform comparably to more complex models using a fully token-based autoregressive framework.

What problem does this paper attempt to address?

The main problem this paper addresses is that existing vision-language models (VLMs) handle visual understanding and visual generation separately. Traditional VLMs usually use independent modules for the two tasks, which can lead to inconsistency between them and increases model complexity. VILA-U handles both tasks with a single autoregressive next-token prediction framework, eliminating the need for additional components such as diffusion models. This not only simplifies the model structure but also achieves performance close to the state of the art in visual-language understanding and generation.

Specifically, VILA-U addresses two key problems:

1. **Visual Understanding and Text Alignment**: Existing end-to-end autoregressive VLMs lag behind VLMs with continuous tokens in visual understanding, because their discrete vector-quantized (VQ) tokens are trained only with an image reconstruction loss and are not aligned with text input. VILA-U therefore introduces text alignment during pretraining of the VQ vision tower to improve perception.
2. **High-Quality Visual Generation**: Autoregressive image generation can reach quality similar to that of diffusion models if it is trained on high-quality data. VILA-U uses a small, high-quality image-text corpus for multimodal training and applies a unified next-token prediction objective to both visual and text tokens.

With these methods, VILA-U performs well on visual-language understanding and generation tasks without relying on external components such as diffusion models, which gives it advantages in both performance and efficiency.
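The unified next-token prediction objective described above can be illustrated with a toy sketch. It is not VILA-U's implementation: the vocabulary sizes, the offset scheme that places visual codebook indices after the text ids, and all function names are assumptions made for illustration. It only shows that once image content is represented as discrete tokens, text and visual tokens can share one sequence and one cross-entropy loss.

```python
import math

# Hypothetical toy sizes, chosen only for illustration.
TEXT_VOCAB_SIZE = 8
VISUAL_CODEBOOK_SIZE = 4
VOCAB_SIZE = TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE


def visual_to_shared(code_index):
    """Map a VQ codebook index into the shared vocabulary (assumed offset scheme)."""
    if not 0 <= code_index < VISUAL_CODEBOOK_SIZE:
        raise ValueError("codebook index out of range")
    return TEXT_VOCAB_SIZE + code_index


def build_sequence(text_ids, visual_codes):
    """Concatenate text tokens and offset visual tokens into one token sequence."""
    return list(text_ids) + [visual_to_shared(c) for c in visual_codes]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def next_token_loss(logits_per_position, sequence):
    """Mean cross-entropy where the logits at position t predict token t + 1.

    The same loss is applied whether the target is a text or a visual token.
    """
    targets = sequence[1:]
    if len(logits_per_position) != len(targets):
        raise ValueError("need one logit vector per predicted token")
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)


# Usage: three text tokens followed by three visual codes.
text_ids = [1, 5, 2]
visual_codes = [0, 3, 3]
sequence = build_sequence(text_ids, visual_codes)

# Stand-in for model outputs: uniform logits over the shared vocabulary.
uniform_logits = [[0.0] * VOCAB_SIZE for _ in range(len(sequence) - 1)]
loss = next_token_loss(uniform_logits, sequence)
```

With uniform logits the loss equals the log of the shared vocabulary size, which is a convenient sanity check; a trained model would produce lower values on both the text and the visual positions.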

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization

NVILA: Efficient Frontier Visual Language Models

VILA$^2$: VILA Augmented VILA

EVLM: An Efficient Vision-Language Model for Visual Understanding

VILA: On Pre-training for Visual Language Models

Visually-Augmented Language Modeling

Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

CogVLM: Visual Expert for Large Language Models

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

RSUniVLM: A Unified Vision Language Model for Remote Sensing via Granularity-oriented Mixture of Experts

Unified Lexical Representation for Interpretable Visual-Language Alignment

X-VILA: Cross-Modality Alignment for Large Language Model

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

VLIS: Unimodal Language Models Guide Multimodal Language Generation

LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via Hierarchical Window Transformer

Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild