Abstract: We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes it difficult to align with textual semantic features. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, their performance is still far from dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces training difficulty and improves the performance of the unified model. The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.

What problem does this paper attempt to address?

The paper addresses the shortcomings of existing vision-language models (VLMs) in multimodal understanding and generation. In particular, current visual tokenizers do not capture the semantic content of images, which makes visual tokens hard to align with text tokens, raises training complexity, and increases the amount of training data required. Specifically:

1. **Limitations of existing visual tokenizers**: Existing visual tokenizers such as VQGAN mainly focus on low-level information and are difficult to align with textual semantic features, so they perform poorly on complex multimodal tasks.
2. **Alignment problem between visual and text tokens**: Once visual information has been converted into discrete tokens, aligning those tokens with text tokens remains a challenge. Existing methods are insufficient in this regard, leading to poor visual understanding performance.
3. **Training complexity and data requirements**: Because visual tokens are difficult to align with text tokens, existing models require a large amount of training data to achieve optimal performance, and the training process is complex.

To overcome these problems, the paper proposes the Semantic Discrete Encoding (SDE) method. By introducing semantic constraints into the visual tokenization process, SDE aligns visual tokens more closely with text tokens, which simplifies training and improves model performance. Building on SDE, the paper further proposes the MUSE-VL model, which significantly outperforms existing methods on multiple vision-language benchmarks.

### Specific problem summary

- **Problem 1**: How to incorporate semantic information in the visual tokenization process to improve the alignment between visual tokens and text tokens?
- **Problem 2**: How to reduce the training complexity of vision-language models and the amount of required training data?
- **Problem 3**: How to achieve better performance in multimodal understanding and generation tasks?

By proposing the SDE method and the MUSE-VL model, the paper aims to solve the above problems and improve the performance of multimodal models in visual understanding and generation tasks.
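
To make the idea of "adding semantic constraints to the visual tokenizer" more concrete, the sketch below shows one way such a constraint could look. It is not the authors' implementation: the function names, the cosine-distance form of the semantic term, the `sem_weight` parameter, and the assumption that `semantic_targets` come from some pretrained semantic image encoder are all hypothetical choices made for illustration. The pixel reconstruction term of a VQGAN-style tokenizer is omitted for brevity.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (squared L2)."""
    # dists[i, k] = squared distance between features[i] and codebook[k]
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)  # discrete visual token ids
    return ids, codebook[ids]


def cosine_distance(a, b, eps=1e-8):
    """Mean of (1 - cosine similarity) over matching rows of a and b."""
    a_n = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - (a_n * b_n).sum(axis=1)))


def semantic_tokenizer_loss(features, codebook, semantic_targets, sem_weight=1.0):
    """Hypothetical tokenizer objective: quantization term + semantic term.

    semantic_targets is assumed to hold per-patch features from a pretrained
    semantic encoder, already projected to the codebook dimension.
    """
    ids, quantized = quantize(features, codebook)
    # Standard vector-quantization term: keep features close to their codes.
    quant_loss = float(np.mean((features - quantized) ** 2))
    # Assumed semantic constraint: quantized codes should point in the same
    # direction as the semantic targets.
    sem_loss = cosine_distance(quantized, semantic_targets)
    return ids, quant_loss, sem_loss, quant_loss + sem_weight * sem_loss


# Tiny usage example with a 2-entry codebook and two image patches.
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])
features = np.array([[0.9, 0.1], [0.2, 0.8]])
semantic_targets = np.array([[2.0, 0.0], [0.0, 3.0]])
ids, quant_loss, sem_loss, total = semantic_tokenizer_loss(
    features, codebook, semantic_targets
)
```

In this toy case the semantic targets point in the same directions as the selected codebook entries, so the semantic term is close to zero and the total is dominated by the quantization term. A tokenizer trained only on low-level reconstruction has no such term pulling its codes toward language-aligned features, which is the gap the summary above describes.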

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Unified Visual-Semantic Embeddings: Bridging Vision and Language with Structured Meaning Representations

Unified Generative and Discriminative Training for Multi-modal Large Language Models

VLIS: Unimodal Language Models Guide Multimodal Language Generation

Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization

Towards More Unified In-context Visual Understanding

EVLM: An Efficient Vision-Language Model for Visual Understanding

Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model

VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

MouSi: Poly-Visual-Expert Vision-Language Models

OCC-MLLM: Empowering Multimodal Large Language Model For the Understanding of Occluded Objects

VL-Meta: Vision-Language Models for Multimodal Meta-Learning

Unified Discrete Diffusion for Simultaneous Vision-Language Generation

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

LLaVA-SG: Leveraging Scene Graphs as Visual Semantic Expression in Vision-Language Models

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model