MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

Rongchang Xie, Chen Du, Ping Song, Chang Liu
2024-11-26
Abstract: We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and understanding. However, existing vision tokenizers (e.g., VQGAN) only consider low-level information, which makes their tokens difficult to align with textual semantic features. This results in high training complexity and necessitates a large amount of training data to achieve optimal performance. Additionally, the performance of such unified models still falls far short of that of dedicated understanding models. This paper proposes Semantic Discrete Encoding (SDE), which effectively aligns the information of visual tokens and language tokens by adding semantic constraints to the visual tokenizer. This greatly reduces training difficulty and improves the performance of the unified model. The proposed model significantly surpasses the previous state-of-the-art in various vision-language benchmarks and achieves better performance than dedicated understanding models.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the shortcomings of existing vision-language models (VLMs) in multimodal understanding and generation. In particular, the visual tokenizer does not capture the semantic content of images, which makes visual tokens hard to align with text tokens, raises training complexity, and demands a large amount of training data. Specifically:

1. **Limitations of existing visual tokenizers**: Existing visual tokenizers such as VQGAN focus mainly on low-level information, so their tokens are difficult to align with textual semantic features. As a result, they perform poorly on complex multimodal tasks.
2. **Alignment problem between visual and text tokens**: Once visual information has been converted into discrete tokens, aligning those tokens with text tokens remains a challenge. Existing methods handle this insufficiently, which leads to poor visual understanding performance.
3. **Training complexity and data requirements**: Because visual tokens are hard to align with text tokens, existing models need a large amount of training data to reach optimal performance, and the training process is complex.

To overcome these problems, the paper proposes the Semantic Discrete Encoding (SDE) method. By introducing semantic constraints into the visual tokenization process, SDE aligns visual tokens more closely with text tokens, which simplifies training and improves model performance. Building on SDE, the paper further proposes the MUSE-VL model, which significantly outperforms existing methods on multiple vision-language benchmarks.

### Specific problem summary

- **Problem 1**: How can semantic information be incorporated into the visual tokenization process to improve the alignment between visual tokens and text tokens?
- **Problem 2**: How can the training complexity of vision-language models and the amount of required training data be reduced?
- **Problem 3**: How can better performance be achieved in multimodal understanding and generation tasks?

With the SDE method and the MUSE-VL model, the paper aims to solve these problems and improve the performance of multimodal models on visual understanding and generation tasks.
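
The idea of adding a semantic constraint to a discrete visual tokenizer can be illustrated with a toy sketch. The summary does not give the paper's actual loss, so everything here is an assumption made for illustration: the function names, the two-term objective (quantization error plus a cosine-based semantic term), the `semantic_weight` parameter, and the stand-in `semantic_target` vector are all hypothetical and are not taken from MUSE-VL.

```python
import math


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    distances = [sum((v - c) ** 2 for v, c in zip(vec, code)) for code in codebook]
    return distances.index(min(distances))


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_tokenizer_loss(patch_feature, codebook, semantic_target, semantic_weight=1.0):
    """Quantize one patch and score it with a hypothetical two-term objective."""
    token_id = nearest_code(patch_feature, codebook)
    quantized = codebook[token_id]
    # Low-level term: how far the patch feature is from its chosen code.
    quantization_error = sum((p - q) ** 2 for p, q in zip(patch_feature, quantized))
    # Semantic term (assumed form): penalize codes that point away from the target.
    semantic_loss = 1.0 - cosine_similarity(quantized, semantic_target)
    return token_id, quantization_error + semantic_weight * semantic_loss


# Hypothetical 2-D codebook and features, chosen only to keep the numbers readable.
codebook = [[1.0, 0.0], [0.0, 1.0]]
token_id, loss = toy_tokenizer_loss([0.9, 0.1], codebook, semantic_target=[2.0, 0.0])
print(token_id, round(loss, 4))
```

In this sketch a tokenizer trained only on the first term would ignore `semantic_target` entirely, which corresponds to the low-level-only behavior the summary attributes to VQGAN-style tokenizers. The second term is one simple way to express a semantic constraint; a real system would learn the codebook by gradient descent rather than evaluate a fixed one.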