PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu
2024-10-21
Abstract: Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing works have insufficiently addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm - from the diversity required in text-to-image generation to the precise controllability needed in image manipulation. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs, elegantly addressing the different granularity requirements of various image generation tasks within a unified MLLM framework. Following multimodal pretraining and task-specific instruction tuning, PUMA demonstrates proficiency in a wide range of multimodal tasks. This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks. The code and model will be released at https://github.com/rongyaofang/PUMA.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

This paper addresses the challenges that multimodal large language models (MLLMs) face in handling visual generation tasks of different granularities. Specifically, existing methods struggle with the trade-off between generating diverse images and precisely controlling image generation. For example, text-to-image generation requires high diversity and fidelity, while conditional generation and image editing require precise control over the output image.

To overcome this challenge, the authors propose **PUMA** (emPowering Unified MLLM with Multi-grAnular visual generation), a unified multimodal large language model that takes visual representations of various granularities as input and generates them as output. By extracting and using multi-scale features, PUMA addresses the granularity requirements of different visual generation tasks and balances diversity against controllability.

### Main Contributions

1. **Multi-granular Feature Handling**: PUMA handles features ranging from coarse-grained abstractions to fine-grained details, enabling it to manage a wide range of multimodal tasks within a unified framework.
2. **Broad Applicability to Multimodal Tasks**: PUMA covers multiple tasks, including image understanding, diverse text-to-image generation, editing, restoration, colorization, and conditional generation.
3. **Two-Stage Training Strategy**: PUMA combines large-scale pre-training with task-specific instruction tuning, allowing one model to perform well across these tasks.
4. **Experimental Validation**: Experiments on multiple benchmark datasets validate the effectiveness of PUMA in tasks such as image reconstruction, semantic-guided generation, diverse text-to-image generation, image editing, and conditional generation.

### Method Overview

The PUMA method includes three key components:

1. **Image Encoder**: Extracts multi-granular image features, which serve as the foundation for visual generation and understanding.
2. **Multi-granular Visual Decoder**: Generates images from features of different granularities.
3. **Multi-granular Autoregressive MLLM**: Processes and generates text and multi-granular image features.

Working together, these components let PUMA handle tasks with different granularity requirements, marking an important step towards more general and powerful multimodal large language models.
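To make the idea of multi-granular features concrete, here is a minimal sketch of how a single grid of patch features could be turned into a coarse-to-fine pyramid. The summary above does not say how the multi-scale features are extracted, so the repeated 2x2 average pooling used here is an assumption, as are the names `avg_pool_2x2` and `feature_pyramid` and the toy input values. It illustrates the concept and should not be read as PUMA's actual encoder.

```python
from typing import List

# H x W grid of D-dimensional feature vectors (hypothetical layout).
Grid = List[List[List[float]]]


def avg_pool_2x2(grid: Grid) -> Grid:
    """Halve a feature grid by averaging each 2x2 block of vectors."""
    h, w, d = len(grid), len(grid[0]), len(grid[0][0])
    assert h % 2 == 0 and w % 2 == 0, "grid sides must be even"
    pooled = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            block = [grid[i][j], grid[i][j + 1],
                     grid[i + 1][j], grid[i + 1][j + 1]]
            # Average the four vectors component by component.
            row.append([sum(v[k] for v in block) / 4.0 for k in range(d)])
        pooled.append(row)
    return pooled


def feature_pyramid(grid: Grid) -> List[Grid]:
    """Return grids ordered from finest (the input) to coarsest (1x1).

    Assumes a square grid whose side length is a power of two.
    """
    levels = [grid]
    while len(levels[-1]) > 1:
        levels.append(avg_pool_2x2(levels[-1]))
    return levels


# Toy 4x4 grid of 2-dimensional features; the values are placeholders.
toy = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
pyramid = feature_pyramid(toy)

# Number of feature vectors ("tokens") at each granularity level.
token_counts = [len(level) * len(level[0]) for level in pyramid]
print(token_counts)  # [16, 4, 1]
```

In this toy setup the finest level keeps all 16 vectors and so preserves spatial detail, which is the kind of information precise editing would need. The coarsest level is a single averaged vector that keeps only a global summary and leaves more room for diverse outputs.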