Abstract: Recent advancements in multimodal foundation models have yielded significant progress in vision-language understanding. Initial attempts have also explored the potential of multimodal large language models (MLLMs) for visual content generation. However, existing works have insufficiently addressed the varying granularity demands of different image generation tasks within a unified MLLM paradigm, from the diversity required in text-to-image generation to the precise controllability needed in image manipulation. In this work, we propose PUMA, emPowering Unified MLLM with Multi-grAnular visual generation. PUMA unifies multi-granular visual features as both inputs and outputs of MLLMs, elegantly addressing the different granularity requirements of various image generation tasks within a unified MLLM framework. Following multimodal pretraining and task-specific instruction tuning, PUMA demonstrates proficiency in a wide range of multimodal tasks. This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks. The code and model will be released at https://github.com/rongyaofang/PUMA.

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address the challenges faced by multimodal large language models (MLLM) in handling visual generation tasks of different granularities. Specifically, existing methods struggle with the trade-off between generating diverse images and precisely controlling image generation. For example, text-to-image generation tasks require high diversity and fidelity, while conditional generation and image editing tasks require high precision control over the images.

To overcome this challenge, the authors propose **PUMA** (emPowering Unified MLLM with Multi-grAnular visual generation), a unified multimodal large language model capable of handling and generating visual representations of various granularities. By extracting and utilizing multi-scale features, PUMA elegantly addresses the granularity requirements of different visual generation tasks, thus achieving a balance between diversity and controllability.

### Main Contributions

1. **Multi-granular Feature Handling**: PUMA can simultaneously handle various features ranging from coarse-grained abstractions to fine-grained details, enabling it to manage a wide range of multimodal tasks within a unified framework.
2. **Broad Applicability to Multimodal Tasks**: PUMA excels in multiple tasks including image understanding, diverse text-to-image generation, editing, restoration, coloring, and conditional generation.
3. **Two-Stage Training Strategy**: PUMA employs a training strategy that combines large-scale pre-training with task-specific instruction tuning, allowing the model to perform excellently across various tasks.
4. **Experimental Validation**: Experiments on multiple benchmark datasets validate the effectiveness and superiority of PUMA in tasks such as image reconstruction, semantic-guided generation, diverse text-to-image generation, image editing, and conditional generation.

### Method Overview

The PUMA method includes three key components:

1. **Image Encoder**: Extracts multi-granular image features, serving as the foundation for visual generation and understanding.
2. **Multi-granular Visual Decoder**: Generates images based on features of different granularities.
3. **Multi-granular Autoregressive MLLM**: Handles and generates text and multi-granular image features.

Through the collaborative work of these components, PUMA excels in tasks with various granularity requirements, marking an important step towards achieving more general and powerful multimodal large language models.
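To make the idea of "multi-granular image features" more concrete, the following is a minimal sketch of one way a fine feature grid could be reduced to progressively coarser levels. It assumes that coarser granularities are obtained by repeated 2x2 average pooling; the summary above only states that multi-scale features are extracted, so the pooling scheme, the function names `avg_pool_2x2` and `multi_granular_features`, and the toy grid are all illustrative assumptions rather than the authors' implementation.

```python
from typing import List

# A feature map stored as nested lists: height x width x channels.
Grid = List[List[List[float]]]


def avg_pool_2x2(grid: Grid) -> Grid:
    """Halve a feature grid by averaging each non-overlapping 2x2 block.

    A trailing odd row or column is dropped (hypothetical choice for brevity).
    """
    h, w, c = len(grid), len(grid[0]), len(grid[0][0])
    pooled: Grid = []
    for i in range(0, h - 1, 2):
        row = []
        for j in range(0, w - 1, 2):
            block = [grid[i][j], grid[i][j + 1],
                     grid[i + 1][j], grid[i + 1][j + 1]]
            # Average the four feature vectors channel by channel.
            row.append([sum(v[k] for v in block) / 4.0 for k in range(c)])
        pooled.append(row)
    return pooled


def multi_granular_features(grid: Grid, num_levels: int) -> List[Grid]:
    """Return feature grids ordered from finest to coarsest.

    Assumption: each coarser level is a 2x2 average pool of the previous one.
    Stops early if the grid can no longer be halved.
    """
    levels = [grid]
    for _ in range(num_levels - 1):
        last = levels[-1]
        if len(last) < 2 or len(last[0]) < 2:
            break
        levels.append(avg_pool_2x2(last))
    return levels


if __name__ == "__main__":
    # Toy 4x4 grid with a single channel holding the values 0..15.
    toy = [[[float(i * 4 + j)] for j in range(4)] for i in range(4)]
    levels = multi_granular_features(toy, num_levels=3)
    # Number of visual tokens at each granularity, finest first.
    print([len(g) * len(g[0]) for g in levels])  # [16, 4, 1]
```

In a sketch like this, the coarsest level carries only a global summary of the image, which would suit diverse text-to-image generation, while the finest level retains spatial detail, which would suit editing and other tasks that need precise control.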

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

InfMLLM: A Unified Framework for Visual-Language Tasks

UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model

Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants

MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

Unified Generative and Discriminative Training for Multi-modal Large Language Models

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality

u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages

MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data

MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding

UnifiedVisionGPT: Streamlining Vision-Oriented AI through Generalized Multimodal Framework

MuLan: Multimodal-LLM Agent for Progressive and Interactive Multi-Object Diffusion

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning