Abstract: In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
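The decoupling described in the abstract can be pictured with a small routing sketch. This is a toy illustration, not the authors' implementation: the encoder functions, the `D_MODEL` width, and the `build_sequence` helper are all hypothetical stand-ins, and the shared transformer itself is not modeled. The only point it shows is that each task selects its own visual pathway while both pathways feed one common input sequence.

```python
from typing import Callable, Dict, List

Vector = List[float]
D_MODEL = 4  # hypothetical embedding width shared by both pathways


def understanding_encoder(image: List[float]) -> List[Vector]:
    """Stand-in for a semantic encoder (assumption): one coarse feature per image."""
    mean = sum(image) / len(image)
    return [[mean] * D_MODEL]


def generation_encoder(image: List[float]) -> List[Vector]:
    """Stand-in for a fine-grained tokenizer (assumption): one feature per pixel."""
    return [[float(round(p))] * D_MODEL for p in image]


def text_embedder(tokens: List[int]) -> List[Vector]:
    """Stand-in for a text token embedding table (assumption)."""
    return [[float(t)] * D_MODEL for t in tokens]


# Each task independently selects its visual encoder.
ENCODERS: Dict[str, Callable[[List[float]], List[Vector]]] = {
    "understanding": understanding_encoder,
    "generation": generation_encoder,
}


def build_sequence(task: str, text_tokens: List[int], image: List[float]) -> List[Vector]:
    """Encode the image with the task-specific pathway, then concatenate it with
    the text embeddings into one sequence for a single shared transformer."""
    visual = ENCODERS[task](image)
    return text_embedder(text_tokens) + visual


image = [0.2, 0.8, 0.6, 0.4]
seq_understanding = build_sequence("understanding", [1, 2], image)
seq_generation = build_sequence("generation", [1, 2], image)
```

Swapping in a different encoder for one task only touches that entry in `ENCODERS`, which is the flexibility the abstract refers to.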

### What problem does this paper attempt to solve?

This paper addresses the problem of forcing multimodal understanding and visual generation to share one visual encoding. Existing unified multimodal models usually rely on a single visual encoder for both understanding tasks (such as image recognition and captioning) and generation tasks (such as producing an image from text). However, the two kinds of task need information at different granularities: multimodal understanding needs high-level semantic information, while visual generation is more concerned with local details and global consistency. Representing both in the same space therefore leads to conflicts and trade-offs that hurt performance.

To solve this problem, the authors propose a framework named Janus. Janus decouples visual encoding into separate pathways, with one independent encoder for multimodal understanding and another for generation, and processes both with a single unified Transformer. This alleviates the conflict between the visual encoder's two roles and also improves the flexibility and extensibility of the framework.

#### Benefits of decoupling visual encoding:

1. **Reduce conflicts between tasks**: Each task can use the encoding method that suits it best, so there is no need to compromise on a single visual encoder.
2. **Improve flexibility**: Each task can adopt the most advanced encoding techniques, and the framework can be extended in the future to other input types (such as point clouds, EEG signals, or audio data).

The experimental results show that Janus outperforms existing unified models on both multimodal understanding and generation benchmarks, and even surpasses task-specific models on some tasks. This indicates that Janus has strong performance, high flexibility, and good extensibility, and is a strong candidate for next-generation multimodal models.

#### Formula representation:

- The cross-entropy loss used during training is:

\[ L = -\sum_{i=1}^{n} \log P_\theta(x_i \mid x_{<i}) \]

where \(P_\theta(\cdot \mid \cdot)\) is the conditional probability modeled by Janus with weight parameters \(\theta\).

In summary, Janus addresses the difference in information granularity between multimodal understanding and generation by decoupling the visual encoding paths, improving the overall performance and flexibility of the model.
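As a worked illustration of the loss formula above, the following minimal sketch evaluates \(L\) for a short sequence. It assumes the per-step distributions have already been produced by some model; the function name `autoregressive_nll` and the toy probabilities are hypothetical and are not taken from the paper.

```python
import math
from typing import List


def autoregressive_nll(step_probs: List[List[float]], targets: List[int]) -> float:
    """Return L = -sum_i log P(x_i | x_<i).

    step_probs[i] is the predicted distribution over the vocabulary at
    position i, already conditioned on the preceding tokens x_<i.
    targets[i] is the index of the true token x_i.
    """
    if len(step_probs) != len(targets):
        raise ValueError("need exactly one distribution per target token")
    return -sum(math.log(probs[t]) for probs, t in zip(step_probs, targets))


# Toy example (made-up numbers): two positions, vocabulary of size 2.
toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_targets = [0, 1]
loss = autoregressive_nll(toy_probs, toy_targets)  # equals -(log 0.5 + log 0.75)
```

The loss shrinks toward zero as the model puts more probability on each true next token, which is what training on this objective encourages.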

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
