Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, Ping Luo
2024-10-18
Abstract: In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research, such as Chameleon, often relies on a single visual encoder for both tasks. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified models and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
### What problem does this paper attempt to solve?

This paper addresses the conflict that arises when a single visual encoder is shared between multimodal understanding and visual generation. Existing unified multimodal models usually rely on one visual encoder for both understanding tasks (such as image recognition and captioning) and generation tasks (such as generating an image from text). However, the two kinds of task require different levels of information granularity: multimodal understanding needs high-level semantic information, while visual generation depends on local detail and global consistency. Representing both in the same space forces trade-offs that hurt performance.

To solve this problem, the authors propose a framework named Janus. Janus decouples visual encoding into separate pathways, with one independent encoder for multimodal understanding and another for generation, and processes both with a single unified Transformer. This alleviates the conflict between the visual encoder's two roles and also makes the framework more flexible and extensible.

#### Benefits of decoupling visual encoding:

1. **Reduce conflicts between tasks**: Each task can use the encoding method best suited to it, so no single encoder has to be a compromise between the two.
2. **Improve flexibility**: Each task can adopt the most advanced encoding techniques independently, and the framework can be extended to support other input types (such as point clouds, EEG signals, or audio data) in the future.

The experimental results show that Janus outperforms existing unified models on both multimodal understanding and generation benchmarks, and even surpasses task-specific models on some tasks. This indicates that Janus combines strong performance with high flexibility and good extensibility, making it a strong candidate for next-generation multimodal models.

#### Formula representation:

- The cross-entropy loss function used in the training process is:

  \[ L = -\sum_{i=1}^{n} \log P_\theta(x_i \mid x_{<i}) \]

  where \(P_\theta(\cdot \mid \cdot)\) is the conditional probability modeled by Janus with weight parameters \(\theta\).

In summary, Janus addresses the difference in information granularity between multimodal understanding and generation by decoupling the visual encoding paths, which improves the model's overall performance and flexibility.
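
To make the cross-entropy objective concrete, the following is a minimal sketch in plain Python of the sum of negative log-probabilities over a token sequence. It is only an illustration under stated assumptions, not the Janus training code: the function names `softmax` and `autoregressive_nll` are hypothetical, and the per-step logits are assumed to have already been produced by some model conditioned on the preceding tokens.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def autoregressive_nll(step_logits, targets):
    """Compute L = -sum_i log P(x_i | x_<i).

    step_logits[i] holds the model's scores over the vocabulary for
    position i, assumed to be conditioned on tokens before position i.
    targets[i] is the index of the true token x_i.
    (Both names are illustrative assumptions, not part of Janus.)
    """
    if len(step_logits) != len(targets):
        raise ValueError("need exactly one logit vector per target token")
    loss = 0.0
    for logits, target in zip(step_logits, targets):
        probs = softmax(logits)
        loss -= math.log(probs[target])
    return loss


# Toy usage: a vocabulary of 3 tokens and a sequence of length 2.
# Uniform logits give each token probability 1/3 at every step,
# so the loss equals 2 * log(3).
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [1, 2]
toy_loss = autoregressive_nll(toy_logits, toy_targets)
```

In practice a framework routine would compute this in batched form; the explicit loop is used here only to mirror the summation over \(i\) in the formula.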