Abstract: We present TokenFlow, a novel unified image tokenizer that bridges the long-standing gap between multimodal understanding and generation. Prior research attempts to employ a single reconstruction-targeted Vector Quantization (VQ) encoder to unify these two tasks. We observe that understanding and generation require fundamentally different granularities of visual information. This leads to a critical trade-off, particularly compromising performance in multimodal understanding tasks. TokenFlow addresses this challenge through an innovative dual-codebook architecture that decouples semantic and pixel-level feature learning while maintaining their alignment via a shared mapping mechanism. Through shared indices, this design enables direct access to both the high-level semantic representations crucial for understanding tasks and the fine-grained visual features essential for generation. Our extensive experiments demonstrate TokenFlow's superiority across multiple dimensions. Leveraging TokenFlow, we demonstrate for the first time that discrete visual input can surpass LLaVA-1.5 13B in understanding performance, achieving a 7.2% average improvement. For image reconstruction, we achieve a strong FID score of 0.63 at 384×384 resolution. Moreover, TokenFlow establishes state-of-the-art performance in autoregressive image generation with a GenEval score of 0.55 at 256×256 resolution, achieving results comparable to SDXL.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **How can a unified image tokenizer be constructed that efficiently handles multimodal understanding and generation tasks at the same time?** Specifically, existing methods usually use different architectures and encoding methods for multimodal understanding and for generation, which increases model complexity and makes it hard to balance the representation of semantic information against pixel-level detail.

### Problem Background

1. **Differences between Multimodal Understanding and Generation**:
   - Multimodal understanding tasks require rich semantic representations to support complex reasoning.
   - Visual generation tasks require accurate encoding of spatial structures and texture details.
2. **Limitations of Existing Methods**:
   - Most current methods use vector quantization (VQ) encoders optimized for reconstruction. These encoders mainly optimize low-level reconstruction quality, which limits their ability to capture high-level semantic features.
   - Some methods handle understanding and generation with separate encoders, but this increases model complexity and does not fundamentally resolve the difference in representation.

### Core Problem of the Paper

The key question raised in the paper is: **Can a single image tokenizer be used to extract representations suitable for both multimodal understanding and generation tasks?**

### Solution

To address this question, the authors propose **TokenFlow**, a new type of unified image tokenizer. The main innovations of TokenFlow include:

- **Dual-codebook Architecture**:
  - TokenFlow introduces two independent codebooks that learn semantic features and pixel-level features respectively, while maintaining their consistency through a shared mapping mechanism.
  - This design gives the model direct access to both high-level semantic representations and fine-grained visual features, so it performs well in both understanding and generation tasks.
- **Joint Optimization Mechanism**:
  - During quantization, TokenFlow considers both the semantic distance \(d_{\text{sem}}\) and the pixel distance \(d_{\text{pix}}\), and determines the optimal quantization index \(i^*\) by minimizing the weighted sum \(d_{\text{sem}}+w_{\text{dis}}\cdot d_{\text{pix}}\).
  - The formula is as follows:
    \[ i^*=\arg\min_i(d_{\text{sem},i}+w_{\text{dis}}\cdot d_{\text{pix},i}) \]
  - Here, \(w_{\text{dis}}\) is the distance-balancing weight.

### Experimental Results

The experimental results show that TokenFlow performs strongly along multiple dimensions:

- **Multimodal Understanding Tasks**: TokenFlow is the first to demonstrate that discrete visual input can outperform LLaVA-1.5 13B, with an average 7.2% improvement in understanding performance.
- **Image Reconstruction**: At a resolution of 384×384, TokenFlow achieves an FID score of 0.63.
- **Autoregressive Image Generation**: At a resolution of 256×256, TokenFlow achieves a GenEval score of 0.55, which is comparable to SDXL.

In summary, through its dual-codebook design and joint optimization mechanism, TokenFlow bridges the gap between multimodal understanding and generation tasks and provides a more general and efficient solution.
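
The shared-index selection described under the joint optimization mechanism can be illustrated with a small sketch. This is not the authors' implementation: the function names (`sq_dist`, `joint_quantize`), the toy codebooks, and the use of squared Euclidean distance for both \(d_{\text{sem}}\) and \(d_{\text{pix}}\) are assumptions made only for illustration. It shows how a single index \(i^*\) is chosen from the weighted sum of the two distances and then used to read from both codebooks.

```python
from typing import List, Sequence, Tuple


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def joint_quantize(
    z_sem: Sequence[float],
    z_pix: Sequence[float],
    sem_codebook: List[Sequence[float]],
    pix_codebook: List[Sequence[float]],
    w_dis: float = 1.0,
) -> Tuple[int, Sequence[float], Sequence[float]]:
    """Return the shared index i* = argmin_i (d_sem_i + w_dis * d_pix_i)
    together with the semantic and pixel codes stored at that index."""
    if len(sem_codebook) != len(pix_codebook):
        raise ValueError("both codebooks must have the same number of entries")

    def joint_distance(i: int) -> float:
        d_sem = sq_dist(z_sem, sem_codebook[i])
        d_pix = sq_dist(z_pix, pix_codebook[i])
        return d_sem + w_dis * d_pix

    best_i = min(range(len(sem_codebook)), key=joint_distance)
    return best_i, sem_codebook[best_i], pix_codebook[best_i]


# Hypothetical toy codebooks with three entries each.
sem_codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
pix_codebook = [[0.0], [5.0], [1.0]]
z_sem, z_pix = [1.0, 0.9], [1.2]

# With w_dis = 0 only the semantic distance matters, so entry 1 wins.
idx_sem_only, _, _ = joint_quantize(z_sem, z_pix, sem_codebook, pix_codebook, w_dis=0.0)
# With w_dis = 1 the large pixel distance of entry 1 shifts the choice to entry 2.
idx_joint, sem_code, pix_code = joint_quantize(z_sem, z_pix, sem_codebook, pix_codebook, w_dis=1.0)
print(idx_sem_only, idx_joint, sem_code, pix_code)
```

In this toy setup, changing \(w_{\text{dis}}\) changes which entry is selected, which is the balancing role the summary attributes to the weight. Because both codebooks are indexed by the same \(i^*\), a single token gives access to a semantic code and a pixel-level code.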

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
