Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Yongqi Li, Hongru Cai, Wenjie Wang, Leigang Qu, Yinwei Wei, Wenjie Li, Liqiang Nie, Tat-Seng Chua

2024-07-24

Abstract: Text-to-image retrieval is a fundamental task in multimedia processing, aiming to retrieve semantically relevant cross-modal content. Traditional studies have typically approached this task as a discriminative problem, matching the text and image via the cross-attention mechanism (one-tower framework) or in a common embedding space (two-tower framework). Recently, generative cross-modal retrieval has emerged as a new research line, which assigns images with unique string identifiers and generates the target identifier as the retrieval target. Despite its great potential, existing generative approaches are limited due to the following issues: insufficient visual information in identifiers, misalignment with high-level semantics, and learning gap towards the retrieval target. To address the above issues, we propose an autoregressive voken generation method, named AVG. AVG tokenizes images into vokens, i.e., visual tokens, and innovatively formulates the text-to-image retrieval task as a token-to-voken generation problem. AVG discretizes an image into a sequence of vokens as the identifier of the image, while maintaining the alignment with both the visual information and high-level semantics of the image. Additionally, to bridge the learning gap between generative training and the retrieval target, we incorporate discriminative training to modify the learning direction during token-to-voken training. Extensive experiments demonstrate that AVG achieves superior results in both effectiveness and efficiency.
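To make the token-to-voken framing concrete, the following is a minimal sketch of how a ranked list could be read off a generative model: each candidate image is scored by the log-probability of its voken identifier given the text query. Everything here is an illustrative assumption rather than the paper's implementation: the names `sequence_log_prob`, `rank_images`, `identifiers`, and `step_probs` are hypothetical, the codebook has only three vokens, and the per-step distributions are fixed instead of being conditioned on previously generated vokens as a real autoregressive model would do.

```python
import math


def sequence_log_prob(step_probs, vokens):
    """Sum of log-probabilities of each voken in an identifier.

    step_probs[t] maps a voken id to its probability at step t.
    Assumption: the distributions are fixed per step; a real
    autoregressive model would condition on the earlier vokens.
    """
    return sum(math.log(step_probs[t][v]) for t, v in enumerate(vokens))


def rank_images(step_probs, identifiers):
    """Return image names sorted by descending identifier log-probability."""
    scores = {
        name: sequence_log_prob(step_probs, vokens)
        for name, vokens in identifiers.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical identifiers: each image is a sequence of two vokens.
identifiers = {"img_a": [0, 2], "img_b": [1, 2], "img_c": [0, 1]}

# Hypothetical per-step voken distributions for a single text query.
step_probs = [
    {0: 0.7, 1: 0.2, 2: 0.1},
    {0: 0.1, 1: 0.3, 2: 0.6},
]

ranking = rank_images(step_probs, identifiers)
print(ranking)
```

In practice, generative retrieval systems commonly avoid scoring every candidate and instead decode identifiers with beam search restricted to valid identifier prefixes; the exhaustive scoring above is only meant to show what is being ranked.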

Multimedia, Artificial Intelligence, Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address several key issues in the Text-to-Image Retrieval task. Specifically:

1. **Insufficient Visual Information**: Existing generative methods assign identifiers (such as image IDs) to images that lack sufficient visual information, making it difficult for the model to accurately recognize image content during inference.
2. **Semantic Alignment Issue**: Current methods assign identifiers based solely on the visual content of images, ignoring high-level semantic information related to the text query.
3. **Learning Objective Gap**: Generative training focuses on predicting the correct image identifier, while the retrieval task requires obtaining a high-quality ranking list, leading to a discrepancy in learning objectives.

To address the above issues, the authors propose an Autoregressive Voken Generation (AVG) method, redefining the text-to-image retrieval task as generating image vokens (i.e., visual tokens) from text sequences. By introducing a cross-modal aligned image tokenizer, AVG not only retains the low-level visual information of images but also injects corresponding high-level semantic information, thereby improving the effectiveness and efficiency of cross-modal retrieval. Additionally, the study introduces an auxiliary discriminative loss to correct the learning direction bias between generative training and retrieval objectives.

Experimental results show that AVG significantly outperforms previous generative cross-modal retrieval methods on the Flickr and MS-COCO datasets and also has advantages in efficiency.
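The idea of pairing generative training with an auxiliary discriminative loss can be sketched as follows. This is a hedged illustration, not the authors' formulation: the summary does not state the form of the discriminative term, so an in-batch contrastive (InfoNCE-style) loss is assumed here, and the function names and the `weight` parameter are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def generative_loss(step_logits, target_vokens):
    """Mean cross-entropy of the target voken at each decoding step."""
    losses = [
        -log_softmax(logits)[v]
        for logits, v in zip(step_logits, target_vokens)
    ]
    return sum(losses) / len(losses)


def contrastive_loss(similarities, positive_index):
    """Assumed discriminative term: InfoNCE over one query's similarity
    scores to every image in the batch, with one positive image."""
    return -log_softmax(similarities)[positive_index]


def joint_loss(step_logits, target_vokens, similarities, positive_index,
               weight=0.5):
    """Generative loss plus a weighted discriminative loss.

    `weight` is a hypothetical hyperparameter for this sketch.
    """
    gen = generative_loss(step_logits, target_vokens)
    disc = contrastive_loss(similarities, positive_index)
    return gen + weight * disc


# Toy inputs: two decoding steps over a two-voken codebook,
# and a batch of four candidate images with equal similarity.
example = joint_loss(
    step_logits=[[0.0, 0.0], [0.0, 0.0]],
    target_vokens=[0, 1],
    similarities=[0.0, 0.0, 0.0, 0.0],
    positive_index=2,
)
print(example)
```

The generative term rewards predicting the exact identifier, while the contrastive term pushes the matching image above the other images in the batch, which is closer to what a ranking metric measures.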

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge

Emage: Non-Autoregressive Text-to-Image Generation

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Autoregressive Image Generation without Vector Quantization

Unified Text-to-Image Generation and Retrieval

Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Randomized Autoregressive Visual Generation

TAVT: Towards Transferable Audio-Visual Text Generation

Image Understanding Makes for A Good Tokenizer for Image Generation

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Vector Quantized Diffusion Model for Text-to-Image Synthesis

Factorized Visual Tokenization and Generation

Variational Transformer: A Framework Beyond the Trade-off Between Accuracy and Diversity for Image Captioning

Learning to Tokenize for Generative Retrieval

UATVR: Uncertainty-Adaptive Text-Video Retrieval

Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation

Non-Autoregressive Video Captioning with Iterative Refinement