Object Recognition as Next Token Prediction

Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim
2024-04-01
Abstract: We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method - one-shot sampling - to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to remove the dependence on predefined object labels or descriptions in object recognition, making the model more flexible and better able to generalize. It recasts object recognition as next-token prediction: a language decoder (a Transformer-based language model) autoregressively predicts object labels from image embeddings. This avoids a limitation of methods such as CLIP, which require a predefined set of object descriptions that may not cover every object category and can therefore degrade performance in practical applications. The key innovations of the paper are:
1. **Non-causal Attention Mask**: The paper designs a non-causal attention mask that makes tokens from different labels independent of each other while keeping tokens within the same label conditionally dependent. Image tokens are treated as a prefix. This mechanism improves the efficiency of the model and lets it generate tokens for multiple labels in parallel.
2. **One-shot Sampling**: A new sampling method, one-shot sampling, generates tokens for multiple labels in parallel and then ranks the generated labels by their probabilities. It exploits the parallelism of the Transformer to make recognition more efficient.
3. **Compact Decoder**: To further improve efficiency, the paper constructs a compact decoder by discarding the intermediate blocks of a pretrained language model. Experimental results show that this compact decoder matches the full model's performance while running inference notably faster.
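The attention pattern in item 1 can be illustrated with a small sketch. This is not the authors' implementation: the function name `build_label_mask`, the boolean-matrix layout, and the choice to let image tokens attend to each other bidirectionally are assumptions made for illustration. The sequence is assumed to be the image tokens followed by the tokens of each label in turn.

```python
def build_label_mask(num_image_tokens, label_lengths):
    """Build a boolean attention mask (hypothetical helper, not from the paper's code).

    mask[q][k] is True when the token at position q may attend to position k.
    Assumed layout: [image tokens][label 1 tokens][label 2 tokens]...
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Image tokens act as a prefix: assumed here to attend to each other freely.
    for q in range(num_image_tokens):
        for k in range(num_image_tokens):
            mask[q][k] = True

    start = num_image_tokens
    for length in label_lengths:
        for q in range(start, start + length):
            # Every label token sees the whole image prefix.
            for k in range(num_image_tokens):
                mask[q][k] = True
            # Within a label, attention is causal: earlier tokens and itself.
            for k in range(start, q + 1):
                mask[q][k] = True
            # Tokens of other labels stay masked, keeping labels independent.
        start += length
    return mask


if __name__ == "__main__":
    # Two image tokens, one two-token label, one single-token label.
    for row in build_label_mask(2, [2, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no label token can see another label's tokens, the decoder can in principle process all labels in the same forward pass, which is what makes parallel sampling possible.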
Overall, the goal of the paper is to develop a more flexible and efficient object recognition method that can accurately predict object labels from images without relying on predefined labels.
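The ranking step of one-shot sampling can be sketched as follows. The summary only says that generated labels are ranked by their probabilities; scoring a label by the product of its token probabilities is an assumption here, and `rank_labels` and the example numbers are hypothetical.

```python
import math


def rank_labels(candidates):
    """Rank candidate labels by joint probability (hypothetical helper).

    candidates maps a label string to the list of probabilities the decoder
    assigned to each of its tokens. The label score is assumed to be the
    product of those probabilities, accumulated in log space for stability.
    """
    scored = []
    for label, token_probs in candidates.items():
        log_prob = sum(math.log(p) for p in token_probs)
        scored.append((label, math.exp(log_prob)))
    # Highest-probability label first.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


if __name__ == "__main__":
    # Illustrative numbers only, not results from the paper.
    example = {"dog": [0.9, 0.8], "cat": [0.5], "sofa": [0.6, 0.5]}
    for label, score in rank_labels(example):
        print(f"{label}: {score:.2f}")
```

A product of token probabilities favors shorter labels, so a real system might normalize by length; the summary does not say whether the paper does this.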