Abstract: We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method - one-shot sampling - to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
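
The masking rule described in the abstract can be illustrated with a small sketch. It assumes the sequence is laid out as image tokens followed by the tokens of each label, that image tokens attend to one another in both directions, and that no delimiter or special tokens are present. The function name `build_label_mask` and its arguments are hypothetical and are not taken from the released code.

```python
def build_label_mask(num_image_tokens, label_lengths):
    """Return a square boolean matrix; mask[i][j] is True when position i may attend to j.

    Assumed layout (illustrative only): [image tokens][label 1 tokens][label 2 tokens]...
    """
    total = num_image_tokens + sum(label_lengths)
    mask = [[False] * total for _ in range(total)]

    # Image tokens act as a prefix: here they see each other in both directions (assumption).
    for i in range(num_image_tokens):
        for j in range(num_image_tokens):
            mask[i][j] = True

    start = num_image_tokens
    for length in label_lengths:
        for i in range(start, start + length):
            # Every label token sees the whole image prefix.
            for j in range(num_image_tokens):
                mask[i][j] = True
            # Within a label, attention is causal: current and earlier tokens only.
            for j in range(start, i + 1):
                mask[i][j] = True
            # Tokens of other labels stay masked, which keeps labels independent.
        start += length
    return mask


if __name__ == "__main__":
    # Two image tokens, then a two-token label and a one-token label.
    for row in build_label_mask(2, [2, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no label token attends to another label's tokens, the order in which labels appear in the sequence does not affect their predictions under this sketch.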

What problem does this paper attempt to address?

The paper attempts to address the problem of eliminating the dependence on predefined object labels or descriptions in object recognition tasks, thereby improving the model's flexibility and generalization ability. Specifically, the paper proposes a new method that transforms the object recognition task into a next-token prediction task by using a language decoder (such as a Transformer-based language model) to autoregressively predict object labels from image embeddings. This approach aims to overcome the limitations of traditional methods (such as CLIP) that require a predefined set of object descriptions for object recognition, which may not cover all possible object categories and could lead to performance degradation in practical applications.

The key innovations of the paper include:

1. **Non-causal Attention Mask**: To make tokens between different labels independent while maintaining conditional relevance within the same label, the paper designs a non-causal attention mask mechanism. This mechanism not only improves the efficiency of the model but also enables the model to generate tokens for multiple labels in parallel.
2. **One-shot Sampling**: A new sampling method called one-shot sampling is proposed, which can generate tokens for multiple labels in parallel in a single operation and sort them based on their probabilities. This leverages the powerful parallelization capabilities of the Transformer model, making the object recognition process more efficient.
3. **Compact Decoder**: To improve the efficiency of the model, the paper proposes a simple strategy to construct a compact decoder by removing intermediate blocks from the pre-trained language model. Experimental results show that this compact decoder performs comparably to the full model but with significantly faster inference speed.

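
The ranking step of one-shot sampling can be sketched as follows. This is only an illustration: it assumes each candidate label is scored by the product of its token probabilities, and it takes those probabilities as given rather than producing them with a decoder. The function name `rank_labels` and the example values are hypothetical.

```python
import math


def rank_labels(candidates):
    """Sort candidate labels by probability, highest first.

    `candidates` maps a label string to the list of probabilities of its tokens.
    Scoring a label by the product of its token probabilities is an assumption
    made for this sketch.
    """
    scored = [(label, math.prod(token_probs)) for label, token_probs in candidates.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored


if __name__ == "__main__":
    # Made-up token probabilities for three labels generated in parallel.
    example = {"dog": [0.6, 0.9], "cat": [0.3, 0.8], "sofa": [0.5]}
    for label, prob in rank_labels(example):
        print(label, round(prob, 3))
```
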
Overall, the goal of the paper is to develop a more flexible and efficient object recognition method that can accurately predict object labels from images without relying on predefined labels.
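
The compact-decoder idea of removing intermediate blocks can be sketched on a plain list standing in for a model's stack of Transformer blocks. The helper name `drop_intermediate_blocks`, the block count of 32, and the choice of how many leading and trailing blocks to keep are assumptions for illustration, not the configuration reported in the paper.

```python
def drop_intermediate_blocks(blocks, keep_first, keep_last):
    """Return a shorter stack made of the first `keep_first` and last `keep_last` blocks.

    `blocks` is any sequence of layers; in a real model it would be the decoder's
    block list. The middle blocks are simply discarded.
    """
    if keep_first < 0 or keep_last < 0:
        raise ValueError("keep_first and keep_last must be non-negative")
    if keep_first + keep_last >= len(blocks):
        # Nothing to discard: the requested blocks already cover the whole stack.
        return list(blocks)
    head = list(blocks[:keep_first])
    tail = list(blocks[len(blocks) - keep_last:])
    return head + tail


if __name__ == "__main__":
    # Stand-in for a 32-block pretrained decoder (hypothetical size).
    full_stack = [f"block_{i}" for i in range(32)]
    compact = drop_intermediate_blocks(full_stack, keep_first=6, keep_last=1)
    print(len(full_stack), "->", len(compact), "blocks")
```

In a real setting the kept blocks would retain their pretrained weights, and the shortened decoder would then be trained on the recognition task.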

Object Recognition as Next Token Prediction

PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Learning to Decode for Future Success

Generalized Decoding for Pixel, Image, and Language

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

Mechanics of Next Token Prediction with Self-Attention

Emu3: Next-Token Prediction is All You Need

The pitfalls of next-token prediction

Detector Guidance for Multi-Object Text-to-Image Generation

Tokenize Anything via Prompting

Helping Hands: An Object-Aware Ego-Centric Video Recognition Model

DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

Implicit Geometry of Next-token Prediction: From Language Sparsity Patterns to Model Representations

NOPE: Novel Object Pose Estimation from a Single Image

Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling

Pointing the Unknown Words

Lazy-k: Decoding for Constrained Token Classification

Transformer with token attention and attribute prediction for image captioning