Abstract: We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an "object decoder" model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels to be generated directly from image-derived embedding vectors, without requiring any a priori knowledge of the potential content of an image. The trained decoders are tested on a mix of manually and web-curated datasets, as well as standard image classification benchmarks, and achieve fine-grained prompt-free prediction scores of up to 87.5%, a strong result considering the model must work for any conceivable image and without any contextual clues.
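To make the limitation of prompt-based classification concrete, the following is a minimal sketch of how a CLIP-style zero-shot classifier picks a label from a fixed candidate list. The three-dimensional toy vectors stand in for real CLIP embeddings, and the names `cosine_similarity` and `classify_with_candidates` are hypothetical helpers introduced for illustration, not part of any CLIP library.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify_with_candidates(image_embedding, label_embeddings):
    """Return the candidate label whose text embedding is closest to the image.

    The result is always one of the supplied labels, even if none of them
    actually describes the image.
    """
    return max(
        label_embeddings,
        key=lambda label: cosine_similarity(image_embedding, label_embeddings[label]),
    )


# Toy stand-ins for text embeddings of the candidate prompts (assumed values).
toy_label_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
}

# Toy image embedding that points mostly along a direction no candidate covers.
toy_image_embedding = [0.2, 0.1, 0.95]

prediction = classify_with_candidates(toy_image_embedding, toy_label_embeddings)
print(prediction)  # prints "cat", although neither candidate fits well
```

The classifier can only ever answer with a label it was given, which is the restriction the abstract describes; a generative decoder removes the candidate list altogether.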

What problem does this paper attempt to address?

### Problems the Paper Aims to Solve

This paper aims to address several key challenges in open vocabulary image classification:

1. **Unconstrained Open Vocabulary Image Classification**: Existing vision-language models (such as CLIP) require an exhaustive list of candidate categories for zero-shot classification, which limits their use on images of unknown content or context. The paper proposes a new method called NOVIC (uNconstrained Open Vocabulary Image Classifier), which generates free-form object nouns directly from images without any predefined candidate categories or prompts.
2. **Real-time Capability**: Multimodal large language models (LLMs), although powerful, are computationally expensive and require dedicated remote servers for inference, making them unsuitable for real-time use. NOVIC aims to generate object nouns in real time, fast enough for video frame rate applications.
3. **Extensive Object Recognition Capability**: Existing open vocabulary learning methods typically rely on limited candidate label lists, which contain at most a few thousand entries and cannot cover all possible object categories. By training on a large-scale synthetic text dataset, NOVIC can generate truly unconstrained object labels, enabling fine-grained classification of any image.

### Main Contributions

1. **Innovative Open Vocabulary Object Decoder Model**: The model is trained solely on text data and can perform zero-shot classification on any image in real time, without being given any candidate categories or prompts.
2. **Automated Construction of a Comprehensive English Object Noun Dictionary**: A multi-set prompting scheme and large language models (LLMs) are used to generate a large-scale synthetic caption-object dataset.
3. **Creation of Three New Open Vocabulary Image Datasets**: These datasets are used to evaluate open vocabulary classification performance and come with annotations from humans and multimodal LLMs.
4. **Superior Performance**: Experimental results show that NOVIC's performance improves with the underlying CLIP model, reaching prompt-free prediction scores of up to 87.5%.

### Method Overview

1. **Dataset Generation**: Caption-object pairs are generated synthetically, with each caption mapping to a target object noun. An English dictionary, prompt templates, and LLMs are used to build the large-scale synthetic dataset.
2. **Object Decoder Training**: The frozen text encoder of the CLIP model encodes each caption into a text embedding vector, and noise is added online to improve generalization. A decoder-only Transformer is then trained to generate the object noun corresponding to each caption.
3. **Inference Process**: During inference, the object decoder generalizes to image embedding vectors computed by the CLIP model's image encoder, despite the typically large modality gap between CLIP's text and image embeddings.

### Experimental Results

- On standard image classification benchmarks (such as ImageNet-1K), NOVIC achieved competitive Top-1 accuracy.
- On the open vocabulary image datasets, NOVIC reached prediction scores of up to 87.5%.
- LLM annotations scored slightly lower than human annotations but still provided useful insight into model performance.

Overall, NOVIC addresses several key issues in open vocabulary image classification, providing efficient, accurate, and real-time object recognition.
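The training step in the method overview (encode a caption, add noise online, pair it with its target noun) can be sketched as follows. This is a minimal illustration under stated assumptions: the summary only says that noise is added online, so the Gaussian noise, the `noise_std` value, the renormalization to unit length, and the helper names `l2_normalize`, `add_embedding_noise`, and `make_training_pair` are all hypothetical choices, and the short toy vector stands in for a real CLIP text embedding.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def add_embedding_noise(embedding, noise_std, rng):
    """Perturb an embedding with Gaussian noise, then renormalize.

    Assumption: Gaussian noise followed by renormalization; the source does
    not specify the noise distribution or scale.
    """
    noisy = [x + rng.gauss(0.0, noise_std) for x in embedding]
    return l2_normalize(noisy)


def make_training_pair(caption_embedding, target_noun, noise_std, rng):
    """Build one (noisy embedding, target noun) example for the decoder.

    Fresh noise is drawn on every call, so the same caption yields a
    different input vector each time it is seen during training.
    """
    return add_embedding_noise(caption_embedding, noise_std, rng), target_noun


# Toy stand-in for the frozen text encoder's output for one caption.
rng = random.Random(0)
caption_embedding = l2_normalize([1.0, 2.0, 2.0])
noisy_embedding, noun = make_training_pair(caption_embedding, "teapot", 0.1, rng)
```

One plausible reading of why this helps is that training on perturbed text embeddings makes the decoder tolerant of inputs that lie near, but not exactly on, the text embedding distribution, which is the situation it faces when given image embeddings at inference time.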

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion
