Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion

Philipp Allgeuer,Kyra Ahrens,Stefan Wermter
2024-07-18
Abstract:We introduce NOVIC, an innovative uNconstrained Open Vocabulary Image Classifier that uses an autoregressive transformer to generatively output classification labels as language. Leveraging the extensive knowledge of CLIP models, NOVIC harnesses the embedding space to enable zero-shot transfer from pure text to images. Traditional CLIP models, despite their ability for open vocabulary classification, require an exhaustive prompt of potential class labels, restricting their application to images of known content or context. To address this, we propose an "object decoder" model that is trained on a large-scale 92M-target dataset of templated object noun sets and LLM-generated captions to always output the object noun in question. This effectively inverts the CLIP text encoder and allows textual object labels to be generated directly from image-derived embedding vectors, without requiring any a priori knowledge of the potential content of an image. The trained decoders are tested on a mix of manually and web-curated datasets, as well as standard image classification benchmarks, and achieve fine-grained prompt-free prediction scores of up to 87.5%, a strong result considering the model must work for any conceivable image and without any contextual clues.
Computer Vision and Pattern Recognition,Artificial Intelligence,Computation and Language
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve This paper aims to address several key challenges in open vocabulary image classification: 1. **Unconstrained Open Vocabulary Image Classification**: Existing vision-language models (such as CLIP) require an exhaustive list of category candidates for zero-shot classification, which limits their application when dealing with images of unknown content or context. The paper proposes a new method called NOVIC (uNconstrained Open Vocabulary Image Classifier), which can directly generate free-form object nouns from images without any predefined category candidates or prompts. 2. **Real-time Capability**: Traditional multimodal large language models (LLMs), although powerful, are computationally expensive and require dedicated remote servers for inference, making them unsuitable for real-time response needs. NOVIC aims to generate object nouns in real-time, suitable for video frame rate applications. 3. **Extensive Object Recognition Capability**: Existing open vocabulary learning methods typically rely on limited candidate label lists, which contain at most a few thousand entries and cannot cover all possible object categories. NOVIC, by using a large-scale synthetic text dataset for training, can generate truly unconstrained object labels, enabling fine-grained classification of any image. ### Main Contributions 1. **Innovative Open Vocabulary Object Decoder Model**: This model is trained solely on text data and can perform zero-shot classification on any image in real-time without providing any category candidates or prompts. 2. **Automated Construction of a Comprehensive English Object Noun Dictionary**: Using a multi-set prompt scheme and large language models (LLMs) to generate a large-scale synthetic title-object dataset. 3. **Creation of Three New Open Vocabulary Image Datasets**: These datasets are used to evaluate open vocabulary classification performance and provide annotations from humans and multimodal LLMs. 4. **Superior Performance**: Experimental results show that NOVIC's performance improves with the underlying CLIP model, achieving up to 87.5% prediction accuracy in real-world scenarios. ### Method Overview 1. **Dataset Generation**: Synthetic generation of title-object pairs that map to target object nouns. Using an English dictionary, prompt templates, and LLMs to generate a large-scale synthetic dataset. 2. **Object Decoder Training**: Using the frozen text encoder of the CLIP model to encode titles into text embedding vectors, and adding noise online to enhance the model's generalization ability. Training a decoder-only Transformer model to generate object nouns corresponding to each title. 3. **Inference Process**: During inference, the object decoder can seamlessly generalize to image embedding vectors computed by the CLIP model's image encoder, despite the typically large modality gap in the CLIP model. ### Experimental Results - On standard image classification benchmarks (such as ImageNet-1K), NOVIC demonstrated competitive Top-1 accuracy. - Performance on open vocabulary image datasets was particularly outstanding, achieving up to 87.5% prediction accuracy. - Compared to human annotations, LLM annotations scored slightly lower but still provided valuable insights into model performance. Overall, NOVIC addresses multiple key issues in open vocabulary image classification through innovative methods, achieving efficient, accurate, and real-time object recognition.