Abstract: The performance of vision-language models (VLMs), such as CLIP, in visual classification tasks has been enhanced by leveraging semantic knowledge from large language models (LLMs), including GPT. Recent studies have shown that in zero-shot classification tasks, descriptors incorporating additional cues, high-level concepts, or even random characters often outperform those using only the category name. In many classification tasks, while the top-1 accuracy may be relatively low, the top-5 accuracy is often significantly higher. This gap implies that most misclassifications occur among a few similar classes, highlighting the model's difficulty in distinguishing between classes with subtle differences. To address this challenge, we introduce a novel concept of comparative descriptors. These descriptors emphasize the unique features of a target class against its most similar classes, enhancing differentiation. By generating and integrating these comparative descriptors into the classification framework, we refine the semantic focus and improve classification accuracy. An additional filtering process ensures that these descriptors are closer to the image embeddings in the CLIP space, further enhancing performance. Our approach demonstrates improved accuracy and robustness in visual classification tasks by addressing the specific challenge of subtle inter-class differences.

What problem does this paper attempt to address?

The main problem this paper addresses is that, in visual classification tasks, existing models have difficulty distinguishing similar categories with subtle differences. Specifically:

1. **Background problems**:
   - In zero-shot classification tasks, vision-language models (VLMs) such as CLIP perform well on many benchmarks, yet they struggle to distinguish similar categories with subtle differences.
   - In many classification tasks, the top-1 accuracy is low while the top-5 accuracy is significantly higher, indicating that most misclassifications occur among a few similar categories.
2. **Research motivation**:
   - Existing methods use large language models (LLMs) to generate descriptors. These descriptors contain additional information, such as high-level concepts or random characters, which can improve classification performance. However, this approach sometimes generates irrelevant or ambiguous descriptors, which hurts classification.
   - To improve classification accuracy and address the limitations of existing methods, the authors propose a new concept: comparative descriptors. These descriptors enhance the discrimination between classes by emphasizing the features that are unique to the target class relative to its most similar classes.
3. **Solutions**:
   - **Generating comparative descriptors**: The authors propose a method that identifies semantically similar categories and asks an LLM to generate descriptors that highlight the unique features of the target class relative to these similar classes. For example, ask GPT: "How to distinguish photos of the target class and similar classes?"
   - **Filtering process**: To ensure that the generated descriptors are helpful for classification, the authors introduce a filtering process that retains descriptors with high similarity to image embeddings in the CLIP space. Specifically, only the top k descriptors with the highest similarity to the average image features of each category are retained.
4. **Contributions**:
   - A new concept of comparative descriptors is proposed, which emphasizes the differences between the target class and similar classes, reducing problems such as modal understanding difficulties and word-sense ambiguity.
   - A simple filtering process is proposed to ensure that only descriptors that contribute to classification are retained.
   - Experimental results show that this method significantly improves image classification performance on multiple datasets while maintaining the interpretability of model decisions.

In summary, this paper aims to solve the problem that existing models have difficulty distinguishing similar categories in visual classification tasks by introducing comparative descriptors and a filtering process, thereby improving classification accuracy and robustness.
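
The two solution steps summarized above (finding similar classes to build a comparative prompt, then keeping the top k descriptors) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the three-dimensional vectors are toy stand-ins for CLIP text and image embeddings, cosine similarity over class-name embeddings is an assumed way of finding "semantically similar" classes, and the actual LLM call is left out.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar_classes(target, class_embeddings, n=2):
    """Return the n class names whose embeddings are closest to the target's.

    Assumption: similarity between classes is measured by cosine similarity
    of class-name embeddings; the summary does not state the exact metric.
    """
    scores = [
        (cosine(class_embeddings[target], emb), name)
        for name, emb in class_embeddings.items()
        if name != target
    ]
    scores.sort(reverse=True)
    return [name for _, name in scores[:n]]

def comparative_prompt(target, similar):
    """Build a question in the spirit of the example prompt quoted above."""
    return f"How to distinguish photos of {target} from {', '.join(similar)}?"

def filter_descriptors(descriptor_embeddings, image_features, k=2):
    """Keep the k descriptors closest to the class's average image feature."""
    dim = len(image_features[0])
    mean = [sum(f[i] for f in image_features) / len(image_features)
            for i in range(dim)]
    ranked = sorted(
        descriptor_embeddings,
        key=lambda d: cosine(descriptor_embeddings[d], mean),
        reverse=True,
    )
    return ranked[:k]

# Toy data (hypothetical): stand-ins for CLIP embeddings of class names.
class_embeddings = {
    "husky": [0.8, 0.2, 0.1],
    "wolf": [0.9, 0.1, 0.0],
    "tabby cat": [0.1, 0.9, 0.2],
    "sports car": [0.0, 0.1, 0.9],
}

# Toy data (hypothetical): descriptors an LLM might return for "husky",
# paired with stand-in text embeddings.
husky_descriptors = {
    "blue or brown eyes with a facial mask": [0.7, 0.3, 0.1],
    "curled tail carried over the back": [0.8, 0.1, 0.2],
    "a kind of animal": [0.1, 0.2, 0.9],
}

# Toy data (hypothetical): stand-in image embeddings for the "husky" class.
husky_images = [[0.8, 0.2, 0.1], [0.7, 0.2, 0.2]]

similar = most_similar_classes("husky", class_embeddings, n=1)
prompt = comparative_prompt("husky", similar)
kept = filter_descriptors(husky_descriptors, husky_images, k=2)
```

In this toy setup the generic descriptor "a kind of animal" lies far from the mean image feature and is dropped, while the two husky-specific descriptors are kept. With real CLIP embeddings the same ranking step would be applied per class.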

Enhancing Visual Classification using Comparative Descriptors

Visual Classification via Description from Large Language Models

LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions

Does VLM Classification Benefit from LLM Description Semantics?

Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models

SgVA-CLIP: Semantic-Guided Visual Adapting of Vision-Language Models for Few-Shot Image Classification

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

Text Descriptions are Compressive and Invariant Representations for Visual Learning

Iclip: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

Finetuning CLIP to Reason about Pairwise Differences

Fine-Grained Image Classification Via Combining Vision And Language

Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

Semantic Compositions Enhance Vision-Language Contrastive Learning

Improving Visual Counterfactual Explanation Models for Image Classification via CLIP

FewVS: A Vision-Semantics Integration Framework for Few-Shot Image Classification

Refining Skewed Perceptions in Vision-Language Models through Visual Representations

Two Effects, One Trigger: On the Modality Gap, Object Bias, and Information Imbalance in Contrastive Vision-Language Models

Semantically-Prompted Language Models Improve Visual Descriptions

Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models

S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions