Abstract: Accurately describing images via text is a foundation of explainable AI. Vision-Language Models (VLMs) like CLIP have recently addressed this by aligning images and texts in a shared embedding space, expressing semantic similarities between vision and language embeddings. VLM classification can be improved with descriptions generated by Large Language Models (LLMs). However, it is difficult to determine the contribution of actual description semantics, as the performance gain may also stem from a semantic-agnostic ensembling effect. Considering this, we ask how to distinguish the actual discriminative power of descriptions from performance boosts that potentially rely on an ensembling effect. To study this, we propose an alternative evaluation scenario that shows a characteristic behavior if the used descriptions have discriminative power. Furthermore, we propose a training-free method to select discriminative descriptions that work independently of classname ensembling effects. The training-free method works in the following way: A test image has a local CLIP label neighborhood, i.e., its top-$k$ label predictions. Then, with respect to a small selection set, we extract descriptions that distinguish each class well in the local neighborhood. Using the selected descriptions, we demonstrate improved classification accuracy across seven datasets and provide in-depth analysis and insights into the explainability of description-based image classification by VLMs.

What problem does this paper attempt to address?

The core problem that this paper attempts to solve is: **Does the classification of Vision-Language Models (VLMs) truly benefit from the descriptive semantics generated by Large Language Models (LLMs)?** Specifically, the paper focuses on how to evaluate and ensure that the improvement in VLM classification performance is due to the real contribution of the descriptive semantics generated by LLMs, rather than other factors such as the noise augmentation effect. The following are the main problems and challenges in the paper:

1. **Effectiveness of Descriptive Semantics**:
   - Descriptions generated by LLMs may contain too much general information or non-distinguishing features (for example, both "parrot" and "sparrow" are described as having feathers), and these descriptions cannot effectively distinguish different categories.
   - Providing too many LLM-generated descriptions may lead to information redundancy, and it is difficult to determine the specific contribution of each description to the final classification decision.
2. **Noise Augmentation Effect**:
   - Research has found that even replacing the descriptions generated by LLMs with random characters or high-level concepts can still improve model performance. This indicates that the performance improvement may be due to the noise augmentation effect, rather than the real semantic contribution of the LLM-generated descriptions.
   - Therefore, a method needs to be designed to distinguish whether the performance improvement comes from real semantic understanding or the noise augmentation effect.
3. **Model Interpretability**:
   - In order to improve model interpretability, the research proposes a training-free method to select discriminative descriptions, rather than simply relying on the combination of class names.
   - This method ensures that the performance improvement is due to semantic richness, rather than simple noise augmentation.
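The concern in item 2 is easier to see from how description-based scores are commonly aggregated. The following is a minimal sketch, assuming each class score is the mean cosine similarity between the image embedding and several text-prompt embeddings of that class. The 3-dimensional vectors and the helper names `cosine`, `ensemble_score`, and `classify` are made up for illustration; they are not taken from the paper or from CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ensemble_score(image_emb, prompt_embs):
    """Mean similarity of one image to all prompt embeddings of a class."""
    return sum(cosine(image_emb, p) for p in prompt_embs) / len(prompt_embs)


def classify(image_emb, class_prompts):
    """Return the best-scoring class and the per-class scores."""
    scores = {c: ensemble_score(image_emb, ps) for c, ps in class_prompts.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings (assumed values standing in for real CLIP outputs).
image = [1.0, 0.2, 0.0]
class_prompts = {
    "parrot": [[0.9, 0.3, 0.1], [1.0, 0.0, 0.2]],
    "sparrow": [[0.2, 1.0, 0.1], [0.1, 0.8, 0.3]],
}

pred, scores = classify(image, class_prompts)
print(pred, scores)
```

Because the score is an average over prompts, adding more prompts per class can change the prediction even when the added text carries no class-specific meaning, which is why a gain from descriptions alone does not prove that their semantics helped.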
### Solutions

The paper proposes a new evaluation scenario and method to ensure that the performance improvement in VLM classification indeed comes from the real semantic contribution of the LLM-generated descriptions:

1. **Avoiding the Noise Augmentation Effect**:
   - Restricting the model to classname-free descriptions (text descriptions without class names) ensures that the performance improvement is not caused by noise augmentation.
2. **Selecting Discriminative Descriptions**:
   - A training-free algorithm processes text description embeddings within the neighborhood of the query image embedding, focusing on distinguishing ambiguous categories in a specific subset.
   - The method first screens out candidate labels through class names, and then filters out descriptions that are too general or ambiguous, so that the remaining descriptions provide specific visual-language cues in the local candidate neighborhood.
3. **Evaluation Framework**:
   - A new evaluation framework checks that the performance improvement is driven by real semantic understanding, rather than the noise augmentation effect.
   - By adjusting the class name weight $w_{\text{cls}}$, the framework evaluates the performance changes under different settings and verifies the importance of classname-free descriptions.

### Experimental Results

The experimental results show that the method proposed in the paper significantly improves the classification performance on multiple datasets, and the performance is particularly outstanding in the classname-free setting. In particular, for datasets such as EuroSAT, Flowers102, CUB200, DTD, and Places, the performance improvement can reach more than 8%. In addition, this method also improves the model's interpretability, ensuring that the performance improvement is based on real semantic understanding.
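The selection step outlined under Solutions can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the margin criterion (mean similarity to the class's own selection images minus the best mean similarity to any other class in the neighborhood), the linear blend with `w_cls`, the toy 3-dimensional embeddings, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_sim(images, text_emb):
    """Mean cosine similarity of a text embedding to a list of image embeddings."""
    return sum(cosine(x, text_emb) for x in images) / len(images)


def top_k_labels(image_emb, classname_embs, k):
    """Local label neighborhood: the k classes whose names best match the image."""
    ranked = sorted(classname_embs,
                    key=lambda c: cosine(image_emb, classname_embs[c]), reverse=True)
    return ranked[:k]


def select_descriptions(neighborhood, desc_embs, selection_set, m):
    """Keep, per class, the m descriptions with the largest margin (assumed criterion)."""
    selected = {}
    for c in neighborhood:
        def margin(name, c=c):
            d = desc_embs[c][name]
            own = mean_sim(selection_set[c], d)
            other = max((mean_sim(selection_set[o], d)
                         for o in neighborhood if o != c), default=0.0)
            return own - other
        selected[c] = sorted(desc_embs[c], key=margin, reverse=True)[:m]
    return selected


def score(image_emb, c, classname_embs, desc_embs, selected, w_cls):
    """Blend class-name and description similarity; w_cls = 0 is classname-free."""
    cls_sim = cosine(image_emb, classname_embs[c])
    desc_sim = sum(cosine(image_emb, desc_embs[c][n]) for n in selected[c]) / len(selected[c])
    return w_cls * cls_sim + (1.0 - w_cls) * desc_sim


# Toy data (assumed values standing in for real CLIP embeddings).
classname_embs = {"parrot": [0.8, 0.4, 0.0], "sparrow": [0.4, 0.8, 0.0], "car": [0.0, 0.1, 1.0]}
desc_embs = {
    "parrot": {"has feathers": [0.5, 0.5, 0.0], "bright colorful plumage": [1.0, 0.0, 0.1]},
    "sparrow": {"has feathers": [0.5, 0.5, 0.0], "small brown body": [0.0, 1.0, 0.1]},
    "car": {"has four wheels": [0.0, 0.0, 1.0]},
}
selection_set = {
    "parrot": [[0.9, 0.2, 0.0], [1.0, 0.1, 0.1]],
    "sparrow": [[0.2, 0.9, 0.0], [0.1, 1.0, 0.1]],
    "car": [[0.0, 0.1, 1.0]],
}
image = [0.7, 0.5, 0.0]

neighborhood = top_k_labels(image, classname_embs, k=2)
selected = select_descriptions(neighborhood, desc_embs, selection_set, m=1)
pred = max(neighborhood,
           key=lambda c: score(image, c, classname_embs, desc_embs, selected, w_cls=0.0))
print(neighborhood, selected, pred)
```

In this toy setup the shared description "has feathers" gets a margin near zero inside the parrot/sparrow neighborhood and is dropped, mirroring the filtering of overly general descriptions. Setting `w_cls=0.0` corresponds to the classname-free setting, where the final decision rests on the selected descriptions alone.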
In conclusion, through innovative evaluation methods and algorithm design, this paper solves the problem of the effectiveness of the descriptive semantics generated by LLM in VLM classification, providing more reliable performance improvement and higher interpretability.

Does VLM Classification Benefit from LLM Description Semantics?
