Abstract: The zero-shot performance of existing vision-language models (VLMs) such as CLIP is limited by the availability of large-scale, aligned image and text datasets in specific domains. In this work, we leverage two complementary sources of information -- descriptions of categories generated by large language models (LLMs) and abundant, fine-grained image classification datasets -- to improve the zero-shot classification performance of VLMs across fine-grained domains. On the technical side, we develop methods to train VLMs with this "bag-level" image-text supervision. We find that simply using these attributes at test-time does not improve performance, but our training strategy, for example, on the iNaturalist dataset, leads to an average improvement of 4-5% in zero-shot classification accuracy for novel categories of birds and flowers. Similar improvements are observed in domains where a subset of the categories was used to fine-tune the model. By prompting LLMs in various ways, we generate descriptions that capture visual appearance, habitat, and geographic regions and pair them with existing attributes such as the taxonomic structure of the categories. We systematically evaluate their ability to improve zero-shot categorization in natural domains. Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance. Our method also outperforms prior work on prompt-based tuning of VLMs. We release the benchmark, consisting of 14 datasets at

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper addresses the poor zero-shot classification performance of existing vision-language models (VLMs) in fine-grained domains. Specifically, existing VLMs such as CLIP [29] perform poorly on zero-shot classification in specific domains, mainly because large-scale, aligned image and text datasets are lacking. These models are also weak at encoding visual attributes in fine-grained domains.

To address this, the authors combine two complementary sources of information:

1. **Category descriptions generated by large language models (LLMs)**: LLMs produce detailed descriptions of fine-grained categories, covering appearance, habitat, and geographic region.
2. **Abundant fine-grained image classification datasets**: Existing datasets such as iNaturalist [41] and NABirds [39] are combined with the LLM-generated descriptions to build a loosely aligned image-text dataset for fine-tuning VLMs.

The authors validate the approach on multiple datasets. For example, training on the iNaturalist dataset increases zero-shot classification accuracy on novel bird and flower categories by an average of 4-5%.

### Main contributions

1. **Generate fine-grained category descriptions**: LLMs produce detailed descriptions of fine-grained categories, covering appearance, habitat, and geographic region.
2. **Improve the training strategy of VLMs**: The authors develop a training method that uses "bag-level" supervision: a group of images is paired with a group of descriptions, with no image-to-text correspondence inside the group. Images and texts are paired at random and the model is trained with a category-level contrastive loss, which improves robustness and performance.
3. **Systematically evaluate the effectiveness of the method**: The authors measure how different types of descriptions (such as visual, habitat, and taxonomic information) affect zero-shot classification on multiple datasets, and find that geographic priors are as effective as visual appearance information and complementary to it.

### Experimental results

- **Performance improvement on fine-grained datasets**: On fine-grained datasets such as CUB, Stanford Cars, FGVC Aircrafts, Flowers 102, and Food 101, the authors' method significantly improves zero-shot classification performance.
- **Generalization across architectures**: The method is effective on the CLIP model and also generalizes well to other architectures.
- **Transfer performance on external datasets**: Even when trained on external datasets (such as NABirds and iNaturalist), the method still significantly improves performance on the target dataset.

In summary, the paper proposes a training strategy that combines LLM-generated descriptions of fine-grained categories with existing image classification datasets, and this strategy significantly improves the zero-shot classification performance of VLMs in fine-grained domains.
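
The bag-level supervision described in the summary (images and descriptions that share a category label but have no one-to-one correspondence) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `sample_bag_pairs` and `category_contrastive_loss`, the temperature value, and the exact form of the loss (every text in the batch with the same category label counts as a positive for an image, and vice versa) are assumptions made for illustration. Plain Python lists stand in for the outputs of the image and text encoders.

```python
import math
import random


def sample_bag_pairs(images_by_class, texts_by_class, batch_size, rng):
    """Draw (image, text, class) triples by pairing a random image with a
    random description of the same class (hypothetical helper)."""
    classes = sorted(images_by_class)
    batch = []
    for _ in range(batch_size):
        c = rng.choice(classes)
        batch.append((rng.choice(images_by_class[c]),
                      rng.choice(texts_by_class[c]), c))
    return batch


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _log_softmax(row):
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def category_contrastive_loss(img_embs, txt_embs, labels, temperature=0.07):
    """Symmetric contrastive loss in which all batch items sharing a label
    are treated as positives (an assumed reading of 'category-level')."""
    img = [_normalize(v) for v in img_embs]
    txt = [_normalize(v) for v in txt_embs]
    n = len(labels)
    # Cosine similarity between every image and every text, scaled.
    logits = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
               for j in range(n)] for i in range(n)]

    def one_direction(mat):
        total = 0.0
        for i in range(n):
            logp = _log_softmax(mat[i])
            positives = [j for j in range(n) if labels[j] == labels[i]]
            # Average negative log-likelihood over all same-class items.
            total -= sum(logp[j] for j in positives) / len(positives)
        return total / n

    text_to_image = [list(col) for col in zip(*logits)]
    return 0.5 * (one_direction(logits) + one_direction(text_to_image))
```

In a real training loop the embeddings would come from the VLM's encoders and the loss would be computed with a tensor library so that gradients can flow back into the model. Treating all same-class items as positives avoids penalizing the model when two randomly paired samples from one category land in the same batch.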

Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

Pushing Boundaries: Exploring Zero Shot Object Classification with Large Multimodal Models

LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections

LLM meets Vision-Language Models for Zero-Shot One-Class Classification

Visual Classification via Description from Large Language Models

Vision-Language Models for Zero-Shot Classification of Remote Sensing Images

At First Sight: Zero-Shot Classification of Astronomical Images with Large Multimodal Models

The Neglected Tails in Vision-Language Models

Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models

Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment

Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity

Zero-Shot Prompting and Few-Shot Fine-Tuning: Revisiting Document Image Classification Using Large Language Models

Label Propagation for Zero-shot Classification with Vision-Language Models

Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification

Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness

Prompting Scientific Names for Zero-Shot Species Recognition

Improving Zero-Shot Generalization for CLIP with Variational Adapter

Zero-shot animal behavior classification with vision-language foundation models

The Benefits of Label-Description Training for Zero-Shot Text Classification

Zero-Shot Visual Reasoning by Vision-Language Models: Benchmarking and Analysis