Abstract: Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at: https://github.com/joeyy5588/VRL/tree/main

What problem does this paper attempt to address?

The problem this paper addresses is how to improve the generalization of image classification models when data is scarce, especially for fine-grained classification and the classification of novel concepts. Specifically, the authors propose Verbalized Representation Learning (VRL), which extracts interpretable features as natural-language descriptions in order to improve classification performance with only a small number of samples.

### Core problems of the paper

1. **Few-shot learning in fine-grained classification**:
   - How can we identify the subtle differences between categories and capture the common features within a category when only a small number of samples are available?
   - For example, when distinguishing between two types of fish, the model needs to recognize the differences in head patterns and at the same time notice their common white stripes.
2. **Classification of novel concepts**:
   - How can we effectively classify objects that rarely appear, or do not appear at all, in the pre-training dataset?
   - For example, the Kiki-Bouba dataset is used to test the model's ability to recognize abstract shapes.

### Solutions

To solve the above problems, the authors propose Verbalized Representation Learning (VRL). Its main steps are:

- **Extract cross-category differences and within-category commonalities**:
  - Use the Vision-Language Model (VLM) to describe the key distinguishing features between different categories (inter-class differences) and the shared features within the same category (intra-class commonalities).
  - For example, for two different kinds of fish, VRL generates a description of the differences in head color and pattern; for fish of the same category, VRL generates a description of their common features (such as orange spots on the face).
- **Map language descriptions to numerical vectors**:
  - Convert these language descriptions into numerical vectors for use in downstream classification tasks.
  - Specifically, given an image and a set of language descriptions, the VLM evaluates whether the image has the features they describe and generates a feature vector \( \mathbf{F} \). Each dimension represents the presence, absence, or degree of one feature.
- **Training and inference**:
  - Use the generated numerical feature vectors to train various classifiers (such as logistic regression, random forest, or an MLP).
  - At inference time, given a test image, generate its feature vector and pass it to the trained classifier for prediction.

### Experimental results

The experimental results show that VRL has significant advantages in the following aspects:

- **Data efficiency**: Using only 10 samples (compared to more than 200 samples used by other methods), VRL still achieves better classification results than existing methods.
- **Model scale**: Even with a smaller model (7B parameters), VRL outperforms existing methods that use large-scale models (70B parameters).
- **Adaptability to new concepts**: VRL also performs well on novel concepts that have never been seen before.

Through these improvements, VRL not only improves classification performance but also enhances the interpretability and robustness of the model, making it suitable for low-resource scenarios such as fine-grained classification and novel concept recognition.
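The feature-mapping and classifier-training steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_scorer` is a hypothetical stand-in for the VLM query (a real system would prompt a VLM with the image and the description), the toy "images" are just sets of attribute strings, the descriptions are invented for the example, and the logistic regression is a small hand-written version chosen to keep the sketch dependency-free.

```python
import math
from typing import Callable, Dict, List, Sequence, Set, Tuple

# A toy "image" is only the set of attributes visible in it (illustrative assumption).
Image = Dict[str, Set[str]]
Scorer = Callable[[Image, str], float]


def toy_scorer(image: Image, description: str) -> float:
    """Hypothetical stand-in for a VLM query: 1.0 if the described feature is present."""
    return 1.0 if description in image["attributes"] else 0.0


def featurize(image: Image, descriptions: Sequence[str], scorer: Scorer) -> List[float]:
    """Build the feature vector F: one dimension per verbalized feature."""
    return [scorer(image, d) for d in descriptions]


def train_logistic_regression(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.5, epochs: int = 200
) -> Tuple[List[float], float]:
    """Fit a binary logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - target  # derivative of the log loss with respect to z
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> int:
    """Return class 1 if the linear score is non-negative, else class 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0 else 0


# Invented verbalized features: two inter-class differences, one intra-class commonality.
DESCRIPTIONS = ["striped head pattern", "spotted head pattern", "white body stripes"]

# Few-shot training set: (toy image, class label).
train_set = [
    ({"attributes": {"striped head pattern", "white body stripes"}}, 0),
    ({"attributes": {"striped head pattern"}}, 0),
    ({"attributes": {"spotted head pattern", "white body stripes"}}, 1),
    ({"attributes": {"spotted head pattern"}}, 1),
]

X_train = [featurize(img, DESCRIPTIONS, toy_scorer) for img, _ in train_set]
y_train = [label for _, label in train_set]
weights, bias = train_logistic_regression(X_train, y_train)

# Inference: featurize the test image, then apply the trained classifier.
test_image = {"attributes": {"spotted head pattern", "white body stripes"}}
prediction = predict(weights, bias, featurize(test_image, DESCRIPTIONS, toy_scorer))
print("predicted class:", prediction)
```

Because each dimension of the vector corresponds to one natural-language description, the learned weights can be read directly as the importance of each verbalized feature, which is the interpretability property the summary refers to. Any of the classifiers mentioned above (random forest, MLP) could be swapped in for the logistic regression.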

Verbalized Representation Learning for Interpretable Few-Shot Generalization
