Abstract: Humans recognize objects after observing only a few examples, a remarkable capability enabled by their inherent language understanding of the real-world environment. Developing verbalized and interpretable representations can significantly improve model generalization in low-data settings. In this work, we propose Verbalized Representation Learning (VRL), a novel approach for automatically extracting human-interpretable features for object recognition using few-shot data. Our method uniquely captures inter-class differences and intra-class commonalities in the form of natural language by employing a Vision-Language Model (VLM) to identify key discriminative features between different classes and shared characteristics within the same class. These verbalized features are then mapped to numeric vectors through the VLM. The resulting feature vectors can be further utilized to train and infer with downstream classifiers. Experimental results show that, at the same model scale, VRL achieves a 24% absolute improvement over prior state-of-the-art methods while using 95% less data and a smaller model. Furthermore, compared to human-labeled attributes, the features learned by VRL exhibit a 20% absolute gain when used for downstream classification tasks. Code is available at https://github.com/joeyy5588/VRL/tree/main
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to improve the generalization of image classification models when training data is scarce, particularly for fine-grained classification and for the classification of novel concepts. The authors propose Verbalized Representation Learning (VRL), which extracts interpretable features expressed as natural-language descriptions and uses them to improve classification performance from a small number of samples.
### Core problems of the paper
1. **Few-shot learning in fine-grained classification**:
   - How can a model identify the subtle differences between categories, and capture the features shared within a category, when only a small number of samples is available?
   - For example, when distinguishing between two types of fish, the model needs to recognize the differences in their head patterns while also noticing the white stripes they have in common.
2. **Classification of novel concepts**:
   - How can a model effectively classify objects that appear rarely, or not at all, in the pre-training dataset?
   - For example, the Kiki-Bouba dataset is used to test the model's ability to recognize abstract shapes.
### Solutions
To address these problems, the authors propose Verbalized Representation Learning (VRL), which consists of the following main steps:
- **Extract inter-class differences and intra-class commonalities**:
  - Use the Vision-Language Model (VLM) to describe the key distinguishing features between different categories (inter-class differences) and the shared features within the same category (intra-class commonalities).
  - For example, for two different kinds of fish, VRL generates a description of the differences in head color and pattern; for fish of the same category, VRL generates a description of their common features (such as orange spots on the face).
- **Map language descriptions to numerical vectors**:
  - Convert these language descriptions into numerical vectors for use in downstream classification tasks.
  - Specifically, given an image and a set of language descriptions, the VLM evaluates whether the image exhibits each described feature and produces a feature vector \( \mathbf{F} \). Each dimension of \( \mathbf{F} \) represents the presence, absence, or degree of one described feature.
- **Training and inference**:
  - Use the resulting feature vectors to train a downstream classifier (such as logistic regression, random forest, or an MLP).
  - At inference time, compute the feature vector of a test image and pass it to the trained classifier for prediction.
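
The mapping and classification steps above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' implementation: `toy_scorer` is a hypothetical placeholder for the VLM query, an "image" is faked as a dictionary of attribute strengths, the description strings and scores are invented for the fish example, and the logistic regression is hand-written so the sketch needs only the standard library.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Verbalized features (illustrative strings, loosely based on the fish example).
DESCRIPTIONS = [
    "orange spots on the face",
    "white stripes on the body",
    "dark pattern on the head",
]

Image = Dict[str, float]  # toy stand-in: description -> strength in [0, 1]


def toy_scorer(image: Image, description: str) -> float:
    """Hypothetical placeholder for asking a VLM how strongly the image shows a feature."""
    return image.get(description, 0.0)


def verbalized_features(
    image: Image,
    descriptions: Sequence[str],
    scorer: Callable[[Image, str], float] = toy_scorer,
) -> List[float]:
    """Build the feature vector F: one dimension per verbalized description."""
    return [scorer(image, d) for d in descriptions]


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(
    X: Sequence[Sequence[float]], y: Sequence[int], lr: float = 0.5, epochs: int = 300
) -> Tuple[List[float], float]:
    """Plain stochastic gradient descent on the binary log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - target
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


# Few-shot training set: two invented examples per class (0 and 1).
train_images: List[Image] = [
    {DESCRIPTIONS[0]: 0.9, DESCRIPTIONS[1]: 0.8},
    {DESCRIPTIONS[0]: 0.8, DESCRIPTIONS[1]: 0.9, DESCRIPTIONS[2]: 0.1},
    {DESCRIPTIONS[1]: 0.9, DESCRIPTIONS[2]: 0.9},
    {DESCRIPTIONS[0]: 0.1, DESCRIPTIONS[1]: 0.7, DESCRIPTIONS[2]: 0.8},
]
train_labels = [0, 0, 1, 1]

X_train = [verbalized_features(img, DESCRIPTIONS) for img in train_images]
weights, bias = train_logistic_regression(X_train, train_labels)

# Inference: featurize unseen images, then classify.
query_a: Image = {DESCRIPTIONS[0]: 0.85, DESCRIPTIONS[1]: 0.8}
query_b: Image = {DESCRIPTIONS[1]: 0.85, DESCRIPTIONS[2]: 0.95}
pred_a = predict(weights, bias, verbalized_features(query_a, DESCRIPTIONS))
pred_b = predict(weights, bias, verbalized_features(query_b, DESCRIPTIONS))
```

Because every dimension of the vector corresponds to a readable description, the learned weights can be inspected directly. In this toy setup the shared "white stripes" feature carries little signal, while the two class-specific features separate the classes.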
### Experimental results
The experimental results show that VRL has significant advantages in the following aspects:
- **Data efficiency**: Using only 10 samples (compared to more than 200 samples used by other methods), VRL can still achieve better classification results than existing methods.
- **Model scale**: Even with a smaller model (7B parameters), VRL can outperform existing methods that use large-scale models (70B parameters).
- **New concept adaptability**: VRL also performs well when dealing with novel concepts that have never been seen before.
Taken together, these results indicate that VRL improves classification performance while also making the model more interpretable and robust, which makes it suitable for low-resource scenarios such as fine-grained classification and novel-concept recognition.