Abstract: Ordinal regression is a fundamental problem within the field of computer vision, with customised well-trained models on specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less exploration. In this study, we first investigate CLIP's potential for ordinal regression, from which we expect the model could generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation of encapsulating compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We disassemble the exact image to number-specific text matching problem into coarse classification and fine prediction stages. We discretize and phrase each numerical bin with common language concept to better leverage the available pre-trained alignment in CLIP. To consider the inherent continuous property of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to keep both semantic and ordinal alignment in CLIP's feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvement on historical image dating and image aesthetics assessment task, respectively. Code is publicly available at https://github.com/xmed-lab/NumCLIP.
Computer Vision and Pattern Recognition, Computation and Language, Machine Learning
What problem does this paper attempt to address?
The paper addresses the poor performance of vision-language models (VLMs) on ordinal regression tasks, focusing on the limitations of CLIP. Specifically:
1. **Ordinal Regression Problem**: Ordinal regression is a machine-learning task that aims to predict labels with an intrinsic order. Examples include age estimation, historical image dating, and image aesthetics assessment.
2. **Limitations of Existing Methods**:
- **Direct Application of CLIP**: Treating numerical index labels directly as class markers and predicting with zero-shot or few-shot learning has limited effectiveness. For example, the zero-shot mean absolute error (MAE) in age estimation is 6.09, and the zero-shot accuracy in historical image dating is only 26.08%.
- **Insufficient Pretraining Data**: Current vision-language models see too few numerical descriptions during pretraining. For larger numbers in particular, the pretraining text tends to use approximate or qualitative descriptions rather than exact numbers.
- **Ineffective Training Objectives**: Existing contrastive-learning objectives fail to distinguish numerical information effectively, which leaves the model with a weak understanding of numbers.
3. **Proposed Method**: To address these problems, the paper proposes NumCLIP, which improves CLIP's performance on ordinal regression tasks in two ways:
- **Coarse-to-Fine Learning Paradigm**: The exact image-to-number matching problem is decomposed into two stages: coarse classification and fine prediction. Numbers are discretized into bins, and each bin is described with a common language concept so that the alignment already learned during pretraining can be better exploited.
- **Cross-Modal Ranking Feature Regularization Loss**: A new fine-grained cross-modal ranking-based regularization loss keeps both semantic and ordinal alignment, so that ordinal relationships are represented correctly in CLIP's feature space.
4. **Contributions**:
- Proposes NumCLIP, which combines a coarse-to-fine learning paradigm, recasting weak image-to-number alignment as strong image-to-language alignment, with fine-grained feature regularization. Together these significantly improve CLIP's performance on ordinal regression tasks.
- Extends cross-modal contrastive learning to ordinal regression tasks for the first time and provides a theoretical analysis from the perspective of mutual information.
- Shows experimentally that NumCLIP outperforms previous state-of-the-art methods on three widely used benchmarks, covering historical image dating, image aesthetics assessment, and age estimation.
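
The coarse-to-fine paradigm from item 3 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `coarse_to_fine_predict`, the bin phrases, the bin edges, the toy embeddings, and the `fine_head` callable are all hypothetical stand-ins. The coarse stage picks the bin whose text embedding is most similar to the image embedding; the fine stage is assumed here to return an offset in [0, 1] inside that bin.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def coarse_to_fine_predict(image_feat, bin_text_feats, bin_edges, fine_head):
    """Hypothetical two-stage prediction (illustrative only).

    image_feat:     (d,) image embedding
    bin_text_feats: (k, d) text embeddings, one per bin phrase
    bin_edges:      (k + 1,) increasing bin boundaries
    fine_head:      callable (image_feat, bin_idx) -> offset in [0, 1]
    """
    # Coarse stage: choose the bin whose phrase best matches the image.
    sims = l2_normalize(bin_text_feats) @ l2_normalize(image_feat)
    bin_idx = int(np.argmax(sims))
    # Fine stage: place the prediction inside the chosen bin.
    lo, hi = bin_edges[bin_idx], bin_edges[bin_idx + 1]
    offset = float(np.clip(fine_head(image_feat, bin_idx), 0.0, 1.0))
    return bin_idx, float(lo + offset * (hi - lo))

# Toy example: four assumed age bins phrased in common language.
bin_phrases = ["a child", "a young adult", "a middle-aged adult", "an elderly person"]
bin_edges = np.array([0.0, 18.0, 40.0, 65.0, 100.0])
bin_text_feats = np.eye(4)                    # stand-in for text-encoder outputs
image_feat = np.array([0.1, 0.9, 0.2, 0.0])   # stand-in for an image-encoder output

bin_idx, value = coarse_to_fine_predict(
    image_feat, bin_text_feats, bin_edges, fine_head=lambda feat, idx: 0.5
)
print(bin_phrases[bin_idx], value)
```

In a real setting the stand-in arrays would come from CLIP's text and image encoders, and the fine stage would be learned rather than a constant.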
In summary, the paper addresses CLIP's limitations on ordinal regression tasks by improving its numerical understanding, and it proposes an effective solution in NumCLIP.
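
To make the idea of an order-aware contrastive objective concrete, the sketch below weights each negative pair in a standard InfoNCE-style loss by the distance between ordinal labels. The summary does not give the exact form of the paper's regularization loss, so this weighting scheme, the function name `rank_weighted_contrastive_loss`, and the parameters `tau` and `alpha` are assumptions made purely for illustration.

```python
import numpy as np

def rank_weighted_contrastive_loss(img, txt, labels, tau=0.07, alpha=1.0):
    """Illustrative order-aware contrastive loss (not the paper's exact loss).

    img, txt: (n, d) paired image and text embeddings
    labels:   (n,) ordinal labels, e.g. ages
    tau:      temperature
    alpha:    strength of the label-distance weighting (0 gives plain InfoNCE)
    """
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau                          # (n, n) similarities
    # Pairs with more distant labels receive a larger weight as negatives.
    dist = np.abs(labels[:, None] - labels[None, :]).astype(float)
    weights = 1.0 + alpha * dist / max(dist.max(), 1e-12)   # diagonal stays 1
    # Numerically stable log of sum_j weights_ij * exp(logits_ij).
    z = logits + np.log(weights)
    z_max = z.max(axis=1, keepdims=True)
    log_denom = z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1))
    return float(np.mean(log_denom - np.diag(logits)))

# Toy batch with made-up embeddings and age labels.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.1 * rng.normal(size=(4, 8))
labels = np.array([20.0, 25.0, 60.0, 62.0])

plain = rank_weighted_contrastive_loss(img, txt, labels, alpha=0.0)
weighted = rank_weighted_contrastive_loss(img, txt, labels, alpha=1.0)
print(plain, weighted)
```

Setting `alpha=0` recovers the unweighted contrastive loss, which makes it easy to compare the two objectives on the same batch.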