Abstract: Ordinal regression is a fundamental problem within the field of computer vision, with customised well-trained models on specific tasks. While pre-trained vision-language models (VLMs) have exhibited impressive performance on various vision tasks, their potential for ordinal regression has received less exploration. In this study, we first investigate CLIP's potential for ordinal regression, from which we expect the model could generalise to different ordinal regression tasks and scenarios. Unfortunately, vanilla CLIP fails on this task, since current VLMs have a well-documented limitation of encapsulating compositional concepts such as number sense. We propose a simple yet effective method called NumCLIP to improve the quantitative understanding of VLMs. We disassemble the exact image to number-specific text matching problem into coarse classification and fine prediction stages. We discretize and phrase each numerical bin with common language concept to better leverage the available pre-trained alignment in CLIP. To consider the inherent continuous property of ordinal regression, we propose a novel fine-grained cross-modal ranking-based regularisation loss specifically designed to keep both semantic and ordinal alignment in CLIP's feature space. Experimental results on three general ordinal regression tasks demonstrate the effectiveness of NumCLIP, with 10% and 3.83% accuracy improvement on historical image dating and image aesthetics assessment task, respectively. Code is publicly available at https://github.com/xmed-lab/NumCLIP.

What problem does this paper attempt to address?

The problem this paper addresses is the poor performance of vision-language models (VLMs), and of CLIP in particular, on ordinal regression tasks. Specifically:

1. **Ordinal Regression Problem**: Ordinal regression is a machine-learning task that aims to predict labels with an intrinsic order. Age estimation, historical image dating, and image aesthetics assessment are all ordinal regression problems.
2. **Limitations of Existing Methods**:
   - **Direct Application of CLIP**: Treating numerical labels directly as class names and predicting in a zero-shot or few-shot setting has limited effectiveness. For example, the zero-shot MAE on age estimation is 6.09, and the zero-shot accuracy on historical image dating is only 26.08%.
   - **Insufficient Pretraining Data**: The pretraining data of current vision-language models contains too few numerical descriptions. For larger numbers in particular, the text tends to use approximate or qualitative descriptions rather than exact numbers.
   - **Ineffective Training Objectives**: Existing contrastive-learning objectives fail to distinguish numerical information effectively, which leaves the model with a weak understanding of numbers.
3. **Proposed Method**: To address these problems, the paper proposes NumCLIP, which aims to improve CLIP's performance on ordinal regression in two ways:
   - **Coarse-to-Fine Learning Paradigm**: Decompose the exact image-to-number matching problem into two stages: coarse classification and fine prediction. Numbers are discretized into bins, and each bin is described with a common language concept so that the alignment already learned by the pretrained model is better exploited.
   - **Cross-Modal Ranking Feature Regularization Loss**: Introduce a new fine-grained cross-modal ranking-based regularization loss that maintains both semantic and ordinal alignment, so that ordinal relationships are represented correctly in the feature space.
4. **Contributions**:
   - Propose NumCLIP, which combines a coarse-to-fine learning paradigm (recasting weak image-to-number alignment as strong image-to-language alignment) with fine-grained feature regularization, significantly improving CLIP's performance on ordinal regression tasks.
   - Extend cross-modal contrastive learning to ordinal regression for the first time, with a theoretical analysis from the perspective of mutual information.
   - Show experimentally that NumCLIP outperforms previous state-of-the-art methods on three widely used benchmark datasets, with significant improvements on historical image dating, image aesthetics assessment, and age estimation.

In summary, this paper addresses the limitations of CLIP on ordinal regression tasks by improving its numerical understanding, and it proposes an effective solution.
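The two ideas summarised above, coarse-to-fine prediction and an order-aware ranking penalty, can be illustrated with a small toy sketch. This is not the NumCLIP implementation: the embeddings are hand-made 2-D vectors rather than CLIP features, and the bin ranges, the temperature, the clipped expected-value rule for the fine stage, and the hinge form of the penalty are all assumptions chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def coarse_to_fine_predict(image_emb, bin_text_embs, bin_ranges, temperature=0.1):
    """Return (coarse bin index, fine numeric estimate).

    Coarse stage: choose the bin whose phrase embedding best matches the image.
    Fine stage (assumed rule): probability-weighted mean of the bin centres,
    clipped so that it stays inside the chosen bin.
    """
    sims = [cosine(image_emb, t) for t in bin_text_embs]
    probs = softmax(sims, temperature)
    coarse = max(range(len(probs)), key=probs.__getitem__)
    centres = [(lo + hi) / 2 for lo, hi in bin_ranges]
    expected = sum(p * c for p, c in zip(probs, centres))
    lo, hi = bin_ranges[coarse]
    return coarse, min(max(expected, lo), hi)

def ranking_penalty(image_emb, bin_text_embs, true_bin, margin=0.0):
    """Hypothetical hinge penalty on the ordering of similarities.

    For every pair of bins (i, j) where i is closer in rank to the true bin
    than j, penalise cases where j scores higher than i (plus a margin).
    """
    sims = [cosine(image_emb, t) for t in bin_text_embs]
    penalty = 0.0
    for i in range(len(sims)):
        for j in range(len(sims)):
            if abs(i - true_bin) < abs(j - true_bin):
                penalty += max(0.0, margin + sims[j] - sims[i])
    return penalty

# Toy data: three bins, e.g. phrased as "young", "middle-aged", "old".
bin_ranges = [(0, 20), (20, 40), (40, 60)]
bin_text_embs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
image_emb = [0.6, 0.8]

coarse, fine = coarse_to_fine_predict(image_emb, bin_text_embs, bin_ranges)
print(coarse, round(fine, 1))
print(ranking_penalty(image_emb, bin_text_embs, true_bin=1))
print(ranking_penalty(image_emb, bin_text_embs, true_bin=0))
```

In this toy setup the image is most similar to the middle bin, so the penalty is zero when the true bin is the middle one and positive when the true bin is the first one, because the similarities then disagree with the label order. A real system would replace the hand-made vectors with image and text features from a CLIP encoder and would learn from the penalty by gradient descent.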

Teach CLIP to Develop a Number Sense for Ordinal Regression

OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression

CountCLIP -- [Re] Teaching CLIP to Count to Ten

Learning-to-Rank Meets Language: Boosting Language-Driven Ordering Alignment for Ordinal Classification

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

CORE: Learning Consistent Ordinal REpresentations for Image Ordinal Estimation

Ord2Seq: Regarding Ordinal Regression As Label Sequence Prediction

RankCLIP: Ranking-Consistent Language-Image Pretraining

DiffCLIP: Few-shot Language-driven Multimodal Classifier

CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference

TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation

Investigating the Limitation of CLIP Models: The Worst-Performing Categories

CLIP-KD: An Empirical Study of CLIP Model Distillation

AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning

What Makes CLIP More Robust to Long-Tailed Pre-Training Data? A Controlled Study for Transferable Insights

TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training

Transductive Zero-Shot and Few-Shot CLIP

ReCLIP++: Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation