Abstract: In recent studies, the Contrastive Language–Image Pretraining (CLIP) model has showcased remarkable versatility in downstream tasks, ranging from image captioning and question-answering reasoning to image–text similarity rating. In this paper, we investigate the effectiveness of CLIP visual features in predicting perceptual image quality. CLIP is also compared with competitive large multimodal models (LMMs) for this task. In contrast to previous studies, the results show that CLIP and other LMMs do not always provide the best performance. Interestingly, our evaluation experiment reveals that combining visual features from CLIP or other LMMs with some simple distortion features can significantly enhance their performance. In some cases, the improvement exceeds 10% and the prediction accuracy surpasses 90%.

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

This paper aims to explore the effectiveness of the visual features of the Contrastive Language-Image Pretraining (CLIP) model in predicting perceived image quality. Specifically, the researchers evaluate whether the visual features of the CLIP model can effectively represent perceptual distortions in images and compare it with existing large multimodal models (LMMs) and other traditional image quality assessment models.

### Background and Motivation

In the field of image quality assessment, especially in no-reference image quality assessment tasks, traditional methods mainly rely on distortion features. However, in recent years, multimodal models like CLIP have made significant progress in image and text alignment tasks. These models are pretrained on large-scale image and text datasets and have demonstrated strong capabilities in various downstream tasks. Therefore, the researchers aim to explore whether the visual features of these models can also be used for image quality assessment, particularly for predicting perceived image quality.

### Research Objectives

1. **Evaluate the effectiveness of CLIP visual features**: Study the performance of the visual features of the CLIP model in predicting perceived image quality.
2. **Compare the performance of different models**: Compare the CLIP model with other large multimodal models (such as HPS, ALTCLIP, ALIGN) and traditional image quality assessment models.
3. **Combine distortion features**: Explore whether combining the visual features of multimodal models like CLIP with simple distortion features can significantly improve prediction performance.

### Key Findings

- **Performance of CLIP visual features**: Although the CLIP model performs well in many downstream tasks, its performance in image quality assessment tasks is not always optimal. In some cases, the performance of CLIP is even lower than that of other models.
- **Effect of combining distortion features**: Experimental results show that combining the visual features of CLIP or other multimodal models with simple distortion features can significantly improve prediction performance. In some cases, the performance improvement exceeds 10%, with prediction accuracy surpassing 90%.

### Conclusion

This study is the first to comprehensively evaluate the effectiveness of the visual features of CLIP and its related models in image quality assessment. Although the CLIP model performs well in many tasks, relying solely on its visual features may not be sufficient for image quality assessment tasks. Combining simple distortion features can significantly enhance performance, providing new insights and methods for future image quality assessment.
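
The feature-combination idea summarized above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the summary does not say which regressor, which distortion features, or which evaluation metric the paper uses. Here the "visual" vectors are random stand-ins for CLIP image embeddings, the "distortion" vectors are random stand-ins for simple distortion descriptors, the regressor is a closed-form ridge regression, and the score is the Pearson correlation between predicted and synthetic quality. The helper names (`fit_ridge`, `run_demo`, and so on) are hypothetical.

```python
import numpy as np


def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression with an unpenalized bias term."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    reg = lam * np.eye(Xb.shape[1])
    reg[-1, -1] = 0.0  # leave the bias unregularized
    return np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ y)


def predict(w, X):
    """Apply the fitted weights (last entry is the bias)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w


def standardize(train, test):
    """Z-score both splits using statistics from the training split only."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + 1e-8
    return (train - mu) / sd, (test - mu) / sd


def pearson(a, b):
    """Pearson linear correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


def run_demo(seed=0, n=400, d_visual=64, d_dist=4):
    rng = np.random.default_rng(seed)

    # ASSUMPTION: random stand-ins, not real CLIP embeddings or real
    # distortion measurements.
    visual = rng.normal(size=(n, d_visual))
    distortion = rng.normal(size=(n, d_dist))

    # ASSUMPTION: synthetic quality scores that depend on both feature
    # groups by construction, so the outcome says nothing about real data.
    w_v = rng.normal(size=d_visual) / np.sqrt(d_visual)
    w_d = np.ones(d_dist)
    quality = visual @ w_v + distortion @ w_d + 0.1 * rng.normal(size=n)

    split = n // 2  # first half for fitting, second half held out
    feature_sets = {
        "visual_only": visual,
        "combined": np.hstack([visual, distortion]),  # simple concatenation
    }

    results = {}
    for name, feats in feature_sets.items():
        tr, te = standardize(feats[:split], feats[split:])
        w = fit_ridge(tr, quality[:split])
        results[name] = pearson(predict(w, te), quality[split:])
    return results


if __name__ == "__main__":
    for name, score in run_demo().items():
        print(f"{name}: held-out Pearson correlation = {score:.3f}")
```

Because the synthetic targets are built from both feature groups, the combined model scores higher here by design. The sketch only shows the mechanics of concatenating the two feature sets and comparing held-out correlation; it does not reproduce the paper's experiments or numbers.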

Perceptual Image Quality Prediction: Are Contrastive Language–Image Pretraining (CLIP) Visual Features Effective?
