Perceptual Image Quality Prediction: Are Contrastive Language–Image Pretraining (CLIP) Visual Features Effective?

Chibuike Onuoha, Jean Flaherty, Truong Cong Thang
DOI: https://doi.org/10.3390/electronics13040803
IF: 2.9
2024-02-20
Electronics
Abstract: In recent studies, the Contrastive Language–Image Pretraining (CLIP) model has showcased remarkable versatility in downstream tasks, ranging from image captioning and question-answering reasoning to image–text similarity rating, etc. In this paper, we investigate the effectiveness of CLIP visual features in predicting perceptual image quality. CLIP is also compared with competitive large multimodal models (LMMs) for this task. In contrast to previous studies, the results show that CLIP and other LMMs do not always provide the best performance. Interestingly, our evaluation experiment reveals that combining visual features from CLIP or other LMMs with some simple distortion features can significantly enhance their performance. In some cases, the improvements are even more than 10%, while the prediction accuracy surpasses 90%.
Engineering, Electrical & Electronic; Computer Science, Information Systems; Physics, Applied
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve

This paper explores how effective the visual features of the Contrastive Language–Image Pretraining (CLIP) model are at predicting perceptual image quality. Specifically, the researchers evaluate whether CLIP visual features can represent perceptual distortions in images, and they compare CLIP with other large multimodal models (LMMs) and with traditional image quality assessment models.

### Background and Motivation

In image quality assessment, and especially in no-reference image quality assessment, traditional methods rely mainly on distortion features. In recent years, however, multimodal models such as CLIP have made significant progress on image–text alignment tasks. These models are pretrained on large-scale image and text datasets and have demonstrated strong capabilities in a variety of downstream tasks. The researchers therefore ask whether the visual features of these models can also be used for image quality assessment, and in particular for predicting perceptual image quality.

### Research Objectives

1. **Evaluate the effectiveness of CLIP visual features**: Study how well the visual features of the CLIP model predict perceptual image quality.
2. **Compare the performance of different models**: Compare the CLIP model with other large multimodal models (such as HPS, ALTCLIP, ALIGN) and with traditional image quality assessment models.
3. **Combine distortion features**: Explore whether combining the visual features of multimodal models like CLIP with simple distortion features can significantly improve prediction performance.

### Key Findings

- **Performance of CLIP visual features**: Although the CLIP model performs well in many downstream tasks, its performance in image quality assessment is not always the best. In some cases, CLIP performs worse than other models.
- **Effect of combining distortion features**: Experimental results show that combining the visual features of CLIP or other multimodal models with simple distortion features can significantly improve prediction performance. In some cases, the improvement exceeds 10%, with prediction accuracy surpassing 90%.

### Conclusion

This study is the first to comprehensively evaluate the effectiveness of the visual features of CLIP and related models in image quality assessment. Although the CLIP model performs well in many tasks, its visual features alone may not be sufficient for image quality assessment. Combining them with simple distortion features can significantly enhance performance, which offers new insights and methods for future image quality assessment work.
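
To make the feature-combination idea in the key findings more concrete, here is a minimal Python sketch. It is not the authors' pipeline: the summary does not say which distortion features or which regressor the paper uses, so the two cues in `distortion_features` (Laplacian variance and intensity standard deviation), the ridge regressor, and the 512-dimensional embedding size are all assumptions chosen for illustration. The random arrays in the demo are synthetic placeholders standing in for real images, real CLIP visual features, and real subjective quality scores.

```python
import numpy as np


def distortion_features(gray):
    """Two hypothetical distortion cues for a 2-D grayscale image in [0, 1]."""
    # Sharpness cue: variance of a 4-neighbour Laplacian response
    # (tends to be low for blurry images).
    lap = (-4.0 * gray[1:-1, 1:-1]
           + gray[:-2, 1:-1] + gray[2:, 1:-1]
           + gray[1:-1, :-2] + gray[1:-1, 2:])
    # Contrast cue: standard deviation of pixel intensities.
    return np.array([lap.var(), gray.std()])


def combine(visual, distortion):
    """Concatenate a visual embedding with distortion features."""
    return np.concatenate([visual, distortion])


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept."""
    mu_x, mu_y = X.mean(axis=0), y.mean()
    Xc = X - mu_x
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - mu_y))
    return w, mu_y - mu_x @ w


def predict(X, w, b):
    """Predict quality scores from combined feature vectors."""
    return X @ w + b


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.random((8, 64, 64))        # synthetic stand-ins for images
    embeddings = rng.normal(size=(8, 512))  # placeholder for CLIP features
    scores = rng.random(8)                  # placeholder for quality scores
    X = np.stack([combine(e, distortion_features(im))
                  for e, im in zip(embeddings, images)])
    w, b = fit_ridge(X, scores)
    print(X.shape, predict(X, w, b).shape)
```

In a real experiment the placeholder embeddings would come from a pretrained image encoder, the scores from a subjective quality dataset, and the regressor would be evaluated on held-out images. Because the distortion cues live on a very different scale from the embedding dimensions, standardising each feature column before fitting would be a sensible extra step.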