Abstract: For full-reference image quality assessment (FR-IQA) using deep-learning approaches, the perceptual similarity score between a distorted image and a reference image is typically computed as a distance measure between features extracted from a pretrained CNN or, more recently, a Transformer network. Often, these intermediate features require further fine-tuning or processing with additional neural network layers to align the final similarity scores with human judgments. So far, most IQA models based on foundation models have relied primarily on the final layer or the embedding for quality score estimation. In contrast, this work explores the potential of the intermediate features of these foundation models, which have so far been largely unexplored in the design of low-level perceptual similarity metrics. We demonstrate that the intermediate features are comparatively more effective. Moreover, without requiring any training, these metrics can outperform both traditional and state-of-the-art learned metrics by utilizing distance measures between the features.

What problem does this paper attempt to address?

This paper asks how the intermediate features of foundation models can be used to build more effective and more robust low-level perceptual similarity metrics. Specifically, the authors examine whether the intermediate features of pre-trained foundation models (such as CLIP and DINO) offer an advantage over the final embeddings in the full-reference image quality assessment (FR-IQA) task.

### Main problems

1. **Limitations of existing methods**:
   - Most existing FR-IQA methods built on foundation models rely on the final layer or the embedding to estimate the quality score.
   - These embeddings usually need further fine-tuning or processing to align with human judgments.
2. **Research motivation**:
   - The authors observe that intermediate features capture local, low-level information (such as edges and textures), which may matter more for low-level perceptual similarity metrics.
   - They therefore ask whether more accurate and more robust low-level perceptual similarity metrics can be built from the intermediate features of a foundation model.

### Solutions

To test this hypothesis, the authors carry out the following tasks:

1. **Compare embeddings and intermediate features**:
   - Evaluate experimentally how metrics based on embeddings and on intermediate features perform across different datasets.
2. **Evaluate different foundation models**:
   - Compare foundation models such as CLIP and DINO, with particular attention to robustness under geometric transformations (such as translation, scaling, and rotation).
3. **Explore different distance measures**:
   - Apply several distribution and distance measures (such as the Wasserstein distance and the Jensen-Shannon divergence) to assess the effectiveness of the intermediate features.
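The third task, scoring a distorted image by a distribution measure over intermediate features, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the activations are random placeholders standing in for features from a few intermediate layers of a model such as DINO, each layer is reduced to a histogram of activation values (which discards spatial layout), and the helper names `js_divergence`, `layer_histogram`, and `intermediate_feature_distance` are hypothetical.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    p = np.asarray(p, dtype=np.float64) + eps  # eps avoids log(0) for empty bins
    q = np.asarray(q, dtype=np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * (kl_pm + kl_qm)

def layer_histogram(feats, bins=32, value_range=(-3.0, 3.0)):
    """Histogram (counts) of all activation values in one layer.

    Assumption: values are clipped to a fixed range so that both images
    share the same bin edges.
    """
    clipped = np.clip(feats, value_range[0], value_range[1])
    hist, _ = np.histogram(clipped, bins=bins, range=value_range)
    return hist

def intermediate_feature_distance(ref_layers, dist_layers):
    """Average the per-layer JS divergence between activation histograms."""
    scores = [
        js_divergence(layer_histogram(r), layer_histogram(d))
        for r, d in zip(ref_layers, dist_layers)
    ]
    return float(np.mean(scores))

# Placeholder activations: 3 layers, 196 tokens, 64 channels each (hypothetical shapes).
rng = np.random.default_rng(0)
ref_layers = [rng.standard_normal((196, 64)) for _ in range(3)]
mild_layers = [f + 0.05 * rng.standard_normal(f.shape) for f in ref_layers]
severe_layers = [f + 1.0 * rng.standard_normal(f.shape) for f in ref_layers]

d_mild = intermediate_feature_distance(ref_layers, mild_layers)
d_severe = intermediate_feature_distance(ref_layers, severe_layers)
print(f"mild: {d_mild:.5f}  severe: {d_severe:.5f}")
```

In practice the placeholder arrays would be replaced by activations captured from the chosen layers of a real network, and no training is involved: the score comes only from the distance between feature statistics.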
### Experimental results

- **Intermediate features outperform embeddings**: Metrics built on intermediate features perform better across multiple datasets and are notably more robust to geometric transformations.
- **DINOv1 performs best**: The DINOv1 model performs best in all experiments, indicating that its intermediate features are very effective for low-level perceptual similarity tasks.

### Conclusion

The study shows that using the intermediate features of a foundation model can significantly improve the accuracy and robustness of low-level perceptual similarity metrics. Future work will focus on fine-tuning these features to further improve performance.

### Formula examples

The distance measures involved in the paper include the following:

- **Cosine distance**:
  \[ \text{Cosine Distance} = 1-\frac{\mathbf{A}\cdot\mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|} \]
  where \(\mathbf{A}\) and \(\mathbf{B}\) are the two feature vectors being compared.
- **Wasserstein distance**:
  \[ W(P, Q)=\inf_{\gamma\in\Pi(P, Q)}\int_{\mathcal{X}\times\mathcal{X}}d(x, y)\,d\gamma(x, y) \]
  where \(P\) and \(Q\) are two probability distributions, \(\Pi(P, Q)\) is the set of joint distributions whose marginals are \(P\) and \(Q\), and \(d(x, y)\) is the distance function between sample points.

With these methods, the authors show the potential of intermediate features in low-level perceptual similarity metrics and point to a new direction for future research.
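The two formulas above can be made concrete with a short NumPy sketch. It rests on stated assumptions rather than on the paper's code: the function names `cosine_distance` and `wasserstein_1d` are hypothetical, and the Wasserstein case is restricted to one-dimensional empirical samples of equal size with \(d(x, y) = |x - y|\), where the optimal coupling pairs the sorted samples and the infimum reduces to a mean absolute difference.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cos(angle) between two feature vectors (flattened if needed)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def wasserstein_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1D empirical samples.

    For d(x, y) = |x - y|, matching the i-th smallest value of x with the
    i-th smallest value of y is an optimal coupling.
    """
    x = np.sort(np.asarray(x, dtype=np.float64).ravel())
    y = np.sort(np.asarray(y, dtype=np.float64).ravel())
    if x.size != y.size:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y)))

# Orthogonal vectors are at cosine distance 1; parallel vectors at (about) 0.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
print(cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))

# Shifting every sample by 1 moves the distribution by exactly 1.
print(wasserstein_1d([0.0, 1.0, 2.0], [3.0, 2.0, 1.0]))
```

Cosine distance compares the direction of two vectors and ignores their magnitude, while the Wasserstein distance compares two sets of values as distributions, which is why the order of the samples in the last call does not matter.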

Foundation Models Boost Low-Level Perceptual Similarity Metrics

Comparison of Full-Reference Image Quality Models for Optimization of Image Processing Systems

Comparison of Image Quality Models for Optimization of Image Processing Systems

A study of deep perceptual metrics for image quality assessment

DeepDC: Deep Distance Correlation as a Perceptual Image Quality Evaluator

From Distance to Dependency: A Paradigm Shift of Full-reference Image Quality Assessment

An Accurate Deep Convolutional Neural Networks Model for No-Reference Image Quality Assessment

Conformer and Blind Noisy Students for Improved Image Quality Assessment

Reference-Free Image Quality Metric for Degradation and Reconstruction Artifacts

A Multiscale Approach to Deep Blind Image Quality Assessment

Image Quality Assessment with Transformers and Multi-Metric Fusion Modules

TOPIQ: A Top-down Approach from Semantics to Distortions for Image Quality Assessment

Boosting External-Reference Image Quality Assessment by Content-Constrain Loss and Attention-based Adaptive Feature Fusion

ConIQA: A deep learning method for perceptual image quality assessment with limited data

Full-reference image quality assessment by combining global and local distortion measures

A Multibranch Network With Multilayer Feature Fusion for No-Reference Image Quality Assessment

Deep Neural Networks for No-Reference and Full-Reference Image Quality Assessment

You Only Train Once: A Unified Framework for Both Full-Reference and No-Reference Image Quality Assessment

Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment

Deep Quality: A Deep No-reference Quality Assessment System

Sparse feature fidelity for perceptual image quality assessment