Abstract: For full-reference image quality assessment (FR-IQA) using deep-learning approaches, the perceptual similarity score between a distorted image and a reference image is typically computed as a distance measure between features extracted from a pretrained CNN or, more recently, a Transformer network. Often, these intermediate features require further fine-tuning or processing with additional neural network layers to align the final similarity scores with human judgments. So far, most IQA models based on foundation models have primarily relied on the final layer or the embedding for the quality score estimation. In contrast, this work explores the potential of utilizing the intermediate features of these foundation models, which have largely been unexplored so far in the design of low-level perceptual similarity metrics. We demonstrate that the intermediate features are comparatively more effective. Moreover, without requiring any training, these metrics can outperform both traditional and state-of-the-art learned metrics by utilizing distance measures between the features.
What problem does this paper attempt to address?
This paper asks how the intermediate features of foundation models can be used to build more effective and more robust low-level perceptual similarity metrics. Specifically, the authors examine whether, in the full-reference image quality assessment (FR-IQA) task, the intermediate features of pre-trained foundation models (such as CLIP and DINO) work better than the final embeddings.
### Main problems
1. **Limitations of existing methods**:
- Most existing FR-IQA methods built on foundation models rely on the final layer or the embedding to estimate the quality score.
- These embeddings usually need further fine-tuning or processing to align with human judgment.
2. **Research motivation**:
- The authors observe that intermediate features capture local, low-level information (such as edges and textures), which may matter more for low-level perceptual similarity metrics.
- They therefore hypothesize that intermediate features of foundation models can yield more accurate and more robust low-level perceptual similarity metrics than the final embeddings.
### Solutions
To test this hypothesis, the authors carried out the following:
1. **Compare the effects of embeddings and intermediate features**:
- Evaluate metrics built on embeddings against metrics built on intermediate features across different datasets.
2. **Evaluate the performance of different foundation models**:
- Compare foundation models such as CLIP and DINO, with particular attention to robustness under geometric transformations (such as translation, scaling, and rotation).
3. **Explore different distance measurement methods**:
- Use multiple distribution and distance measures (such as Wasserstein distance and Jensen-Shannon divergence) to evaluate the effectiveness of the intermediate features.
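The third step compares feature distributions rather than individual vectors. As a minimal sketch of that idea, the snippet below bins two sets of scalar feature values into fixed-range histograms and computes their Jensen-Shannon divergence. The function names `histogram` and `js_divergence`, the bin count, and the value range are illustrative assumptions, not details taken from the paper.

```python
import math


def histogram(values, bins, lo, hi):
    """Normalized histogram of `values` over [lo, hi] with `bins` equal-width bins.

    Values outside the range are clamped into the first or last bin
    (an illustrative choice, not something the paper specifies).
    """
    if not values:
        raise ValueError("values must be non-empty")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
    With log base 2 the result lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0; b_i > 0 whenever a_i > 0 because b = m.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical feature values from a reference and a distorted image.
ref_feats = [0.1, 0.2, 0.4, 0.9]
dist_feats = [0.1, 0.6, 0.7, 0.9]
score = js_divergence(histogram(ref_feats, 4, 0.0, 1.0),
                      histogram(dist_feats, 4, 0.0, 1.0))
```

A score of 0 means the two histograms are identical; larger values indicate that the feature distributions differ more.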
### Experimental results
- **Intermediate features are superior to embeddings**: Metrics built on intermediate features perform better on multiple datasets and are notably more robust to geometric transformations.
- **DINOv1 performs the best**: The DINOv1 model performs the best in all experiments, indicating that its intermediate features are very effective for low-level perceptual similarity tasks.
### Conclusion
This research shows that using the intermediate features of foundation models can significantly improve the accuracy and robustness of low-level perceptual similarity metrics. Future work will focus on fine-tuning these features to further improve performance.
### Formula examples
The distance measures used in the paper include the following:
- **Cosine distance**:
\[
\text{Cosine Distance} = 1-\frac{\mathbf{A}\cdot\mathbf{B}}{\|\mathbf{A}\|\|\mathbf{B}\|}
\]
- **Wasserstein distance**:
\[
W(P, Q)=\inf_{\gamma\in\Pi(P, Q)}\int_{\mathcal{X}\times\mathcal{X}}d(x, y)\,d\gamma(x, y)
\]
where \(P\) and \(Q\) are two probability distributions, \(\Pi(P, Q)\) is the set of joint distributions whose marginals are \(P\) and \(Q\), and \(d(x, y)\) is the distance function between sample points.
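Both formulas can be illustrated with a short, hedged sketch in plain Python. It assumes features have already been flattened to 1-D lists of floats. For the Wasserstein case it handles only the special case of two equal-size 1-D samples with \(d(x, y) = |x - y|\), where the optimal coupling matches sorted values. The function names are illustrative and this is not the authors' implementation.

```python
import math


def cosine_distance(a, b, eps=1e-12):
    """1 - (a . b) / (||a|| * ||b||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero for all-zero vectors.
    return 1.0 - dot / max(norm_a * norm_b, eps)


def wasserstein_1d(x, y):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    In one dimension with d(x, y) = |x - y|, the optimal transport plan
    pairs the i-th smallest value of x with the i-th smallest value of y.
    """
    if len(x) != len(y) or not x:
        raise ValueError("samples must be non-empty and of equal size")
    xs, ys = sorted(x), sorted(y)
    return sum(abs(u - v) for u, v in zip(xs, ys)) / len(xs)


# Hypothetical flattened features from a reference and a distorted image.
ref = [0.0, 1.0, 2.0]
dist = [1.0, 2.0, 3.0]
d_cos = cosine_distance(ref, dist)
d_w = wasserstein_1d(ref, dist)  # every value shifts by 1, so the distance is 1
```

The cosine distance compares the direction of two feature vectors, whereas the Wasserstein distance compares the distributions of their values and ignores the order in which values appear.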
With these methods, the authors demonstrate the potential of intermediate features for low-level perceptual similarity metrics and point to a new direction for future research.