Abstract: Vision transformers (ViTs) excel in computer vision at modeling long-range dependencies, yet they face two key challenges in image quality assessment (IQA): they discard fine details during patch embedding, and they require extensive training data because they lack inductive biases. In this study, we propose a Global-Local progressive INTegration network for IQA, called GlintIQA, which addresses these issues through three key components: 1) Hybrid feature extraction combines a ViT-based global feature extractor (VGFE) and a convolutional neural network (CNN)-based local feature extractor (CLFE) to capture global coarse-grained features and local fine-grained features, respectively. Incorporating CNNs mitigates the patch-level information loss and the weak inductive biases inherent to ViT architectures. 2) Progressive feature integration uses diverse kernel sizes in the embedding to spatially align coarse- and fine-grained features, then progressively aggregates these features by interactively stacking channel-wise attention and spatial enhancement modules to build effective quality-aware representations. 3) A content similarity-based labeling approach automatically assigns quality labels to images with diverse content based on subjective quality scores. This addresses the scarcity of labeled training data in synthetic datasets and bolsters model generalization. The experimental results demonstrate the efficacy of our approach, which yields an average SROCC gain of 5.04% on cross-authentic dataset evaluations. Moreover, our model and its counterpart pre-trained on the proposed dataset achieve improvements of 5.40% and 13.23%, respectively, on cross-synthetic dataset evaluations. The code and the proposed dataset will be released at https://github.com/XiaoqiWang/GlintIQA.
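The gains above are reported in SROCC (Spearman rank-order correlation coefficient), which measures how well predicted quality scores preserve the ranking of subjective scores. As a minimal sketch of the standard definition (Pearson correlation computed on ranks, with tied values sharing their mean rank), the following may help; the function names `average_ranks` and `srocc` are illustrative assumptions and are not taken from the GlintIQA code.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srocc(predicted, subjective):
    """Spearman rank-order correlation between two score sequences."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rp, rs = average_ranks(predicted), average_ranks(subjective)
    mp, ms = sum(rp) / len(rp), sum(rs) / len(rs)
    cov = sum((a - mp) * (b - ms) for a, b in zip(rp, rs))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_s = sum((b - ms) ** 2 for b in rs)
    if var_p == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_p * var_s)
```

Because only ranks matter, any monotonically increasing mapping of the predictions leaves the SROCC unchanged, which is why the metric is commonly used to compare IQA models whose output scales differ.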

Boosting Image Quality Assessment Through Efficient Transformer Adaptation with Local Feature Enhancement

Local Distortion Aware Efficient Transformer Adaptation for Image Quality Assessment

Image Quality Assessment with Transformers and Multi-Metric Fusion Modules

MSTRIQ: No Reference Image Quality Assessment Based on Swin Transformer with Multi-Stage Fusion

Auxiliary Information Guided Self-Attention for Image Quality Assessment

Global-Local Progressive Integration Network for Blind Image Quality Assessment

DistilIQA: Distilling vision transformers for no-reference perceptual CT image quality assessment

Transformer-based No-Reference Image Quality Assessment via Supervised Contrastive Learning

TOPIQ: A Top-down Approach from Semantics to Distortions for Image Quality Assessment

Blind Image Quality Index with Cross-Domain Interaction and Cross-Scale Integration

Blind Image Quality Assessment for In-the-wild Images by Integrating Distorted Patch Selection and Multi-Scale-and-granularity Fusion

Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality Token

DTSN: No-Reference Image Quality Assessment Via Deformable Transformer and Semantic Network

Cross-IQA: Unsupervised Learning for Image Quality Assessment

ARET-IQA: an Aspect-Ratio-Embedded Transformer for Image Quality Assessment

TTL-IQA: Transitive Transfer Learning Based No-Reference Image Quality Assessment

DMvLNet: Deep Multiview Learning Network for Blindly Accessing Image Quality

No-Reference Image Quality Assessment Via Local and Global Multi-Scale Feature Integration

No-reference image quality assessment based on global awareness

Unifying Dual-Attention and Siamese Transformer Network for Full-Reference Image Quality Assessment

DSMix: Distortion-Induced Sensitivity Map Based Pre-training for No-Reference Image Quality Assessment