Abstract: Quality assessment and aesthetics assessment aim to evaluate the perceived quality and aesthetics of visual content. Current learning-based methods suffer greatly from the scarcity of labeled data and usually perform sub-optimally in terms of generalization. Masked image modeling (MIM) has achieved noteworthy advancements across various high-level tasks (e.g., classification, detection). In this work, we take a novel perspective and investigate its capabilities in terms of quality- and aesthetics-awareness. To this end, we propose Quality- and aesthetics-aware pretraining (QPT V2), the first pretraining framework based on MIM that offers a unified solution to quality and aesthetics assessment. To perceive the high-level semantics and fine-grained details, pretraining data is curated. To comprehensively encompass quality- and aesthetics-related factors, degradation is introduced. To capture multi-scale quality and aesthetic information, model structure is modified. Extensive experimental results on 11 downstream benchmarks clearly show the superior performance of QPT V2 in comparison with current state-of-the-art approaches and other pretraining paradigms. Code and models will be released at https://github.com/KeiChiTse/QPT-V2.

What problem does this paper attempt to address?

The main problems this paper addresses are the scarcity of annotated data and the weak generalization of existing methods for Visual Scoring (VS) tasks. Specifically, tasks such as Image Quality Assessment (IQA), Video Quality Assessment (VQA), and Image Aesthetic Assessment (IAA) require evaluating the quality and aesthetics of visual content. Because annotated data is scarce, existing learning-based methods generalize poorly.

### Specific Goals of the Paper

1. **Unify Visual Scoring Tasks**: The paper proposes a new pre-training framework, QPT V2, which aims to provide a unified solution for IQA, VQA, and IAA through Masked Image Modeling (MIM).
2. **Improve the Perceptual Ability of the Model**: To better capture high-resolution and high-frequency details, the paper makes improvements in three aspects (pre-training data, degradation strategy, and model structure) so as to enhance the model's ability to perceive quality and aesthetic information.
3. **Verify the Effectiveness of MIM**: The paper verifies the potential of MIM in visual scoring tasks and demonstrates its superior performance on multiple benchmarks through experiments.

### Main Contributions

1. **Verify the Effectiveness of MIM in Visual Scoring Tasks for the First Time**: The paper decomposes MIM into three key components (data, degradation, and model) and studies the impact of each.
2. **Propose the QPT V2 Framework**: This is the first MIM-based pre-training framework to provide a unified solution for visual scoring tasks. Through targeted improvements to data, degradation, and model, it enhances MIM's ability to acquire prior knowledge.
3. **Achieve SOTA Results on Multiple Benchmarks**: QPT V2 achieves the best or near-best results on 11 benchmarks, surpassing other pre-training paradigms. Extensive ablation experiments verify the effectiveness of each improvement.
### Method Overview

- **Data**: The paper uses datasets with high resolution (HR) and high foreground coverage (HFC) so that the model can capture rich textures and local structures.
- **Degradation**: Multiple degradation types and combination strategies are introduced to simulate the variety of distortions found in the real world, thereby enhancing the model's ability to perceive different distortions.
- **Model**: A multi-scale feature-fusion module is adopted, enabling the model to perceive quality and aesthetic information at different scales, which is closer to the multi-scale way the human visual system evaluates images.

### Experimental Results

The paper conducts experiments on multiple benchmarks of visual scoring tasks, including both synthetic and real-world datasets. The results show that QPT V2 outperforms existing state-of-the-art methods in most cases, especially on real-world data.

In conclusion, by proposing the QPT V2 framework, this paper addresses the problems of scarce annotated data and insufficient generalization in visual scoring tasks, providing new ideas and methods for future related research.
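
To make the interplay of degradation and masking concrete, here is a minimal NumPy sketch of one MIM pretraining step with a degradation stage. It is not the authors' implementation. Several things are assumptions for illustration only: the two degradation types (Gaussian noise and a 3x3 box blur), the random composition rule, the 75% mask ratio, the 8x8 patch size, the choice to use the degraded image as the reconstruction target, and all function names. The encoder and decoder are replaced by an all-zeros placeholder prediction, and the multi-scale feature-fusion module is not modeled.

```python
import numpy as np

def patchify(img, patch):
    """Split an (H, W, C) image into (N, patch*patch*C) flattened patches."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def add_gaussian_noise(img, rng, sigma=0.05):
    """Hypothetical degradation: additive Gaussian noise, clipped to [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def box_blur(img, rng=None):
    """Hypothetical degradation: 3x3 mean filter with edge padding."""
    h, w, _ = img.shape
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(img)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def degrade(img, rng, max_ops=2):
    """Apply a random composition of one or more degradations (assumed rule)."""
    ops = [add_gaussian_noise, box_blur]
    k = int(rng.integers(1, min(max_ops, len(ops)) + 1))
    for idx in rng.choice(len(ops), size=k, replace=False):
        img = ops[idx](img, rng)
    return img

def random_mask(num_patches, mask_ratio, rng):
    """Boolean mask over patches; True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:num_masked]] = True
    return mask

def masked_mse(pred, target, mask):
    """Mean squared error computed over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return float(per_patch[mask].mean())

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))                    # stand-in for a pretraining image
target = patchify(degrade(img, rng), patch=8)    # assumed target: degraded image
mask = random_mask(target.shape[0], 0.75, rng)
visible = target[~mask]                          # what an encoder would receive
pred = np.zeros_like(target)                     # placeholder for decoder output
loss = masked_mse(pred, target, mask)
```

In a real setup, `pred` would come from an encoder-decoder network that sees only `visible`, and the loss would be backpropagated through it; the sketch only shows how the data flows between the degradation, masking, and loss stages.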

QPT V2: Masked Image Modeling Advances Visual Scoring
