QPT V2: Masked Image Modeling Advances Visual Scoring

Qizhi Xie, Kun Yuan, Yunpeng Qu, Mingda Wu, Ming Sun, Chao Zhou, Jihong Zhu
2024-07-23
Abstract: Quality assessment and aesthetics assessment aim to evaluate the perceived quality and aesthetics of visual content. Current learning-based methods suffer greatly from the scarcity of labeled data and usually perform sub-optimally in terms of generalization. Masked image modeling (MIM) has achieved noteworthy advancements across various high-level tasks (e.g., classification, detection, etc.). In this work, we take a novel perspective and investigate its capabilities in terms of quality- and aesthetics-awareness. To this end, we propose Quality- and aesthetics-aware pretraining (QPT V2), the first pretraining framework based on MIM that offers a unified solution to quality and aesthetics assessment. To perceive the high-level semantics and fine-grained details, pretraining data is curated. To comprehensively encompass quality- and aesthetics-related factors, degradation is introduced. To capture multi-scale quality and aesthetic information, model structure is modified. Extensive experimental results on 11 downstream benchmarks clearly show the superior performance of QPT V2 in comparison with current state-of-the-art approaches and other pretraining paradigms. Code and models will be released at https://github.com/KeiChiTse/QPT-V2.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
The paper addresses two problems in existing Visual Scoring (VS) tasks: the scarcity of annotated data and the resulting weak generalization. Image Quality Assessment (IQA), Video Quality Assessment (VQA), and Image Aesthetic Assessment (IAA) all require evaluating the quality and aesthetics of visual content, but existing learning-based methods generalize poorly because labeled data is scarce.

### Specific Goals of the Paper

1. **Unify Visual Scoring Tasks**: The paper proposes a new pre-training framework, QPT V2, that aims to provide a unified solution for IQA, VQA, and IAA through Masked Image Modeling (MIM).
2. **Improve the Perceptual Ability of the Model**: To better capture high-resolution and high-frequency details, the paper makes improvements in three aspects (pre-training data, degradation strategy, and model structure) so as to enhance the model's ability to perceive quality- and aesthetics-related information.
3. **Verify the Effectiveness of MIM**: The paper examines the potential of MIM in visual scoring tasks and demonstrates its performance on multiple benchmarks through experiments.

### Main Contributions

1. **Verify the Effectiveness of MIM in Visual Scoring Tasks for the First Time**: The paper decomposes MIM into three key components (data, degradation, and model) and studies the impact of each.
2. **Propose the QPT V2 Framework**: This is the first pre-training framework based on MIM that provides a unified solution for visual scoring tasks. Through targeted improvements to data, degradation, and model, it enhances MIM's ability to acquire prior knowledge.
3. **Achieve SOTA Results on Multiple Benchmarks**: QPT V2 achieves the best or near-best results on 11 benchmarks, surpassing other pre-training paradigms. Extensive ablation experiments verify the effectiveness of each improvement.
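
To make the MIM objective referred to above more concrete, here is a minimal sketch of random patch masking with a reconstruction loss computed only on the masked patches, in the style of common MIM setups. The function names `random_patch_mask` and `masked_mse`, the 0.75 mask ratio, and the flat-list patch representation are illustrative assumptions; this summary does not state the mask ratio or the exact loss used by QPT V2.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a patch hidden from the encoder.

    Hypothetical helper: the mask ratio is an assumption, not a value
    taken from the paper.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_mse(pred, target, mask):
    """Mean squared error over the pixels of masked patches only.

    pred and target are lists of patches; each patch is a list of floats.
    Visible patches (mask value False) do not contribute to the loss.
    """
    total, count = 0.0, 0
    for pred_patch, target_patch, is_masked in zip(pred, target, mask):
        if not is_masked:
            continue
        for p, t in zip(pred_patch, target_patch):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0
```

In a real pipeline the predictions would come from a decoder that only sees the visible patches; here `pred` is simply passed in so the loss can be inspected in isolation.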
### Method Overview

- **Data**: The paper uses datasets with high resolution (HR) and high foreground coverage (HFC) so that the model can capture rich textures and local structures.
- **Degradation**: Multiple degradation types and combination strategies are introduced to simulate the variety of distortions found in the real world, thereby enhancing the model's ability to perceive different distortions.
- **Model**: A multi-scale feature-fusion module is adopted, enabling the model to perceive quality and aesthetic information at different scales, which is closer to the multi-scale way the human visual system evaluates images.

### Experimental Results

The paper conducts experiments on multiple visual scoring benchmarks, including both synthetic and real-world datasets. The results show that QPT V2 outperforms existing state-of-the-art methods in most cases, especially on real-world data.

In conclusion, by proposing the QPT V2 framework, the paper addresses the problems of scarce annotated data and insufficient generalization in visual scoring tasks, and offers new ideas and methods for future related research.
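
As an illustration of the degradation component described in the method overview, the sketch below randomly combines a few simple distortions on a grayscale image stored as a list of rows with values in [0, 1]. The three operations (`add_noise`, `box_blur`, `quantize`), their parameters, and the `random_degradation` combination rule are all hypothetical stand-ins. This summary does not list the degradation types or the combination strategy the paper actually uses, nor how the degraded images enter pre-training.

```python
import random


def add_noise(img, rng, sigma=0.05):
    """Add Gaussian noise and clamp each pixel back into [0, 1]."""
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row]
            for row in img]


def box_blur(img, rng=None):
    """3x3 box blur; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def quantize(img, rng=None, levels=4):
    """Reduce the number of intensity levels (a crude compression proxy)."""
    return [[round(px * (levels - 1)) / (levels - 1) for px in row]
            for row in img]


def random_degradation(img, rng, max_ops=2):
    """Apply between 1 and max_ops distinct degradations in random order.

    Returns the degraded image and the names of the applied operations.
    """
    ops = [add_noise, box_blur, quantize]
    chosen = rng.sample(ops, rng.randint(1, max_ops))
    for op in chosen:
        img = op(img, rng)
    return img, [op.__name__ for op in chosen]
```

Calling `random_degradation(img, random.Random(0))` returns a new image of the same size together with the list of operations that were applied, which makes it easy to log which distortions a given training sample received.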