Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning

Yuti Liu, Shice Liu, Junyuan Gao, Pengtao Jiang, Hao Zhang, Jinwei Chen, Bo Li
2024-12-17
Abstract: Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values, and identifying its highlights and areas for improvement. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets, thus impairing in-depth aesthetic comprehension. Despite efforts to overcome this challenge through the application of Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA purposes. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and capitalizes on a wealth of unlabeled data in a self-supervised manner to structurally and functionally enhance aesthetic ability. The empirical evidence indicates that, when accompanied by extensive instruction tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Remarkably, it also demonstrates zero-shot learning capabilities in the emerging task of aesthetic suggesting. Furthermore, for personalized image aesthetic assessment, we harness the potential of in-context learning and showcase its inherent advantages.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses several key problems in Image Aesthetic Assessment (IAA):

1. **Single-task limitations**: Traditional methods usually focus on a single aesthetic task, such as Aesthetic Scoring (AS), Aesthetic Commenting (AC), or Personalized Image Aesthetic Assessment (PIAA). As a result, the model learns little about how these tasks relate to one another and is prone to overfitting the specific task it was trained on.
2. **Scarcity of annotated data**: Aesthetic assessment requires a large amount of annotated data, but existing methods rely on limited annotated datasets, which makes it difficult for a model to learn aesthetic features in depth. Multimodal Large Language Models (MLLMs) have been applied to IAA, but their use for aesthetic tasks is still immature.
3. **Lack of comprehensive aesthetic information integration**: Existing models rely mainly on semantic features and ignore a large amount of aesthetic information, especially the information carried by multi-scale features. In addition, the quality and diversity of pseudo-labels limit the effectiveness of self-supervised learning.

To solve these problems, the authors propose a new model named Comprehensive Aesthetic Large language Model (CALM). Its main contributions are:

- **Multi-scale text-guided self-supervised learning technique**: By introducing a Multi-scale Feature Alignment Module (MFAM), CALM extracts multi-level aesthetic features from the visual encoder and uses a large amount of unannotated data for self-supervised learning. Specifically, MFAM captures aesthetic features at multiple levels and generates accurate pseudo-labels in a text-guided manner, thereby improving the learning efficiency and accuracy of the model.
- **Zero-shot learning ability**: CALM demonstrates zero-shot learning ability on a new task, aesthetic suggesting, which indicates strong aesthetic understanding and analysis ability.
- **Cross-task comprehensive aesthetic insight**: Through two-stage instruction fine-tuning, CALM performs well on multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized aesthetic assessment, with particularly strong results on personalized aesthetic assessment.

In summary, this paper aims to overcome the limitations of existing IAA methods and to achieve a more comprehensive and in-depth aesthetic understanding by proposing a new multimodal large language model together with a dedicated learning method.
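To make the idea of text-guided pseudo-labeling over multi-level features more concrete, here is a minimal toy sketch. It is not the paper's MFAM or its training procedure: the function names (`fuse_multi_scale`, `pseudo_label`), the concatenate-and-normalize fusion, the cosine-similarity matching, and the hand-written vectors are all assumptions chosen for illustration. In a real system the feature vectors would come from a visual encoder and the prompt embeddings from a text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_multi_scale(features_per_level):
    """Hypothetical fusion: normalize each encoder level, concatenate, renormalize.

    Normalizing per level keeps a level with large activations from
    dominating the fused representation.
    """
    fused = []
    for level in features_per_level:
        fused.extend(l2_normalize(level))
    return l2_normalize(fused)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def pseudo_label(image_feature, prompt_embeddings):
    """Pick the text prompt whose embedding best matches the image feature.

    Returns (prompt, similarity). The chosen prompt serves as a pseudo-label
    for an otherwise unlabeled image.
    """
    scores = {p: cosine(image_feature, e) for p, e in prompt_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy data (made up): one shallow-level and one deep-level feature vector.
levels = [[1.0, 0.0], [0.0, 2.0]]
fused = fuse_multi_scale(levels)

# Toy prompt embeddings in the same 4-dimensional fused space (made up).
prompts = {
    "well-balanced composition": [1.0, 0.0, 0.0, 1.0],
    "overexposed lighting": [0.0, 1.0, 1.0, 0.0],
}

label, score = pseudo_label(fused, prompts)
print(label, round(score, 3))
```

In this toy setup the fused feature lines up with the first prompt, so that prompt is selected as the pseudo-label. A practical pipeline would likely also discard low-similarity matches, since the summary above notes that pseudo-label quality limits self-supervised learning.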