Abstract: Image Aesthetic Assessment (IAA) is a vital and intricate task that entails analyzing and assessing an image's aesthetic values, and identifying its highlights and areas for improvement. Traditional methods of IAA often concentrate on a single aesthetic task and suffer from inadequate labeled datasets, thus impairing in-depth aesthetic comprehension. Despite efforts to overcome this challenge through the application of Multi-modal Large Language Models (MLLMs), such models remain underdeveloped for IAA purposes. To address this, we propose a comprehensive aesthetic MLLM capable of nuanced aesthetic insight. Central to our approach is an innovative multi-scale text-guided self-supervised learning technique. This technique features a multi-scale feature alignment module and capitalizes on a wealth of unlabeled data in a self-supervised manner to structurally and functionally enhance aesthetic ability. The empirical evidence indicates that, accompanied by extensive instruction tuning, our model sets new state-of-the-art benchmarks across multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized image aesthetic assessment. Remarkably, it also demonstrates zero-shot learning capabilities in the emerging task of aesthetic suggesting. Furthermore, for personalized image aesthetic assessment, we harness the potential of in-context learning and showcase its inherent advantages.

What problem does this paper attempt to address?

This paper attempts to solve several key problems in the field of Image Aesthetic Assessment (IAA):

1. **Limitations of single-task methods**: Traditional methods usually focus on a single aesthetic task, such as Aesthetic Scoring (AS), Aesthetic Commenting (AC), or Personalized Image Aesthetic Assessment (PIAA). As a result, models do not learn the associations between tasks and are prone to overfitting to a specific task.
2. **Scarcity of annotated data**: Aesthetic assessment requires a large amount of annotated data, but existing methods often rely on limited annotated datasets, making it difficult for a model to understand aesthetic features in depth. Although Multimodal Large Language Models (MLLMs) have been used to improve IAA, their application to aesthetic tasks is still immature.
3. **Lack of comprehensive aesthetic information integration**: Existing models rely mainly on semantic features and ignore a large amount of aesthetic information, especially when dealing with multi-scale features. In addition, the quality and diversity of pseudo-labels limit the effectiveness of self-supervised learning.

To solve these problems, the authors propose a new model named Comprehensive Aesthetic Large language Model (CALM). The main contributions of CALM include:

- **Multi-scale text-guided self-supervised learning technique**: By introducing a Multi-scale Feature Alignment Module (MFAM), CALM can extract multi-level aesthetic features from the visual encoder and use a large amount of unannotated data for self-supervised learning. Specifically, MFAM captures aesthetic features at multiple levels and generates accurate pseudo-labels in a text-guided manner, thereby improving the learning efficiency and accuracy of the model.
- **Zero-shot learning ability**: CALM demonstrates zero-shot learning ability on new tasks (such as aesthetic suggestions), which indicates strong aesthetic understanding and analysis ability.
- **Cross-task comprehensive aesthetic insight**: Through two-stage instruction fine-tuning, CALM achieves excellent performance on multiple tasks, including aesthetic scoring, aesthetic commenting, and personalized aesthetic assessment, with particularly strong results on the personalized task.

In summary, this paper aims to overcome the limitations of existing IAA methods and achieve more comprehensive and in-depth aesthetic understanding by proposing a new multimodal large language model and its learning method.
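The summary above does not spell out how MFAM fuses features or produces pseudo-labels, so the following is only a minimal sketch of the general idea under stated assumptions: features from several encoder layers are normalized and averaged into one vector, which is then matched against text embeddings of candidate aesthetic descriptions by cosine similarity. The function names, the averaging-based fusion, the toy vectors, and the label strings are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_multi_scale(layer_features):
    """Assumed fusion rule: average the unit-normalized features of each layer."""
    normalized = [l2_normalize(f) for f in layer_features]
    dim = len(normalized[0])
    fused = [sum(f[i] for f in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(fused)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def text_guided_pseudo_label(layer_features, text_embeddings):
    """Pick the text description whose embedding best matches the fused feature."""
    fused = fuse_multi_scale(layer_features)
    scores = {label: cosine(fused, emb) for label, emb in text_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for features taken from shallow, middle, and deep encoder layers.
layer_features = [
    [1.0, 0.0, 0.0],
    [0.8, 0.2, 0.0],
    [0.6, 0.0, 0.4],
]

# Toy stand-ins for text embeddings of candidate aesthetic descriptions.
text_embeddings = {
    "balanced composition": [1.0, 0.0, 0.0],
    "harsh lighting": [0.0, 1.0, 0.0],
    "muted colors": [0.0, 0.0, 1.0],
}

label, scores = text_guided_pseudo_label(layer_features, text_embeddings)
print(label)
```

In a real system the toy vectors would be replaced by outputs of a visual encoder and a text encoder that share an embedding space, and the selected description would serve as a training target for the unlabeled image.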

Advancing Comprehensive Aesthetic Insight with Multi-Scale Text-Guided Self-Supervised Learning

Self-Adaptive Computational Aesthetic Evaluation of Chinese Ink Paintings Based on Deep Learning

Inkthetics: A Comprehensive Computational Model for Aesthetic Evaluation of Chinese Ink Paintings

AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception

A Comprehensive Survey on Computational Aesthetic Evaluation of Visual Art Images: Metrics and Challenges

UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

Collaborative and Attentive Learning for Personalized Image Aesthetic Assessment

Revisiting Image Aesthetic Assessment via Self-Supervised Feature Learning

Multi-modal Learnable Queries for Image Aesthetics Assessment

AACP: Aesthetics assessment of children's paintings based on self-supervised learning

Semantic and style based multiple reference learning for artistic and general image aesthetic assessment

Textual Aesthetics in Large Language Models

Learning Image Aesthetic Assessment from Object-level Visual Components

User-Guided Personalized Image Aesthetic Assessment based on Deep Reinforcement Learning

UMAAF: Unveiling Aesthetics via Multifarious Attributes of Images

Image Aesthetic Assessment Assisted by Attributes Through Adversarial Learning

Neural aesthetic image reviewer

Context-aware Attention Network for Predicting Image Aesthetic Subjectivity

Personality-assisted Multi-task Learning for Generic and Personalized Image Aesthetics Assessment

Personalized Image Aesthetics Assessment via Meta-Learning With Bilevel Gradient Optimization