Abstract: Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset.
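
The combination of contrastive and generative objectives mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature of 0.07, and the weights `alpha` and `beta` are assumptions chosen for illustration, and plain Python lists stand in for the encoder and decoder outputs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: image i should match comment i within the batch."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(t) for t in text_embs]
    n = len(imgs)
    sim = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
           for i in range(n)]
    # Image-to-text: softmax over each row; text-to-image: over each column.
    i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j]
               for j in range(n)) / n
    return 0.5 * (i2t + t2i)

def captioning_loss(token_logits, target_ids):
    """Mean negative log-likelihood of the comment tokens."""
    nll = [-log_softmax(logits)[t]
           for logits, t in zip(token_logits, target_ids)]
    return sum(nll) / len(nll)

def pretraining_loss(image_embs, text_embs, token_logits, target_ids,
                     alpha=1.0, beta=1.0):
    # alpha and beta are hypothetical weights for the two objectives.
    return (alpha * contrastive_loss(image_embs, text_embs)
            + beta * captioning_loss(token_logits, target_ids))

# Toy batch: two images with correctly paired vs. swapped comment embeddings.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
swapped = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = contrastive_loss(image_embs, aligned)
loss_swapped = contrastive_loss(image_embs, swapped)
```

In this toy batch the correctly paired comments give a much lower contrastive loss than the swapped ones, which is the signal that pulls matching image and comment embeddings together.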

What problem does this paper attempt to address?

This paper addresses several key challenges in image aesthetic assessment (IAA):

1. **Subjectivity**: Aesthetic judgment is inherently subjective and depends on multiple factors, such as composition, color usage, photographic style, and subject matter. Methods based on human rating scores therefore tend to oversimplify visual aesthetic information.
2. **Data utilization efficiency**: Existing IAA methods rely mainly on manually annotated scores. These are time-consuming and costly to collect, and they lack the context needed to explain why an image does or does not have aesthetic value.
3. **Multimodal information fusion**: User comments provide more comprehensive information and naturally express people's views and preferences about image aesthetics. How to use this text effectively to improve IAA models remains a challenge.

To this end, the paper proposes a framework named VILA (Vision-Language Aesthetics), which addresses these problems as follows:

- **Pretraining on image-comment pairs**: Contrastive and generative objectives are used to learn rich aesthetic semantics from image-comment pairs, without manual labels.
- **Lightweight rank-based adapter**: A lightweight adapter uses text as an anchor to learn the concept of aesthetic ranking, efficiently adapting the pretrained model to downstream IAA tasks.
- **Zero-shot capability**: The pretrained aesthetic vision-language model performs strongly on zero-shot aesthetic tasks (such as zero-shot style classification and zero-shot IAA), surpassing many supervised baselines.

The VILA framework has two stages:

1. **Pretraining stage**:
   - Vision-language pretraining is carried out on image-comment pairs, exploiting the fine-grained knowledge in aesthetic comments through contrastive learning and text generation objectives.
   - The CoCa architecture is adopted, which combines contrastive learning and image-to-text generation so that both objectives are optimized in a single framework.
2. **Finetuning stage**:
   - The lightweight rank-based adapter (VILA-R) adjusts the frozen image embeddings by adding a feature residual, moving high-quality images closer to the "good image" text anchor and low-quality images farther from it.
   - Only a small number of parameters are tuned, which improves IAA performance while retaining the zero-shot and generative capabilities of the pretrained model.

With these methods, VILA outperforms prior work on image aesthetic captioning on the AVA-Captions dataset and reaches state-of-the-art IAA performance on the AVA dataset.
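
The finetuning stage described above (a residual added to frozen image embeddings, scored against a "good image" text anchor) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the residual is assumed to be a learnable linear map `H`, the score is assumed to be a cosine similarity, and the pairwise hinge loss with a margin of 0.1 is one plausible form of a ranking objective. All names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def adapt(image_emb, H):
    """Add a learned residual H @ v to a frozen embedding v, then re-normalize."""
    residual = [dot(row, image_emb) for row in H]
    return normalize([v + r for v, r in zip(image_emb, residual)])

def aesthetic_score(image_emb, anchor_emb, H):
    """Cosine similarity between the adapted image and the text anchor."""
    return dot(adapt(image_emb, H), normalize(anchor_emb))

def pairwise_ranking_loss(scores, ratings, margin=0.1):
    """Hinge loss: a higher-rated image should outscore a lower-rated one by `margin`."""
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if ratings[i] > ratings[j]:
                total += max(0.0, margin - scores[i] + scores[j])
                pairs += 1
    return total / pairs if pairs else 0.0

# Toy setup: a 2-D stand-in for the frozen text embedding of "good image".
anchor = [1.0, 0.0]
frozen = [[0.8, 0.6], [0.6, 0.8]]      # frozen image embeddings
H_zero = [[0.0, 0.0], [0.0, 0.0]]      # adapter at initialization: no residual
H_push = [[0.5, 0.0], [0.0, 0.0]]      # a residual that moves images toward the anchor

scores = [aesthetic_score(v, anchor, H_zero) for v in frozen]
loss_ok = pairwise_ranking_loss(scores, ratings=[7.0, 4.0])
loss_bad = pairwise_ranking_loss(scores, ratings=[4.0, 7.0])
```

With a zero residual the adapter leaves the pretrained embeddings unchanged, so only `H` needs to be trained. The loss is zero when the score order already agrees with the human ratings by at least the margin, and positive otherwise.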

VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining
