VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining

Junjie Ke, Keren Ye, Jiahui Yu, Yonghui Wu, Peyman Milanfar, Feng Yang
2023-06-03
Abstract: Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses three key challenges in Image Aesthetic Assessment (IAA):

1. **Subjectivity**: IAA is inherently subjective and depends on multiple factors, such as composition, color usage, photographic style, and subject matter. Methods based only on human rating scores can therefore oversimplify visual aesthetic information.
2. **Data utilization efficiency**: Existing IAA methods rely mainly on manually annotated scores. These are time-consuming and costly to collect, and they lack context: a score alone cannot explain why an image does or does not have aesthetic value.
3. **Multi-modal information fusion**: User comments provide more comprehensive information and are a natural way for people to express views and preferences about image aesthetics. The challenge is how to use this text effectively to improve the performance of IAA models.

To this end, the paper proposes a framework named VILA (Vision-Language Aesthetics), which addresses these problems in the following ways:

- **Pre-training on image-comment pairs**: Contrastive and generative objectives are used to learn rich aesthetic semantics from image-comment pairs, without manual labels.
- **Lightweight ranking adapter**: A lightweight ranking adapter uses text as an anchor to learn the concept of aesthetic ranking, so the pre-trained model can be adapted efficiently to downstream IAA tasks.
- **Zero-shot learning ability**: The pre-trained aesthetic vision-language model performs strongly on zero-shot aesthetic tasks (such as zero-shot style classification and zero-shot IAA), surpassing many supervised baselines.

The VILA framework is divided into two stages:

1. **Pre-training stage**:
   - Vision-language pre-training is carried out on image-comment pairs, so that the fine-grained knowledge in aesthetic comments is exploited through a contrastive objective and a text-generation objective.
   - The CoCa architecture is adopted. It combines contrastive learning with image-to-text generation, so both objectives are optimized simultaneously in a single framework.
2. **Fine-tuning stage**:
   - The lightweight ranking adapter (VILA-R) adjusts the frozen image embeddings by adding a feature residual, moving high-quality images closer to the "good image" text anchor and low-quality images farther away from it.
   - Only a small number of parameters need to be tuned to improve IAA performance, and the zero-shot and generation capabilities of the pre-trained model are retained.

With these methods, VILA outperforms prior work on image aesthetic captioning on the AVA-Captions dataset and reaches state-of-the-art IAA performance on the AVA dataset.
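The pre-training objective described above, a contrastive term plus an image-to-text generation term optimized together, can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the temperature of 0.07, the averaging of the two contrastive directions, and the loss weights `w_con` and `w_cap` are all assumptions made for illustration, and the embeddings and token logits are stand-ins for what the image and text towers would produce.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE: the i-th image and the i-th comment are the positive pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # Temperature-scaled cosine similarities, rows = images, columns = comments.
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    image_to_text = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    sim_t = [list(col) for col in zip(*sim)]
    text_to_image = -sum(log_softmax(sim_t[k])[k] for k in range(n)) / n
    return 0.5 * (image_to_text + text_to_image)  # averaging is an assumption

def captioning_loss(token_logits, target_ids):
    """Mean next-token cross-entropy for one comment."""
    nll = -sum(log_softmax(logits)[t] for logits, t in zip(token_logits, target_ids))
    return nll / len(target_ids)

def pretraining_loss(image_embs, text_embs, token_logits, target_ids,
                     w_con=1.0, w_cap=1.0):
    """Weighted sum of both objectives; the weights are hypothetical."""
    return (w_con * contrastive_loss(image_embs, text_embs)
            + w_cap * captioning_loss(token_logits, target_ids))
```

In this sketch, matched image-comment pairs drive the contrastive term toward zero, while the captioning term penalizes the decoder for assigning low probability to the actual comment tokens.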
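The ranking adapter from the fine-tuning stage can be sketched in a similar way. The summary only states that a feature residual is added to the frozen image embedding and that images are ranked by their closeness to a "good image" text anchor. The specifics below are assumptions for illustration: a single learnable square matrix `H` producing the residual, cosine similarity as the score, and a pairwise hinge loss with a margin of 0.1.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def adapt(image_emb, H):
    """Add a learned residual v^T H to the frozen embedding v, then re-normalize."""
    d = len(image_emb)
    residual = [sum(image_emb[i] * H[i][j] for i in range(d)) for j in range(d)]
    return normalize([r + x for r, x in zip(residual, image_emb)])

def score(image_emb, anchor_emb, H):
    """Cosine similarity between the adapted image embedding and the text anchor."""
    adapted = adapt(image_emb, H)
    anchor = normalize(anchor_emb)
    return sum(a * b for a, b in zip(adapted, anchor))

def pairwise_rank_loss(better_emb, worse_emb, anchor_emb, H, margin=0.1):
    """Hinge loss: the better-rated image should score at least `margin` higher."""
    return max(0.0, margin
               - score(better_emb, anchor_emb, H)
               + score(worse_emb, anchor_emb, H))
```

With `H` set to all zeros the adapter is the identity, so `score` reduces to the plain similarity between the frozen image embedding and the anchor. In this sketch only `H` would be trained, which matches the idea of tuning a small number of parameters while leaving the pre-trained encoders untouched.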