AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

Yipo Huang, Quan Yuan, Xiangfei Sheng, Zhichao Yang, Haoning Wu, Pengfei Chen, Yuzhe Yang, Leida Li, Weisi Lin
2024-01-16
Abstract: With collective endeavors, multimodal large language models (MLLMs) are undergoing a flourishing development. However, their performances on image aesthetics perception remain indeterminate, which is highly desired in real-world applications. An obvious obstacle lies in the absence of a specific benchmark to evaluate the effectiveness of MLLMs on aesthetic perception. This blind groping may impede the further development of more advanced MLLMs with aesthetic perception capacity. To address this dilemma, we propose AesBench, an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs through elaborate design across dual facets. (1) We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts. (2) We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI). Extensive experimental results underscore that the current MLLMs only possess rudimentary aesthetic perception ability, and there is still a significant gap between MLLMs and humans. We hope this work can inspire the community to engage in deeper explorations on the aesthetic potentials of MLLMs. Source data will be available at
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses the lack of a systematic evaluation of multimodal large language models (MLLMs) on image aesthetics perception. The aesthetic perception capabilities of existing MLLMs have not been fully explored or evaluated, even though this capability matters in practical applications. No specialized benchmark currently exists for measuring the effectiveness of MLLMs on aesthetic perception, which may impede the development of more advanced MLLMs. To resolve this, the authors propose AesBench, an expert-level benchmark that aims to comprehensively evaluate the aesthetic perception capabilities of MLLMs.

### Main problems:

1. **Lack of dedicated benchmarks**: Existing benchmarks mainly target general-purpose language or vision tasks, such as visual question answering and image captioning, and are insufficient for evaluating the highly abstract task of image aesthetics perception.
2. **Unknown performance of MLLMs in aesthetic perception**: Although MLLMs perform well on other tasks, their capabilities in aesthetic perception remain unclear, especially in identifying, empathizing with, evaluating, and explaining aesthetic attributes.

### Solutions:

1. **Construct a high-quality dataset**: The authors construct the Expert-labeled Aesthetics Perception Database (EAPD), which contains 2,800 images from different sources, annotated by professional aesthetic experts.
2. **Propose comprehensive evaluation criteria**: The authors propose a set of criteria for systematically evaluating the aesthetic perception capabilities of MLLMs along four dimensions: Perception (AesP), Empathy (AesE), Assessment (AesA), and Interpretation (AesI).
3. **Extensive experimental verification**: The authors use AesBench to evaluate 15 well-known MLLMs: two authoritative models, GPT-4V and Gemini Pro Vision, and 13 open-source models.

Through these efforts, the authors hope that AesBench will inspire the community to study the aesthetic perception potential of MLLMs in greater depth and promote the development of more advanced MLLMs.