Abstract: With collective endeavors, multimodal large language models (MLLMs) are undergoing a flourishing development. However, their performances on image aesthetics perception remain indeterminate, which is highly desired in real-world applications. An obvious obstacle lies in the absence of a specific benchmark to evaluate the effectiveness of MLLMs on aesthetic perception. This blind groping may impede the further development of more advanced MLLMs with aesthetic perception capacity. To address this dilemma, we propose AesBench, an expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs through elaborate design across dual facets. (1) We construct an Expert-labeled Aesthetics Perception Database (EAPD), which features diversified image contents and high-quality annotations provided by professional aesthetic experts. (2) We propose a set of integrative criteria to measure the aesthetic perception abilities of MLLMs from four perspectives, including Perception (AesP), Empathy (AesE), Assessment (AesA) and Interpretation (AesI). Extensive experimental results underscore that the current MLLMs only possess rudimentary aesthetic perception ability, and there is still a significant gap between MLLMs and humans. We hope this work can inspire the community to engage in deeper explorations on the aesthetic potentials of MLLMs. Source data will be available at

What problem does this paper attempt to address?

This paper addresses the problem of evaluating how well multimodal large language models (MLLMs) perform on image aesthetics perception. The aesthetic perception capabilities of existing MLLMs have not been fully explored or evaluated, even though they matter in practical applications. No specialized benchmark currently measures the effectiveness of MLLMs on aesthetic perception, which may impede the development of more advanced models. To address this, the authors propose AesBench, an expert-level benchmark that aims to comprehensively evaluate the aesthetic perception capabilities of MLLMs.

### Main problems:

1. **Lack of dedicated benchmarks**: Existing benchmarks mainly focus on general-purpose language or visual tasks, such as visual question answering and image captioning, and are insufficient for evaluating the highly abstract task of image aesthetics perception.
2. **Unknown performance of MLLMs in aesthetic perception**: Although MLLMs perform well on other tasks, their capabilities in aesthetic perception remain unclear, especially in perceiving, empathizing with, assessing, and interpreting aesthetic attributes.

### Solutions:

1. **Construct a high-quality dataset**: The authors construct the Expert-labeled Aesthetics Perception Database (EAPD), which contains 2,800 images from different sources, annotated by professional aesthetic experts.
2. **Propose comprehensive evaluation criteria**: The authors propose a set of criteria for systematically evaluating the aesthetic perception capabilities of MLLMs along four dimensions: Perception (AesP), Empathy (AesE), Assessment (AesA), and Interpretation (AesI).
3. **Extensive experimental verification**: The authors use AesBench to evaluate 15 well-known MLLMs, including two authoritative models, GPT-4V and Gemini Pro Vision, as well as 13 open-source models.

Through these efforts, the authors hope that AesBench can inspire the community to study the potential of MLLMs in aesthetic perception in depth and promote the development of more advanced MLLMs.
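To make the idea of scoring a model along the four dimensions more concrete, here is a minimal sketch of how per-dimension results could be tallied. It is not the authors' evaluation code: the summary does not say how each dimension is scored, so this assumes that every question has a single expected answer and that a dimension's score is simply the fraction answered correctly. The function name `per_dimension_accuracy`, the record layout, and the toy records are all hypothetical. Free-form interpretation (AesI) would likely need a different scoring scheme and is treated like the others here only for simplicity.

```python
from collections import defaultdict

# The four dimension labels come from the abstract; everything else is assumed.
DIMENSIONS = ("AesP", "AesE", "AesA", "AesI")


def per_dimension_accuracy(records):
    """Return {dimension: fraction correct} for (dimension, predicted, expected) records.

    Dimensions with no records are omitted from the result.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, expected in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        total[dimension] += 1
        if predicted == expected:
            correct[dimension] += 1
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d] > 0}


# Toy records invented purely for illustration (not benchmark data).
toy_records = [
    ("AesP", "A", "A"),
    ("AesP", "B", "C"),
    ("AesE", "D", "D"),
    ("AesA", "high", "medium"),
]
scores = per_dimension_accuracy(toy_records)
```

Reporting one score per dimension, rather than a single aggregate, keeps it visible when a model does well on one facet of aesthetic perception and poorly on another.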

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception
