Abstract: The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision, moving towards versatile foundational models. However, the evaluation of MLLMs on low-level visual perception and understanding remains largely unexplored. To this end, we design benchmark settings to emulate human language responses related to low-level vision: low-level visual perception (A1), via visual question answering on low-level attributes (e.g. clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, given that pairwise comparison can better avoid ambiguity of responses and has been adopted by many human experiments, we extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, evaluating MLLMs on low-level descriptions of 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e. predicting quality scores, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, which are tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (as humans do). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
Datasets will be available at https://github.com/Q-Future/Q-Bench.
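
The abstract does not spell out the softmax-based approach used for assessment (A3). One plausible reading, sketched below purely as an illustration, is that the model's next-token logits for two opposing rating words (assumed here to be something like "good" and "poor") are passed through a two-way softmax, and the probability of the positive word is taken as the quality score. The function name `softmax_quality_score` and its arguments are hypothetical and are not taken from the benchmark's code.

```python
import math


def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Map two token logits to a score in [0, 1] via a two-way softmax.

    Assumption: `logit_good` and `logit_poor` are the logits a model
    assigns to a positive and a negative rating word at the answer
    position. How these logits are obtained is model-specific and is
    not shown here.
    """
    # Subtract the larger logit before exponentiating for numerical stability.
    m = max(logit_good, logit_poor)
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


if __name__ == "__main__":
    # Hypothetical logits, for illustration only.
    print(softmax_quality_score(2.0, -1.0))
```

Under this reading, a continuous score can be read from any model that exposes token logits, without asking it to produce a number in text, and the resulting scores can then be compared with human opinion scores.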

The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models

DevBench: A multimodal developmental benchmark for language learning

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

An Interpretability Evaluation Benchmark for Pre-trained Language Models

Multimodal Pretraining from Monolingual to Multilingual

BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models

Scalable Performance Analysis for Vision-Language Models

BLiMP: The Benchmark of Linguistic Minimal Pairs for English

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

VLind-Bench: Measuring Language Priors in Large Vision-Language Models

Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models

DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

On Bilingual Lexicon Induction with Large Language Models

Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

MMBench: Is Your Multi-modal Model an All-around Player?

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks