Abstract: The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision towards versatile foundational models. However, the evaluation of MLLMs on low-level visual perception and understanding remains largely unexplored. To this end, we design benchmark settings that emulate human language responses related to low-level vision: low-level visual perception (A1), assessed via visual question answering on low-level attributes (e.g., clarity, lighting); and low-level visual description (A2), which evaluates the low-level text descriptions produced by MLLMs. Furthermore, given that pairwise comparison better avoids ambiguity in responses and has been adopted in many human experiments, we extend both the perception-related question answering and the description evaluations from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, which evaluates MLLMs on low-level descriptions of 499 single images and 450 pairs. Additionally, we evaluate MLLMs on their assessment (A3) ability, i.e., predicting quality scores, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, which are tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (as humans do). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
Datasets will be available at https://github.com/Q-Future/Q-Bench.
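The abstract does not spell out the softmax-based approach used for assessment (A3). As a minimal sketch only, one plausible reading is that the model's next-token logits for two opposing quality words are turned into a score with a two-way softmax. The choice of a "good"/"poor" token pair and the function name `softmax_quality_score` are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Map two token logits to a quality score in [0, 1].

    Assumption: the MLLM is prompted to rate an image and we read the
    logits it assigns to a positive token ("good") and a negative token
    ("poor"). The score is the softmax probability of the positive token.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_good, logit_poor)
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


if __name__ == "__main__":
    # Hypothetical logits, not measurements from any model.
    print(softmax_quality_score(2.0, 0.5))
```

Under this reading, any model that exposes token logits yields a continuous rating, which could then be compared against human opinion scores with a rank or linear correlation.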

Benchmarking Multimodal Models for Ukrainian Language Understanding Across Academic and Cultural Domains

MMBench: Is Your Multi-modal Model an All-around Player?

GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation

UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models

A Survey on Benchmarks of Multimodal Large Language Models

OmniBench: Towards The Future of Universal Omni-Language Models

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

Benchmarking Vision Language Models for Cultural Understanding

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models

From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

OMGEval: an Open Multilingual Generative Evaluation Benchmark for Large Language Models

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

A Survey on Multimodal Benchmarks: In the Era of Large AI Models

Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning