Abstract: The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision towards versatile foundation models. However, the evaluation of MLLMs on low-level visual perception and understanding remains largely unexplored. To this end, we design benchmark settings that emulate human language responses related to low-level vision: low-level visual perception (A1), assessed via visual question answering on low-level attributes (e.g. clarity, lighting); and low-level visual description (A2), which evaluates MLLMs on low-level text descriptions. Furthermore, given that pairwise comparison better avoids ambiguity in responses and has been adopted in many human experiments, we extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, which evaluates MLLMs on low-level descriptions of 499 single images and 450 pairs. Additionally, we evaluate MLLMs on assessment (A3) ability, i.e. predicting quality scores, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations, as humans do. We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs.
Datasets will be available at https://github.com/Q-Future/Q-Bench.
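
The abstract mentions a softmax-based approach for turning MLLM outputs into quantifiable quality ratings but does not spell out the formula. The following is only a minimal sketch of one plausible reading, assuming the rating is the two-way softmax probability of a positive token over a negative token. The token choice ("good" versus "poor"), the function name `softmax_quality_score`, and the input logits are illustrative assumptions, not details confirmed by the text above.

```python
import math


def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Two-way softmax over a (hypothetical) 'good' and 'poor' token logit.

    Assumption: the model exposes the logits of the two candidate tokens
    at the position where it would answer a quality question. The returned
    value lies in (0, 1) and grows as the model favours 'good'.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_good, logit_poor)
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


if __name__ == "__main__":
    # Made-up logits, purely for illustration.
    print(softmax_quality_score(2.0, 0.5))
```

Under this reading, a score near 1 means the model strongly prefers the positive token and a score near 0 the negative one, which gives a continuous rating that can be compared against human opinion scores.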

LocateBench: Evaluating the Locating Ability of Vision Language Models

AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

VLind-Bench: Measuring Language Priors in Large Vision-Language Models

GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks

Good at captioning, bad at counting: Benchmarking GPT-4V on Earth observation data

Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

DiffuSyn Bench: Evaluating Vision-Language Models on Real-World Complexities with Diffusion-Generated Synthetic Benchmarks

JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

MMBench: Is Your Multi-modal Model an All-around Player?

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

@Bench: Benchmarking Vision-Language Models for Human-centered Assistive Technology

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?

DevBench: A multimodal developmental benchmark for language learning