VisualCritic: Making LMMs Perceive Visual Quality Like Humans

Zhipeng Huang, Zhizheng Zhang, Yiting Lu, Zheng-Jun Zha, Zhibo Chen, Baining Guo

2024-03-19

Abstract: At present, large multimodal models (LMMs) have exhibited impressive generalization capabilities in understanding and generating visual signals. However, they still lack sufficient capability to perceive low-level visual quality the way humans do. Can LMMs achieve this and show the same degree of generalization in this regard? If so, not only could the versatility of LMMs be further enhanced, but the challenge of poor cross-dataset performance in the field of visual quality assessment could also be addressed. In this paper, we explore this question and provide the answer "Yes!". As a result of this initial exploration, we present VisualCritic, the first LMM for broad-spectrum image subjective quality assessment. VisualCritic can be used across diverse data right out of the box, without the dataset-specific adaptation that conventional specialist models require. As an instruction-following LMM, VisualCritic offers new capabilities: (1) quantitatively measuring the perceptual quality of given images in terms of their Mean Opinion Score (MOS), noisiness, colorfulness, sharpness, and other numerical indicators; (2) qualitatively evaluating visual quality and providing explainable descriptions; and (3) discerning whether a given image is AI-generated or photographic. Extensive experiments demonstrate the efficacy of VisualCritic by comparing it with other open-source LMMs and conventional specialist models on both AI-generated and photographic images.

Computer Vision and Pattern Recognition
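The abstract expresses perceptual quality as a Mean Opinion Score (MOS). For reference, MOS is conventionally defined as the arithmetic mean of the ratings an image receives from human raters.

The sketch below illustrates that definition. The helper names `mean_opinion_score` and `rescale` and the ratings are hypothetical illustrations, not part of VisualCritic. The linear rescaling step rests on an assumption: that different datasets collect ratings on different scales (for example 1 to 5 versus 0 to 100), so scores must be mapped onto a common range before they can be compared directly.

```python
from statistics import fmean


def mean_opinion_score(ratings: list[float]) -> float:
    """MOS of one image: the arithmetic mean of its individual subjective ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    return fmean(ratings)


def rescale(score: float, lo: float, hi: float,
            new_lo: float = 0.0, new_hi: float = 1.0) -> float:
    """Linearly map a score from the rating scale [lo, hi] onto [new_lo, new_hi]."""
    return new_lo + (score - lo) * (new_hi - new_lo) / (hi - lo)


# Hypothetical ratings of one image by five raters on a 1-5 scale.
ratings = [4, 5, 3, 4, 4]
mos = mean_opinion_score(ratings)
print(mos)                         # 4.0
print(rescale(mos, 1, 5, 0, 100))  # 75.0 on a hypothetical 0-100 scale
```

A linear map like this preserves the ordering of images. It cannot, however, correct for raters on different datasets using their scales differently, which is one reason absolute scores transfer poorly between datasets.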

What problem does this paper attempt to address?

The paper addresses the inability of large multimodal models (LMMs) to perceive low-level visual quality, which prevents them from assessing the subjective quality of images as accurately as humans do. Existing LMMs excel at understanding high-level semantics and generating visual signals. However, there is a significant gap between their judgments and human perception on low-level quality attributes such as brightness, color saturation, contrast, noise, and sharpness. Conventional specialist models perform well on the datasets they are trained on, but they generalize poorly across datasets and require dataset-specific adaptation, which limits their practical use. To address these issues, the paper proposes VisualCritic, the first LMM for broad-spectrum subjective quality assessment of images. VisualCritic provides three capabilities across diverse datasets without per-dataset adaptation:
- quantitative quality scores;
- explainable qualitative assessments;
- authenticity detection, i.e., determining whether an image is AI-generated or photographic.

With this model, the researchers aim to make LMMs more general and versatile in visual quality assessment and to bring them closer to human perceptual capabilities.
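Cross-dataset performance in image quality assessment is commonly reported as two correlations, computed on each test dataset between predicted scores and human MOS:
- the Spearman rank-order correlation coefficient (SRCC);
- the Pearson linear correlation coefficient (PLCC).

The source does not state which metrics VisualCritic's experiments use. The block below is therefore a minimal pure-Python sketch of the two coefficients under that assumption. The data and function names are hypothetical.

SRCC depends only on rankings, so it is unaffected by the predictions and the MOS below being on different scales. In the IQA literature, PLCC is often computed only after fitting a nonlinear (logistic) mapping from predictions to MOS. That fitting step is omitted here.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson linear correlation coefficient (PLCC) between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values receive the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman rank-order correlation (SRCC): the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: human MOS on a 0-100 scale, model predictions on a 1-5 scale.
mos = [72.0, 35.5, 90.1, 50.0, 12.3]
pred = [3.9, 2.4, 4.6, 2.9, 1.2]
print(round(spearman(pred, mos), 3))  # 1.0: both lists order the images identically
print(round(pearson(pred, mos), 3))
```

In practice, `scipy.stats.spearmanr` and `scipy.stats.pearsonr` compute the same coefficients, with ties handled by average ranks as above.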

Related papers

2AFC Prompting of Large Multimodal Models for Image Quality Assessment

LLaVA-Critic: Learning to Evaluate Multimodal Models

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

HumanEval-V: Evaluating Visual Understanding and Reasoning Abilities of Large Multimodal Models Through Coding Tasks

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

Mitigating Perception Bias: A Training-Free Approach to Enhance LMM for Image Quality Assessment

AesBench: An Expert Benchmark for Multimodal Large Language Models on Image Aesthetics Perception

Are We on the Right Way for Evaluating Large Vision-Language Models?

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

What is the Visual Cognition Gap between Humans and Multimodal LLMs?

Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models