Q-Ground: Image Quality Grounding with Large Multi-modality Models

Chaofeng Chen, Sensen Yang, Haoning Wu, Liang Liao, Zicheng Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin

2024-07-24

Abstract: Recent advances in large multi-modality models (LMMs) have greatly improved the ability of image quality assessment (IQA) methods to evaluate and explain the quality of visual content. However, these advancements are mostly focused on overall quality assessment, and the detailed examination of local quality, which is crucial for comprehensive visual understanding, is still largely unexplored. In this work, we introduce Q-Ground, the first framework aimed at tackling fine-scale visual quality grounding by combining large multi-modality models with detailed visual quality analysis. Central to our contribution is the introduction of the QGround-100K dataset, a novel resource containing 100k triplets of (image, quality text, distortion segmentation) to facilitate deep investigations into visual quality. The dataset comprises two parts: one with human-labeled annotations for accurate quality assessment, and another labeled automatically by LMMs such as GPT4V, which helps improve the robustness of model training while also reducing the costs of data collection. With the QGround-100K dataset, we propose an LMM-based method equipped with multi-scale feature learning to learn models capable of performing both image quality answering and distortion segmentation based on text prompts. This dual-capability approach not only refines the model's understanding of region-aware image quality but also enables it to interactively respond to complex, text-based queries about image quality and specific distortions. Q-Ground takes a step towards sophisticated visual quality analysis at a finer scale, establishing a new benchmark for future research in the area. Codes and dataset are available at https://github.com/Q-Future/Q-Ground.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses a gap in Image Quality Assessment (IQA): existing methods mainly focus on overall quality assessment, while the inspection of local quality remains underexplored. Current methods are limited in how well they evaluate and interpret image quality at the level of local distortions and fine-grained analysis. The paper therefore introduces the Visual Quality Grounding task, which aims to fill this gap by combining large multimodal models (LMMs) with detailed visual quality analysis. To achieve this goal, the authors make the following contributions:

1. **Proposed Framework**: The first framework aimed at fine-grained visual quality grounding, leveraging large multimodal models for detailed visual quality analysis.
2. **Dataset Construction**: The QGround-100K dataset, containing 100,000 triplets of (image, quality text, distortion segmentation), with one part annotated by humans and the other part labeled automatically by LMMs, to support in-depth research on visual quality.
3. **Multi-Scale Feature Extractor**: A Multi-Scale Feature Extractor (MSFA) that enhances the model's perception of low-level and mid-level details, supporting both image quality assessment and distortion segmentation.
4. **New Benchmark**: A new benchmark that offers a finer-grained direction for future research in image quality analysis.

Through these contributions, the paper aims to advance the field of image quality assessment, particularly in fine-grained quality and local distortion analysis.
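To make the (image, quality text, distortion segmentation) triplet concrete, here is a minimal Python sketch of how one such sample might be represented. The class name, field names, and integer label convention (0 for "no distortion") are assumptions made for illustration; they are not taken from the QGround-100K release, whose actual file format may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QualityGroundingSample:
    """Hypothetical container for one (image, quality text, distortion segmentation) triplet."""

    image_path: str                    # location of the image file (illustrative)
    quality_text: str                  # free-form description of the image quality
    distortion_mask: List[List[int]]   # per-pixel label; 0 = no distortion (assumed convention)

    def distorted_fraction(self) -> float:
        """Return the share of pixels that carry any non-zero distortion label."""
        total = sum(len(row) for row in self.distortion_mask)
        flagged = sum(1 for row in self.distortion_mask for label in row if label != 0)
        return flagged / total if total else 0.0


# Toy 4x4 mask whose top-left 2x2 block is marked with an assumed label id of 1.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
sample = QualityGroundingSample(
    image_path="example.jpg",
    quality_text="The top-left corner of the image is blurred.",
    distortion_mask=mask,
)
print(sample.distorted_fraction())
```

A per-pixel mask paired with text is what distinguishes quality grounding from overall scoring: the same record supports answering a question about quality and localizing the distortion it mentions.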

Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment

IQAGPT: Image Quality Assessment with Vision-language and ChatGPT Models

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

GroundingGPT: Language Enhanced Multi-modal Grounding Model

McmIQA: Multi-Module Collaborative Model for No-Reference Image Quality Assessment

SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning

When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach

Descriptive Image Quality Assessment in the Wild

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

2AFC Prompting of Large Multimodal Models for Image Quality Assessment

MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs

Towards Open-ended Visual Quality Comparison

Molecular (functional) imaging for radiotherapy applications: an RTOG symposium.

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Learning a No-Reference Quality Assessment Model of Enhanced Images With Big Data

LMM-PCQA: Assisting Point Cloud Quality Assessment with LMM