Abstract: The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task because of diverse video content and complex spatial and temporal distortions, which calls for more advanced methods. Large multimodal models (LMMs), such as GPT-4V, have recently exhibited strong capabilities on various visual understanding tasks, motivating us to leverage the multimodal representation ability of LMMs for the VQA task. We therefore propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem as a question-answering (Q&A) task and construct Q&A prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features that represent the quality characteristics of videos; these features are subsequently mapped into the language space by a spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquiry text tokens are aggregated as inputs to the large language model (LLM), which generates the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of $5\%$ in generalization ability over existing methods. Furthermore, owing to the design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness.
Our code will be released at https://github.com/Sueqk/LMM-VQA.
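
The reformulation of quality regression as a Q&A task can be illustrated with a minimal sketch. It is not the authors' implementation: the five-level scale, the 1 to 5 score range, the `<video>` placeholder, and the prompt wording are all assumptions made for illustration, since the abstract states only that the model generates a quality score and level.

```python
from dataclasses import dataclass

# Hypothetical five-level scale (assumption; the abstract does not list the levels).
LEVELS = ("bad", "poor", "fair", "good", "excellent")


@dataclass
class QAPair:
    question: str
    answer: str


def score_to_level(score: float, lo: float = 1.0, hi: float = 5.0) -> str:
    """Map a score in [lo, hi] to one of the equal-width level bins."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside [{lo}, {hi}]")
    frac = (score - lo) / (hi - lo)
    # Clamp so that score == hi falls into the top bin instead of overflowing.
    idx = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]


def build_qa_pair(score: float, lo: float = 1.0, hi: float = 5.0) -> QAPair:
    """Turn a regression target into an instruction-tuning Q&A pair.

    The "<video>" placeholder and both sentences are hypothetical wording.
    """
    level = score_to_level(score, lo, hi)
    question = "<video>\nHow would you rate the quality of this video?"
    answer = f"The quality of this video is {level}, with a score of {score:.2f}."
    return QAPair(question=question, answer=answer)


if __name__ == "__main__":
    pair = build_qa_pair(4.5)
    print(pair.question)
    print(pair.answer)
```

Under these assumptions, each labeled video contributes one question and one answer string, so the language model is trained to emit the level and score as text rather than through a separate regression head.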

Q-Boost: on Visual Quality Assessment Ability of Low-level Multi-Modality Foundation Models

A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment

Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs

2AFC Prompting of Large Multimodal Models for Image Quality Assessment

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

Mitigating Perception Bias: A Training-Free Approach to Enhance LMM for Image Quality Assessment

Blind Multimodal Quality Assessment of Low-light Images

NoiseBoost: Alleviating Hallucination with Noise Perturbation for Multimodal Large Language Models

Q-Mamba: On First Exploration of Vision Mamba for Image Quality Assessment

PCQA: A Strong Baseline for AIGC Quality Assessment Based on Prompt Condition

Multi-Modal Prompt Learning on Blind Image Quality Assessment

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare

Dog-IQA: Standard-guided Zero-shot MLLM for Mix-grained Image Quality Assessment

Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels

Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining

Q-Ground: Image Quality Grounding with Large Multi-modality Models

Grounding-IQA: Multimodal Language Grounding Model for Image Quality Assessment

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs