LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

Qihang Ge, Wei Sun, Yu Zhang, Yunhao Li, Zhongpeng Ji, Fengyu Sun, Shangling Jui, Xiongkuo Min, Guangtao Zhai

2024-08-26

Abstract: The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task because of diverse video content and complex spatial and temporal distortions, which calls for more advanced methods. Large multimodal models (LMMs), such as GPT-4V, have exhibited strong capabilities on various visual understanding tasks, motivating us to leverage the multimodal representation ability of LMMs to solve the VQA task. We therefore propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem as a question-answering (Q&A) task and construct Q&A prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features that represent the quality characteristics of videos; these features are subsequently mapped into the language space by the spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquiry text tokens are aggregated as inputs to the large language model (LLM), which generates the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of $5\%$ in generalization ability over existing methods. Furthermore, owing to the design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness.
Our code will be released at https://github.com/Sueqk/LMM-VQA.
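
The reformulation of quality regression as a Q&A task, described in the abstract, can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the five level names, the 1 to 5 score range, the equal-width binning, and the names `QAPair`, `score_to_level`, and `build_qa_pairs` are all assumptions made for illustration. The sketch only shows how a single subjective score might be turned into one question about the quality level and one about the quality score.

```python
from dataclasses import dataclass

# Assumed five-level scale; the abstract does not say which levels are used.
LEVELS = ["bad", "poor", "fair", "good", "excellent"]


@dataclass
class QAPair:
    """One instruction-tuning sample: a quality question and its target answer."""
    question: str
    answer: str


def score_to_level(score: float, lo: float = 1.0, hi: float = 5.0) -> str:
    """Map a score in [lo, hi] to one of five equal-width levels (assumed binning)."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside [{lo}, {hi}]")
    frac = (score - lo) / (hi - lo)
    # The top of the range would index past the list, so clamp to the last level.
    idx = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]


def build_qa_pairs(score: float, lo: float = 1.0, hi: float = 5.0) -> list[QAPair]:
    """Build one level question and one score question for a single video."""
    level = score_to_level(score, lo, hi)
    return [
        QAPair(
            question="How would you rate the quality of this video?",
            answer=f"The quality of this video is {level}.",
        ),
        QAPair(
            question="What is the quality score of this video?",
            answer=f"The quality score of this video is {score:.2f}.",
        ),
    ]


if __name__ == "__main__":
    for pair in build_qa_pairs(3.0):
        print(pair.question, "->", pair.answer)
```

In a setup like this, the text of each question would be tokenized and combined with the aligned visual tokens before being passed to the LLM, while the answer string serves as the training target.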

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper addresses blind video quality assessment (BVQA), the no-reference setting of video quality assessment (VQA). It proposes LMM-VQA, a method built on large multimodal models (LMMs), to overcome the limitations of existing methods on complex real-world videos, particularly their weak generalization. Traditional BVQA methods fall into two groups: those that rely on handcrafted features, which are highly interpretable but limited in performance, and data-driven approaches, which perform well on the datasets they are trained on but poorly on out-of-distribution (OOD) data. LMM-VQA combines a spatiotemporally enhanced visual encoder, a spatiotemporal projection module, and a large language model (LLM) to deliver robust video quality assessment, achieving state-of-the-art performance on multiple benchmarks. LMM-VQA also performs well on general video understanding tasks, further validating its effectiveness.

VideoQA in the Era of LLMs: An Empirical Study

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Video-CCAM: Enhancing Video-Language Understanding with Causal Cross-Attention Masks for Short and Long Videos

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

VQA$^2$: Visual Question Answering for Video Quality Assessment

Q-Bench-Video: Benchmarking the Video Quality Understanding of LMMs

Multi-modal Auto-regressive Modeling via Visual Words

VLM-Eval: A General Evaluation on Video Large Language Models

LongVLM: Efficient Long Video Understanding via Large Language Models

Audio-Visual LLM for Video Understanding

Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering

VideoLLM: Modeling Video Sequence with Large Language Models

Video Quality Assessment: A Comprehensive Survey

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

Revisiting Multi-Modal LLM Evaluation

VideoLLM-online: Online Video Large Language Model for Streaming Video