Abstract: The explosive growth of videos on streaming media platforms has underscored the urgent need for effective video quality assessment (VQA) algorithms to monitor and perceptually optimize the quality of streaming videos. However, VQA remains an extremely challenging task because of diverse video content and complex spatial and temporal distortions, which call for more advanced methods. Large multimodal models (LMMs), such as GPT-4V, have recently exhibited strong capabilities on various visual understanding tasks, motivating us to leverage their multimodal representation ability for VQA. We therefore propose the first Large Multi-Modal Video Quality Assessment (LMM-VQA) model, which introduces a novel spatiotemporal visual modeling strategy for quality-aware feature extraction. Specifically, we first reformulate the quality regression problem as a question answering (Q&A) task and construct Q&A prompts for VQA instruction tuning. Then, we design a spatiotemporal vision encoder to extract spatial and temporal features that represent the quality characteristics of videos; these features are subsequently mapped into the language space by a spatiotemporal projector for modality alignment. Finally, the aligned visual tokens and the quality-inquiry text tokens are aggregated as inputs to the large language model (LLM), which generates the quality score and level. Extensive experiments demonstrate that LMM-VQA achieves state-of-the-art performance across five VQA benchmarks, exhibiting an average improvement of $5\%$ in generalization ability over existing methods. Furthermore, owing to the design of the spatiotemporal encoder and projector, LMM-VQA also performs exceptionally well on general video understanding tasks, further validating its effectiveness.
Our code will be released at https://github.com/Sueqk/LMM-VQA.
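
As an illustration of recasting quality regression as a Q&A task, the following minimal sketch turns a mean opinion score into one instruction-tuning sample. It is not the authors' implementation: the five level names, the equal-width binning, the 1 to 5 score range, the `<video>` placeholder, and the prompt wording are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical five-level scale; the abstract does not give level names or thresholds.
LEVELS = ("bad", "poor", "fair", "good", "excellent")


@dataclass
class QASample:
    question: str
    answer: str


def score_to_level(score: float, lo: float = 1.0, hi: float = 5.0) -> str:
    """Map a score in [lo, hi] to one of five equal-width bins (an assumption)."""
    if not lo <= score <= hi:
        raise ValueError(f"score {score} is outside [{lo}, {hi}]")
    frac = (score - lo) / (hi - lo)
    # The top edge of the range falls into the last bin.
    idx = min(int(frac * len(LEVELS)), len(LEVELS) - 1)
    return LEVELS[idx]


def build_qa_sample(mos: float) -> QASample:
    """Build one Q&A pair whose answer states both the level and the score."""
    # "<video>" is a hypothetical placeholder for where visual tokens would go.
    question = "<video> How would you rate the quality of this video?"
    answer = (
        f"The quality of this video is {score_to_level(mos)}, "
        f"with a score of {mos:.2f}."
    )
    return QASample(question, answer)


if __name__ == "__main__":
    sample = build_qa_sample(3.6)
    print(sample.question)
    print(sample.answer)
```

Putting both the level and the numeric score in the answer mirrors the abstract's statement that the LLM generates "the quality score and level"; the exact answer template used in the paper may differ.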

Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems

Dual Path Multi-Modal High-Order Features for Textual Content Based Visual Question Answering

Simple and Effective Visual Question Answering in a Single Modality

Modality attention fusion model with hybrid multi-head self-attention for video understanding

Information Fusion in Visual Question Answering: A Survey

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

Spontaneous regression of orbital Langerhans cell granulomatosis in a three-year-old girl.

Multi-modal adaptive gated mechanism for visual question answering

A Multimodal Sentiment Analysis Approach Based on a Joint Chained Interactive Attention Mechanism

LMM-VQA: Advancing Video Quality Assessment with Large Multimodal Models

Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering.

Multi-stage hybrid embedding fusion network for visual question answering

Multi-Question Learning for Visual Question Answering

Medical visual question answering with symmetric interaction attention and cross-modal gating

Frame Augmented Alternating Attention Network for Video Question Answering.

Two-Stage Multimodality Fusion for High-Performance Text-Based Visual Question Answering.

Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering

Robust video question answering via contrastive cross-modality representation learning

Enhancing visual question answering with a two‐way co‐attention mechanism and integrated multimodal features

The multi-modal fusion in visual question answering: a review of attention mechanisms

Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion