VQA$^2$: Visual Question Answering for Video Quality Assessment

Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min

2024-11-06

Abstract: The advent and proliferation of large multi-modal models (LMMs) have introduced a new paradigm to video-related computer vision fields, including training and inference methods based on visual question answering (VQA). These methods enable models to handle multiple downstream tasks robustly. Video Quality Assessment (VQA), a classic field in low-level visual quality evaluation, originally focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now evolving towards more comprehensive visual quality understanding tasks. Visual question answering has recently brought significant improvements to low-level visual evaluation in the image domain. However, related work is almost nonexistent in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset, the first visual question answering instruction dataset focused entirely on video quality assessment, and, based on it, we propose the VQA2 series models. The VQA2 Instruction Dataset consists of three stages and covers various video types, containing 157,735 instruction question-answer pairs, including both manually annotated and synthetic data. We conduct extensive experiments on both video quality scoring and video quality understanding tasks. Results demonstrate that the VQA2 series models achieve state-of-the-art (SOTA) performance in quality scoring tasks, and their performance in visual quality question answering surpasses the renowned GPT-4o. Additionally, our final model, the VQA2-Assistant, performs well across both scoring and question-answering tasks, validating its versatility.

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses a limitation in the field of video quality assessment (VQA): most existing models can only perform quantitative video quality scoring and lack the ability to understand and analyze video quality in detail. Specifically:

1. **The gap between quantitative scoring and quality understanding**: Although many current models can score the overall quality of a video, they have almost no ability to understand its local visual quality and cannot provide a detailed quality analysis. This limits their functionality, especially when the quality of specific parts of a video must be assessed carefully.
2. **Insufficient functional diversity of video quality assessment models**: Most video quality assessment models apply to only a single type of video (such as user-generated content or streaming media videos) and cannot handle multiple video types. In addition, existing models mainly focus on the quality assessment of static images and are weak at perceiving the temporal and motion quality attributes unique to videos.

To bridge these gaps, the authors propose the VQA2 Instruction Dataset and the VQA2 series of models developed from it. By constructing a large-scale instruction dataset, the authors aim to improve both accurate quantitative quality scoring and fine-grained quality understanding while maintaining functional diversity. The VQA2 series models not only achieve state-of-the-art performance in multiple video quality scoring tasks but also outperform the well-known GPT-4o model in video quality understanding and question-answering tasks, demonstrating their versatility across quality scoring and understanding tasks.
