Abstract: The proliferation of in-the-wild videos has greatly expanded the Video Quality Assessment (VQA) problem. Unlike early definitions that usually focus on limited distortion types, VQA on in-the-wild videos is especially challenging because quality can be affected by complicated factors, including various distortions and diverse contents. Though subjective studies have collected overall quality scores for these videos, how the abstract quality scores relate to specific factors is still obscure, which hinders VQA methods from giving more concrete quality evaluations (e.g. sharpness of a video). To solve this problem, we collect over two million opinions on 4,543 in-the-wild videos across 13 dimensions of quality-related factors, including in-capture authentic distortions (e.g. motion blur, noise, flicker), errors introduced by compression and transmission, and higher-level experiences of semantic contents and aesthetic issues (e.g. composition, camera trajectory), to establish the multi-dimensional Maxwell database. Specifically, we ask the subjects to choose among a positive, a negative, and a neutral label for each dimension. These explanation-level opinions allow us to measure the relationships between specific quality factors and abstract subjective quality ratings, and to benchmark different categories of VQA algorithms on each dimension, so as to analyze their strengths and weaknesses more comprehensively. Furthermore, we propose MaxVQA, a language-prompted VQA approach that modifies the vision-language foundation model CLIP to better capture important quality issues as observed in our analyses. MaxVQA can jointly evaluate various specific quality factors and final quality scores with state-of-the-art accuracy on all dimensions, and with superb generalization ability on existing datasets. Code and data are available at https://github.com/VQAssessment/MaxVQA.
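The abstract says that the three-way opinions make it possible to measure how a specific factor relates to the overall quality rating. The sketch below is one hedged illustration of that idea, not the authors' procedure: it assumes a +1/0/-1 encoding of the positive/neutral/negative choices, averages them per video, and computes a Spearman rank correlation against overall scores. The encoding, the function names, and the toy data are all hypothetical.

```python
from statistics import mean

# Assumed numeric encoding of the three choices (not specified in the abstract).
CHOICE_TO_VALUE = {"positive": 1.0, "neutral": 0.0, "negative": -1.0}


def dimension_score(opinions):
    """Average the encoded opinions one video received on one dimension."""
    return mean(CHOICE_TO_VALUE[o] for o in opinions)


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: opinions on one dimension for four videos,
# and made-up overall quality scores for the same videos.
sharpness_opinions = [
    ["positive", "positive", "neutral"],
    ["negative", "neutral", "negative"],
    ["neutral", "positive", "negative"],
    ["positive", "positive", "positive"],
]
overall_scores = [72.0, 31.0, 55.0, 88.0]

sharpness_scores = [dimension_score(o) for o in sharpness_opinions]
rho = spearman(sharpness_scores, overall_scores)
print(f"Spearman correlation: {rho:.3f}")
```

In this toy example the two rankings agree exactly, so the correlation is 1; real annotations would give a lower value, and the paper's own analysis may aggregate opinions differently.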

WildQA: In-the-Wild Video Question Answering

Video Question Answering: Datasets, Algorithms and Challenges

ActivityNet-QA: A Dataset for Understanding Complex Web Videos Via Question Answering

Video Question Answering: a Survey of Models and Datasets

Learning to Answer Visual Questions from Web Videos

Towards Explainable In-the-Wild Video Quality Assessment: A Database and a Language-Prompted Approach

VQA$^2$: Visual Question Answering for Video Quality Assessment

Just Ask: Learning to Answer Questions from Millions of Narrated Videos

AVQA: A Dataset for Audio-Visual Question Answering on Videos

TVQA: Localized, Compositional Video Question Answering

ChiQA: A Large Scale Image-based Real-World Question Answering Dataset for Multi-Modal Understanding

Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering

Unified Quality Assessment of in-the-Wild Videos with Mixed Datasets Training

AI-VQA: Visual Question Answering based on Agent Interaction with Interpretability

NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario

Recognizing Video Activities in the Wild Via View-to-Scene Joint Learning

Equivariant and Invariant Grounding for Video Question Answering

MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding

VQA: Visual Question Answering

VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs