VQA$^2$: Visual Question Answering for Video Quality Assessment

Ziheng Jia, Zicheng Zhang, Jiaying Qian, Haoning Wu, Wei Sun, Chunyi Li, Xiaohong Liu, Weisi Lin, Guangtao Zhai, Xiongkuo Min
2024-11-06
Abstract: The advent and proliferation of large multi-modal models (LMMs) have introduced a new paradigm to video-related computer vision fields, including training and inference methods based on visual question answering (VQA). These methods enable models to handle multiple downstream tasks robustly. Video Quality Assessment (VQA), a classic field in low-level visual quality evaluation, originally focused on quantitative video quality scoring. However, driven by advances in LMMs, it is now evolving towards more comprehensive visual quality understanding tasks. Visual question answering has recently brought significant improvements to low-level visual evaluation in the image domain. However, related work is almost nonexistent in the video domain, leaving substantial room for improvement. To address this gap, we introduce the VQA2 Instruction Dataset, the first visual question answering instruction dataset focused entirely on video quality assessment, and, based on it, we propose the VQA2 series models. The VQA2 Instruction Dataset consists of three stages and covers various video types, containing 157,735 instruction question-answer pairs, including both manually annotated and synthetic data. We conduct extensive experiments on both video quality scoring and video quality understanding tasks. Results demonstrate that the VQA2 series models achieve state-of-the-art (SOTA) performance in quality scoring tasks, and their performance in visual quality question answering surpasses that of the renowned GPT-4o. Additionally, our final model, the VQA2-Assistant, performs well across both scoring and question-answering tasks, validating its versatility.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses a limitation in the field of video quality assessment (VQA): most existing models can only perform quantitative video quality scoring and lack the ability to understand and analyze video quality in detail. Specifically:

1. **The gap between quantitative scoring and quality understanding**: Although many current models can score the overall quality of a video, they have almost no ability to understand its local visual quality and cannot provide detailed quality analysis. This limits their functionality, especially when the quality of specific parts of a video must be assessed carefully.
2. **Insufficient functional diversity of video quality assessment models**: Most video quality assessment models apply to only a single type of video (such as user-generated content or streaming media videos) and cannot handle multiple video types. In addition, existing models mainly focus on the quality assessment of static images and are weak at perceiving the temporal and motion quality attributes unique to video.

To bridge these gaps, the authors propose the VQA2 Instruction Dataset and the VQA2 series of models developed from it. By constructing a large-scale instruction dataset, the authors aim to improve the models' ability in both accurate quantitative quality scoring and fine-grained quality understanding while maintaining functional diversity. The VQA2 series models not only achieve state-of-the-art performance in multiple video quality scoring tasks but also outperform the well-known GPT-4o model in video quality understanding and question-answering tasks, demonstrating their versatility across scoring and understanding tasks.