COVER: A Comprehensive Video Quality Evaluator

Chenlong He, Qi Zheng, Ruoxi Zhu, Xiaoyang Zeng, Yibo Fan, Zhengzhong Tu
DOI: https://doi.org/10.1109/cvprw63382.2024.00589
2024-01-01
Computer Vision and Pattern Recognition
Abstract: Video quality assessment, especially for user-generated content (UGC) at massive scale, is an essential yet challenging problem in computer vision and video analysis. Prior methods have been shown to mirror subjective human opinion scores effectively; however, they fail to capture the complicated, multi-dimensional factors that affect overall perceptual quality. In this paper, we introduce COVER, a comprehensive video quality evaluator: a novel framework designed to evaluate video quality holistically, from technical, aesthetic, and semantic perspectives. Specifically, COVER uses three parallel branches: (1) a Swin Transformer backbone applied to spatially sampled crops to predict technical quality; (2) a ConvNet applied to subsampled frames to derive aesthetic quality; (3) a CLIP image encoder applied to resized frames to obtain semantic quality. We further propose a simplified cross-gating block that lets the three branches interact before their outputs are fed into the prediction head. The final quality score is a weighted sum of the sub-scores, yielding a multi-faceted metric. Our experimental results demonstrate that COVER outperforms state-of-the-art models on multiple UGC video quality datasets. Moreover, COVER offers a diagnosable quality report that explains the quality score along multiple pillars, and it can process 1080p videos at 3x the speed required for real-time operation. To facilitate future research on efficient and explainable video quality assessment, the code is available at https://github.com/vztu/COVER.
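The final fusion step described in the abstract (a weighted sum of the technical, aesthetic, and semantic sub-scores, reported alongside the overall score) can be illustrated with a minimal sketch. The names `QualityReport` and `fuse_scores` and the default equal weights are assumptions made for illustration only; they are not taken from the COVER code base, and the abstract does not state the actual weights.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    """Hypothetical per-pillar report; field names are illustrative."""
    technical: float
    aesthetic: float
    semantic: float
    overall: float


def fuse_scores(technical: float, aesthetic: float, semantic: float,
                weights: tuple = (1.0, 1.0, 1.0)) -> QualityReport:
    """Combine three sub-scores into one overall score by weighted sum.

    The equal default weights are an assumption; the abstract only says
    that a weighted sum is used.
    """
    w_t, w_a, w_s = weights
    overall = w_t * technical + w_a * aesthetic + w_s * semantic
    # Keep the sub-scores so the overall score remains diagnosable.
    return QualityReport(technical, aesthetic, semantic, overall)


if __name__ == "__main__":
    # Made-up sub-scores, purely to show the call shape.
    report = fuse_scores(0.5, 0.25, 0.25)
    print(report)
```

Returning the sub-scores together with the fused value mirrors the "diagnosable quality report" idea: a low overall score can be traced back to the pillar that caused it.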