CoSTA: Co-training Spatial-Temporal Attention for Blind Video Quality Assessment

Fengchuang Xing, Yuan-Gen Wang, Weixuan Tang, Guopu Zhu, Sam Kwong
DOI: https://doi.org/10.1016/j.eswa.2024.124651
IF: 8.5
2024-01-01
Expert Systems with Applications
Abstract: Self-attention-based Transformers have achieved great success in many computer vision tasks. However, their application to blind video quality assessment (VQA) is far from comprehensive. Evaluating the quality of in-the-wild videos is challenging because neither a pristine reference nor the distortions introduced during shooting are known. This paper presents a Co-trained Space-Time Attention network for the blind VQA problem, termed CoSTA. Specifically, we first build CoSTA by alternately concatenating the divided space-time attention. Then, to facilitate the training of CoSTA, we design a vectorized regression loss by encoding the mean opinion score (MOS) into a probability vector and embedding a special token as the learnable variable of MOS, which better fits the human rating process. Finally, to address the data-hungry nature of the Transformer, we propose to co-train the spatial and temporal attention weights using both images and videos. Various experiments are conducted on the de facto in-the-wild video datasets, including LIVE-Qualcomm, LIVE-VQC, KoNViD-1k, YouTube-UGC, LSVQ, LSVQ-1080p, and DVL2021. Experimental results demonstrate the superiority of the proposed CoSTA over the state-of-the-art. The source code is publicly available at https://github.com/GZHU-DVL/CoSTA.
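The abstract does not spell out how the MOS is encoded into a probability vector, so the following is only a minimal sketch of one plausible scheme, not the authors' implementation. It assumes a 1 to 5 rating scale with five evenly spaced anchors, spreads each MOS over its two nearest anchors by linear interpolation, and scores predictions with a cross-entropy loss. All function names (`encode_mos`, `decode_mos`, `vector_loss`) and parameters are hypothetical.

```python
import math

def anchors(low=1.0, high=5.0, bins=5):
    """Evenly spaced rating anchors (assumed scale, not taken from the paper)."""
    step = (high - low) / (bins - 1)
    return [low + i * step for i in range(bins)]

def encode_mos(mos, low=1.0, high=5.0, bins=5):
    """Spread a scalar MOS over its two nearest anchors by linear interpolation."""
    mos = min(max(mos, low), high)           # clamp to the rating scale
    step = (high - low) / (bins - 1)
    pos = (mos - low) / step                 # fractional anchor index
    lo = min(int(math.floor(pos)), bins - 2)
    frac = pos - lo
    vec = [0.0] * bins
    vec[lo] = 1.0 - frac
    vec[lo + 1] = frac
    return vec

def decode_mos(probs, low=1.0, high=5.0):
    """Recover a scalar score as the expectation over the anchors."""
    return sum(p * a for p, a in zip(probs, anchors(low, high, len(probs))))

def softmax(logits):
    m = max(logits)                          # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def vector_loss(logits, target):
    """Cross-entropy between the predicted distribution and the encoded MOS."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

target = encode_mos(3.4)                     # mass split between anchors 3 and 4
recovered = decode_mos(target)               # expectation returns the original MOS
```

Under these assumptions the encoding is lossless in expectation: decoding the target vector returns the original MOS, while the loss compares whole distributions rather than two scalars.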