Feature Maps Correlation-based Video Quality Assessment
Amir Hossein Bakhtiari, Azadeh Mansouri
DOI: https://doi.org/10.1007/s11042-023-18068-w
IF: 2.577
2024-01-13
Multimedia Tools and Applications
Abstract: Blind video quality assessment (BVQA) techniques aim to assess the perceived quality of a degraded video with no prior knowledge of the reference. Deep learning-based techniques have been applied to this problem in various ways. These methods frequently pool frame-level features to create a video representation and assess quality. The features are conventionally taken from the final convolutional layers of the network, or at times from the mid-layers. Disregarding the details of the frames' appearance, such approaches generally assume that degradations affect the high-level features and general patterns taken from the last layers. Because the correlation between video quality and such features is relatively poor, these methods mostly have to rely on ensemble techniques. In this study, we introduce a novel method to acquire frame-level deep features for assessing video quality. To this end, we use the correlations between the deep feature maps of specific layers of a pre-trained network, or more specifically their similarities, as features for assessing video quality. These deep feature relationships can be expressed by the covariance matrix, i.e., the Gram matrix, which captures the correlation between all feature maps of a specific mid-layer. The structural details of each frame's texture and color, in other words the frame's appearance, are reflected in these relations and correlate significantly with the perceived quality of a given video. In fact, the relations between the extracted feature maps at different granularities can effectively illustrate the influence of various distortions. Experiments on three UGC video quality benchmarks, YouTube-UGC, KoNViD-1k, and LIVE-VQC, yield acceptable results.
The SROCCs obtained with the proposed features extracted from the EfficientNet B4 network show improvements of around 10%, 10%, and 7% on YouTube-UGC, KoNViD-1k, and LIVE-VQC, respectively, compared to typical features from the last convolutional layers (avgpool). Moreover, the average SROCC in 4 out of 6 cross-dataset tests is around 0.22% higher than the state of the art where the SVR is trained on YouTube-UGC or KoNViD-1k. Thus, using the feature map correlations of the mid-layers of a pre-trained network as frame-level features provides better cross-dataset results with the proposed computationally efficient method. The implementation of our method is available at https://github.com/amirh-bakhtiari/FMC-VQA.
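The Gram-matrix feature described in the abstract can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names, the normalization by the number of spatial positions, the use of the upper triangle as the frame descriptor, and the mean pooling over frames are all assumptions made for illustration, and random arrays stand in for the mid-layer feature maps of a pre-trained network.

```python
import numpy as np


def gram_matrix(feature_maps: np.ndarray) -> np.ndarray:
    """Return the (C, C) Gram matrix of a (C, H, W) stack of feature maps.

    Entry (i, j) is the inner product of flattened maps i and j, divided by
    H * W (an assumed normalization, not specified in the abstract).
    """
    c, h, w = feature_maps.shape
    flat = feature_maps.reshape(c, h * w)  # one row per feature map
    return flat @ flat.T / (h * w)


def frame_descriptor(feature_maps: np.ndarray) -> np.ndarray:
    """Flatten the upper triangle of the symmetric Gram matrix (assumed choice)."""
    g = gram_matrix(feature_maps)
    rows, cols = np.triu_indices(g.shape[0])
    return g[rows, cols]


def video_descriptor(per_frame_maps) -> np.ndarray:
    """Average the frame descriptors over time (assumed pooling strategy)."""
    return np.mean([frame_descriptor(f) for f in per_frame_maps], axis=0)


# Hypothetical stand-in data: 5 frames, 4 feature maps of size 6x6 each.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((4, 6, 6)) for _ in range(5)]
descriptor = video_descriptor(frames)
```

For C feature maps the descriptor has C(C+1)/2 entries, so its size depends only on the channel count of the chosen layer and not on the frame resolution. In a pipeline like the one the abstract outlines, such a vector would then be passed to a regressor such as an SVR.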
computer science, information systems, theory & methods, engineering, electrical & electronic, software engineering