Abstract: Traditional metrics for evaluating video quality do not completely capture the nuances of the Human Visual System (HVS), but they are simple to use for quantitatively optimizing parameters in enhancement or restoration. Modern Full-Reference Perceptual Visual Quality Metrics (PVQMs), such as the video multi-method assessment fusion (VMAF) function, are more robust than traditional metrics in terms of the HVS, but they are generally complex and non-differentiable. This lack of differentiability means that they cannot be readily used in optimization scenarios for enhancement or restoration. In this paper we look at the formulation of a perceptually motivated restoration framework for video. We deploy this process in the context of denoising by training a spatio-temporal denoiser deep convolutional neural network (DCNN). We design DCNNs as differentiable proxies for both a spatial and a temporal version of VMAF. These proxies are combined with traditional losses in the proposed perceptually motivated loss function for video, which is used to update the weights of the spatio-temporal denoiser DCNN. Our results show that using the perceptual loss function as a fine-tuning step yields a higher VMAF score and a lower PSNR than the same spatio-temporal network trained with the traditional mean squared error loss. Using the perceptual loss function for the entirety of training yields a lower VMAF and PSNR, but the output has visibly less noise.
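
The combination of proxy scores and traditional losses described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration, not the paper's formulation: the function `perceptual_loss`, the weights `w_mse`, `w_spatial` and `w_temporal`, the convention that a proxy returns a score in [0, 1] with higher meaning better, and the stand-in `toy_proxy`, which is a simple function of the MSE rather than a trained DCNN. In actual training the proxies would be networks evaluated inside an automatic differentiation framework so that gradients reach the denoiser.

```python
from typing import Callable, Sequence

Frame = Sequence[float]  # a flattened frame with pixel values in [0, 1]
Proxy = Callable[[Frame, Frame], float]  # hypothetical: returns a quality score in [0, 1]


def mse(reference: Frame, restored: Frame) -> float:
    """Traditional mean squared error between two frames."""
    return sum((r - d) ** 2 for r, d in zip(reference, restored)) / len(reference)


def toy_proxy(reference: Frame, restored: Frame) -> float:
    """Stand-in for a trained proxy DCNN (assumption): 1.0 means identical frames."""
    return 1.0 - min(1.0, mse(reference, restored))


def perceptual_loss(
    reference: Frame,
    restored: Frame,
    spatial_proxy: Proxy,
    temporal_proxy: Proxy,
    w_mse: float = 1.0,       # illustrative weight, not from the paper
    w_spatial: float = 0.5,   # illustrative weight, not from the paper
    w_temporal: float = 0.5,  # illustrative weight, not from the paper
) -> float:
    """Weighted sum of a traditional loss and two proxy-based penalty terms."""
    # A higher proxy score means better quality, so penalize (1 - score).
    spatial_term = 1.0 - spatial_proxy(reference, restored)
    temporal_term = 1.0 - temporal_proxy(reference, restored)
    return (
        w_mse * mse(reference, restored)
        + w_spatial * spatial_term
        + w_temporal * temporal_term
    )


clean = [0.0, 0.5, 1.0]
noisy = [0.1, 0.4, 1.0]
loss_clean = perceptual_loss(clean, clean, toy_proxy, toy_proxy)
loss_noisy = perceptual_loss(clean, noisy, toy_proxy, toy_proxy)
```

With these stand-ins the loss is zero for a perfect restoration and grows as the restored frame departs from the reference. Swapping `toy_proxy` for separate learned spatial and temporal proxies is what would make the loss perceptually motivated.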

A differentiable VMAF proxy as a loss function for video noise reduction