Abstract: Leveraging the overfitting property of deep neural networks (DNNs) is trending in video delivery systems to enhance quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams for low-resolution (LR) bitstreams, which are used to reconstruct high-resolution (HR) videos at the decoder. Although these approaches show promising results, the huge computational cost of training on a large number of video frames limits their practical application. To overcome this challenge, we propose an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames. To this end, we first present two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features to measure the complexity score of each patch directly. By analyzing the histogram distribution of these features, we then categorize all possible patches into different clusters and select training patches from the cluster with the highest spatial-temporal information. The number of sampled patches adapts to the video content, addressing the trade-off between training complexity and efficiency. Our method reduces the number of patches used for training to between 4% and 25%, depending on the resolution and number of clusters, while maintaining high video quality and significantly enhancing training efficiency. Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.

What problem does this paper attempt to address?

This paper addresses the excessive computational cost of overfitting video super-resolution (SR) models. Existing methods transmit overfitted SR model streams alongside low-resolution (LR) bitstreams so that the decoder can reconstruct high-resolution (HR) videos. However, training on a large number of video frames is computationally expensive, which limits practical use. To address this challenge, the authors propose EPS (Efficient Patch Sampling), an efficient patch sampling method for overfitting video SR networks. EPS scores patches directly with low-complexity spatio-temporal features based on the discrete cosine transform (DCT) and selects the most informative ones, reducing the number of patches required for training while maintaining high-quality video output and significantly improving training efficiency.

### Main problems and solutions

1. **Excessive computational cost**: Existing methods compute PSNR (Peak Signal-to-Noise Ratio) heat maps for all frames, which is time-consuming and requires additional computational resources.
2. **Ignoring spatio-temporal redundancy**: Existing methods sample patches based only on SR quality comparison, without considering the temporal redundancy between frames, which results in unnecessary computational load.

To solve these problems, EPS introduces two low-complexity DCT-based features that evaluate the spatio-temporal complexity of each LR-HR patch pair. By analyzing the histogram distribution of these features, all possible patches are grouped into clusters, and training patches are selected from the cluster with the highest spatio-temporal information.
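The selection step can be illustrated with a minimal sketch. It assumes each patch already has a single scalar complexity score and that clusters are equal-width histogram bins over the score range. Both are assumptions made for illustration, since the summary does not specify the binning rule or how the two features are combined. The function name `select_top_cluster` is hypothetical.

```python
def select_top_cluster(scores, num_clusters):
    """Return indices of patches whose score lies in the highest of
    `num_clusters` equal-width histogram bins (an assumed binning rule)."""
    if num_clusters < 1:
        raise ValueError("num_clusters must be >= 1")
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All patches are equally complex, so there is nothing to separate.
        return list(range(len(scores)))
    width = (hi - lo) / num_clusters
    threshold = hi - width  # lower edge of the top bin
    return [i for i, s in enumerate(scores) if s >= threshold]


# Hypothetical per-patch complexity scores for eight patches.
scores = [0.1, 0.2, 0.15, 0.9, 0.05, 1.0, 0.4, 0.95]
selected = select_top_cluster(scores, num_clusters=4)
fraction = len(selected) / len(scores)
print(selected, fraction)
```

Under this rule the number of selected patches is not fixed in advance: it depends on how many scores land in the top bin, which is one way the sample size could adapt to the video content.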
This method not only reduces the number of patches required for training (to 4% to 25%), but also significantly reduces the overall run time (an 83% reduction compared to the state-of-the-art patch sampling method EMT).

### Formula representation

- **Spatial feature (SF)**:
  \[ SF=\sum_{i = 0}^{w - 1}\sum_{j = 0}^{h - 1}e^{\left(\frac{ij}{wh}\right)^2 - 1}|DCT(i, j)| \]
  where \(w\) and \(h\) are the width and height of the patch, respectively, and \(DCT(i, j)\) is the \((i, j)\)-th DCT component when \(i + j>0\), and 0 otherwise.
- **Temporal feature (TF)**:
  \[ TF_t=\sum_{i = 0}^{w - 1}\sum_{j = 0}^{h - 1}e^{\left(\frac{ij}{wh}\right)^2 - 1}|DCT(i, j)_t - DCT(i, j)_{t - 1}| \]
  where \(t\) indexes the current frame, \(T\) is the total number of frames, and \(I_1, I_2,\cdots, I_T\) are all frames of the given LR video.

Through these improvements, EPS significantly reduces the computational cost and time overhead of training while preserving video quality.
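The two features can be sketched in plain Python as follows. This is an illustrative reading of the formulas rather than the authors' implementation: it assumes an orthonormal 2-D DCT-II, excludes the DC coefficient (\(i = j = 0\)) from both sums, and uses hypothetical helper names (`dct2`, `spatial_feature`, `temporal_feature`). A real pipeline would normally use an optimized DCT routine instead of the nested sums shown here.

```python
import math


def dct2(patch):
    """Orthonormal 2-D DCT-II of a patch given as a list of rows.
    Returns coef[j][i]: j is the vertical, i the horizontal frequency."""
    h, w = len(patch), len(patch[0])

    def basis(n):
        # rows[k][x] = scale(k) * cos(pi * (2x + 1) * k / (2n))
        rows = []
        for k in range(n):
            scale = math.sqrt((1 if k == 0 else 2) / n)
            rows.append([scale * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
                         for x in range(n)])
        return rows

    bh, bw = basis(h), basis(w)
    # Transform along the width first, then along the height.
    tmp = [[sum(bw[i][x] * row[x] for x in range(w)) for i in range(w)]
           for row in patch]
    return [[sum(bh[j][y] * tmp[y][i] for y in range(h)) for i in range(w)]
            for j in range(h)]


def weight(i, j, w, h):
    """Exponential weight exp((ij / (wh))^2 - 1) from the formulas."""
    return math.exp((i * j / (w * h)) ** 2 - 1)


def spatial_feature(coef):
    """SF: weighted sum of absolute DCT coefficients, DC term excluded."""
    h, w = len(coef), len(coef[0])
    return sum(weight(i, j, w, h) * abs(coef[j][i])
               for j in range(h) for i in range(w) if i + j > 0)


def temporal_feature(coef_t, coef_prev):
    """TF_t: weighted sum of absolute coefficient differences between
    co-located patches in frames t and t - 1, DC term excluded."""
    h, w = len(coef_t), len(coef_t[0])
    return sum(weight(i, j, w, h) * abs(coef_t[j][i] - coef_prev[j][i])
               for j in range(h) for i in range(w) if i + j > 0)


# Toy 4x4 patches: a flat patch, a checkerboard, and a brighter checkerboard.
flat = [[5.0] * 4 for _ in range(4)]
checker = [[10.0 * ((x + y) % 2) for x in range(4)] for y in range(4)]
brighter = [[v + 3.0 for v in row] for row in checker]

sf_flat = spatial_feature(dct2(flat))
sf_checker = spatial_feature(dct2(checker))
tf_static = temporal_feature(dct2(checker), dct2(checker))
tf_shift = temporal_feature(dct2(brighter), dct2(checker))
tf_change = temporal_feature(dct2(flat), dct2(checker))
```

In this toy setup the flat patch has a spatial feature of essentially zero while the checkerboard scores high. A pure brightness shift between frames changes only the DC coefficient, so under the DC-exclusion assumption it yields a temporal feature of essentially zero, whereas a change in texture does not.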

EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training

Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting

SamplingAug: on the Importance of Patch Sampling Augmentation for Single Image Super-Resolution.

Practical super-resolution from dynamic video sequences

Improved Low-Bitrate HEVC Video Coding Using Deep Learning Based Super-Resolution and Adaptive Block Patching.

Efficient Meta-Tuning for Content-Aware Neural Video Delivery.

Data Overfitting for On-Device Super-Resolution with Dynamic Algorithm and Compiler Co-Design

PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution

Video Super-Resolution Reconstruction Based on Deep Learning and Spatio-Temporal Feature Self-similarity

Enhanced Video Super-Resolution Network Towards Compressed Data

Perceptually Optimized Super Resolution

Video Super-Resolution Via a Spatio-Temporal Alignment Network.

LENS: Bandwidth-efficient video analytics with adaptive super resolution

Accelerating the Training of Video Super-Resolution Models

Learning for Unconstrained Space-Time Video Super-Resolution

Learning Frequency-aware Dynamic Network for Efficient Super-Resolution.

Deep Parametric 3D Filters for Joint Video Denoising and Illumination Enhancement in Video Super Resolution.

Low-Cost Video Super-Resolution Assisted by Event Signals

Image super-resolution via sparse representation.

Real-Time Video Super-Resolution with Spatio-Temporal Modeling and Redundancy-Aware Inference

Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution