Abstract: Leveraging the overfitting property of deep neural networks (DNNs) is trending in video delivery systems to enhance quality within bandwidth limits. Existing approaches transmit overfitted super-resolution (SR) model streams for low-resolution (LR) bitstreams, which are used to reconstruct high-resolution (HR) videos at the decoder. Although these approaches show promising results, the huge computational costs of training a large number of video frames limit their practical applications. To overcome this challenge, we propose an efficient patch sampling method named EPS for video SR network overfitting, which identifies the most valuable training patches from video frames. To this end, we first present two low-complexity Discrete Cosine Transform (DCT)-based spatial-temporal features to measure the complexity score of each patch directly. By analyzing the histogram distribution of these features, we then categorize all possible patches into different clusters and select training patches from the cluster with the highest spatial-temporal information. The number of sampled patches is adaptive based on the video content, addressing the trade-off between training complexity and efficiency. Our method reduces the number of patches for the training to 4% to 25%, depending on the resolution and number of clusters, while maintaining high video quality and significantly enhancing training efficiency. Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
### What problem does this paper attempt to solve?
This paper addresses the excessive computational cost of training video super-resolution (SR) models. Existing methods enhance low-resolution (LR) bitstreams by transmitting overfitted SR model streams alongside them, which the decoder uses to reconstruct high-resolution (HR) videos. However, overfitting a model on a large number of video frames is computationally expensive, which limits the practical use of these methods.
To address this challenge, the authors propose an efficient patch sampling method, called EPS (Efficient Patch Sampling), for overfitting video SR networks. EPS selects the most informative training patches using low-complexity spatio-temporal features based on the discrete cosine transform (DCT), thereby reducing the number of patches required for training while maintaining high-quality video output and significantly improving training efficiency.
### Main problems and solutions
1. **Excessive computational cost**: Existing methods need to compute PSNR (Peak Signal-to-Noise Ratio) heat maps for all frames, which is time-consuming and requires additional computational resources.
2. **Ignoring spatio-temporal redundancy**: Existing methods sample patches based only on SR quality comparison, without considering the temporal redundancy between frames, resulting in unnecessary computational load.
To solve these problems, EPS introduces two low-complexity DCT-based features to evaluate the spatio-temporal complexity of each LR-HR patch pair. By analyzing the histogram distribution of these features, all possible patches are grouped into clusters, and training patches are selected from the cluster with the highest spatio-temporal information. This reduces the training set to between 4% and 25% of all patches, depending on the resolution and the number of clusters, and cuts the overall run time by 83% compared to the state-of-the-art patch sampling method EMT.
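The selection step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the summary does not specify how the histogram is binned or how the two features are combined, so the sketch assumes one precomputed complexity score per patch and equal-width bins over the score range. The function name `select_top_cluster` and the parameter `num_clusters` are hypothetical.

```python
def select_top_cluster(scores, num_clusters=4):
    """Return indices of patches whose score lies in the highest histogram bin.

    Assumption: clusters are equal-width bins over [min(scores), max(scores)].
    The exact clustering rule used by EPS is not given in this summary.
    """
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Every patch has the same score, so there is only one cluster.
        return list(range(len(scores)))
    width = (hi - lo) / num_clusters
    threshold = hi - width  # lower edge of the top bin
    return [idx for idx, s in enumerate(scores) if s >= threshold]


if __name__ == "__main__":
    # Hypothetical per-patch complexity scores.
    patch_scores = [1, 2, 9, 10, 5, 8]
    print(select_top_cluster(patch_scores, num_clusters=3))
```

Because the threshold depends on the score distribution, the number of selected patches varies with the content, which matches the adaptive behaviour described in the abstract. Using more clusters narrows the top bin and typically selects fewer patches.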
### Formula representation
- **Spatial feature (SF)**:
\[
SF=\sum_{i = 0}^{w - 1}\sum_{j = 0}^{h - 1}e^{\left(\frac{ij}{wh}\right)^2 - 1}|DCT(i, j)|
\]
where \(w\) and \(h\) are the width and height of the patch respectively, and \(DCT(i, j)\) is the \((i, j)\)-th DCT component when \(i + j > 0\) and 0 otherwise, so the DC term is excluded.
- **Temporal feature (TF)**:
\[
TF_t=\sum_{i = 0}^{w - 1}\sum_{j = 0}^{h - 1}e^{\left(\frac{ij}{wh}\right)^2 - 1}|DCT(i, j)_t - DCT(i, j)_{t - 1}|
\]
where \(t\) indexes the current frame, \(T\) is the total number of frames, and \(I_1, I_2,\cdots, I_T\) are all frames of the given LR video; \(DCT(i, j)_t\) is the DCT component of the patch in frame \(I_t\).
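As a rough illustration of the two formulas, the following Python sketch computes SF and TF for small patches using only the standard library. The function names (`dct_2d`, `spatial_feature`, `temporal_feature`), the orthonormal DCT-II normalization, and the choice to skip the DC coefficient in both features are assumptions made for this example, not details confirmed by the paper. A practical implementation would use a fast, vectorized DCT instead of the naive loops shown here.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence (naive O(n^2) version)."""
    n = len(values)
    out = []
    for u in range(n):
        scale = math.sqrt((1.0 if u == 0 else 2.0) / n)
        acc = sum(v * math.cos(math.pi * (2 * k + 1) * u / (2 * n))
                  for k, v in enumerate(values))
        out.append(scale * acc)
    return out


def dct_2d(patch):
    """Separable 2-D DCT-II: transform rows first, then columns.

    patch[y][x] is a pixel value; the result is indexed coeffs[j][i],
    with i the horizontal and j the vertical frequency index.
    """
    rows = [dct_1d(row) for row in patch]
    cols = [dct_1d(col) for col in zip(*rows)]  # cols[i][j]
    return [list(r) for r in zip(*cols)]        # back to [j][i]


def weight(i, j, w, h):
    """Exponential weight taken from the SF/TF formulas."""
    return math.exp((i * j / (w * h)) ** 2 - 1)


def spatial_feature(patch):
    """SF: weighted sum of absolute DCT coefficients (DC skipped, an assumption)."""
    h, w = len(patch), len(patch[0])
    coeffs = dct_2d(patch)
    return sum(weight(i, j, w, h) * abs(coeffs[j][i])
               for j in range(h) for i in range(w) if i + j > 0)


def temporal_feature(patch_t, patch_prev):
    """TF_t: weighted sum of absolute DCT differences between frames t and t-1."""
    h, w = len(patch_t), len(patch_t[0])
    cur, prev = dct_2d(patch_t), dct_2d(patch_prev)
    return sum(weight(i, j, w, h) * abs(cur[j][i] - prev[j][i])
               for j in range(h) for i in range(w) if i + j > 0)


if __name__ == "__main__":
    flat = [[128.0] * 4 for _ in range(4)]
    checker = [[255.0 * ((x + y) % 2) for x in range(4)] for y in range(4)]
    print("SF flat:", spatial_feature(flat))
    print("SF checker:", spatial_feature(checker))
    print("TF flat -> checker:", temporal_feature(checker, flat))
```

In this sketch a flat patch yields an SF close to zero, because all of its energy sits in the skipped DC coefficient, while a textured patch yields a clearly positive value. Two identical co-located patches give a TF of zero, which is how the feature captures temporal redundancy between frames.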
Through these improvements, EPS can significantly reduce the computational cost and time overhead during the training process while ensuring video quality.