Abstract: With the development of VR-related techniques, viewers can enjoy a realistic and immersive experience through a head-mounted display, while omnidirectional video with a low frame rate can lead to user dizziness. However, the prevailing plane frame interpolation methodologies are unsuitable for Omnidirectional Video Interpolation, chiefly due to the lack of models tailored to such videos with strong distortion, compounded by the scarcity of valuable datasets for Omnidirectional Video Frame Interpolation. In this paper, we introduce a benchmark dataset, 360VFI, for Omnidirectional Video Frame Interpolation. We present a practical implementation that introduces a distortion prior from omnidirectional video into the network to modulate distortions. In particular, we propose a pyramid distortion-sensitive feature extractor that uses the unique characteristics of the equirectangular projection (ERP) format as prior information. Moreover, we devise a decoder that uses an affine transformation to further facilitate the synthesis of intermediate frames. 360VFI is the first dataset and benchmark that explores the challenge of Omnidirectional Video Frame Interpolation. In our benchmark analysis, we present scenes under four different distortion conditions in the proposed 360VFI dataset to evaluate the challenge triggered by distortion during interpolation. In addition, experimental results demonstrate that Omnidirectional Video Interpolation can be effectively improved by modeling omnidirectional distortion.
What problem does this paper attempt to address?
This paper addresses low frame rates in omnidirectional video through Omnidirectional Video Frame Interpolation (Omni-VFI), with particular attention to the image distortion introduced by spherical projections such as equirectangular projection (ERP). Specifically:
1. **User dizziness caused by low frame rate**:
- Omnidirectional videos (ODV) often have a low frame rate, which can cause dizziness when they are watched through a head-mounted display (HMD).
2. **Limitations of existing planar video frame interpolation methods**:
- Existing planar video frame interpolation methods cannot be applied directly to omnidirectional videos because they do not account for the distortion characteristics unique to such videos, especially the latitude-dependent distortion of the ERP format.
3. **Lack of specialized datasets and benchmarks**:
- There are currently no datasets or benchmarks dedicated to omnidirectional video frame interpolation, which makes it difficult to evaluate and improve related algorithms.
To address these problems, the authors propose a new dataset and benchmark, 360VFI, intended to support research on omnidirectional video frame interpolation. Specific contributions include:
- **The first omnidirectional video frame interpolation dataset, 360VFI**, covering a variety of motion and content scenarios.
- **A new benchmark protocol**, which evaluates omnidirectional video frame interpolation under different vertical motion conditions.
- **A network architecture, 360VFI Net**, which improves interpolation quality by introducing a distortion prior, especially in large-motion scenarios.
### Mathematical formula summary
1. **ERP distortion formula**:
The distortion degree of ERP is determined by latitude and can be expressed by the following formula:
\[
K_{\text{ERP}}(x, y)=\frac{\delta S}{\delta P}=\cos(y)
\]
where \(\delta S\) is an area element on the sphere, \(\delta P\) is the corresponding area element on the projection plane, \(x\in(-\pi,\pi)\) is the longitude, and \(y\in(-\frac{\pi}{2},\frac{\pi}{2})\) is the latitude.
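As a brief justification of the cosine factor (a standard spherical-geometry argument assuming a unit sphere, not quoted from the paper): an infinitesimal patch spanning longitude \(dx\) and latitude \(dy\) has spherical area \(\cos(y)\,dx\,dy\), while ERP maps longitude and latitude linearly to plane coordinates, so
\[
\delta S=\cos(y)\,dx\,dy,\qquad \delta P=dx\,dy \quad\Longrightarrow\quad \frac{\delta S}{\delta P}=\cos(y).
\]
The ratio equals 1 at the equator (\(y=0\)) and tends to 0 toward the poles, which is why ERP content is stretched most strongly at high latitudes.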
2. **Calculation of distortion condition map**:
For the input \(X\in\mathbb{R}^{C\times M\times N}\), the distortion condition map \(C_d\in\mathbb{R}^{1\times M\times N}\) can be calculated by the following formula:
\[
C_d = \cos\left(\frac{m + 0.5-\frac{M}{2}}{M}\pi\right)
\]
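where \(m\in\{0,\dots,M-1\}\) is the row index of a pixel and \(M\) is the frame height; the \(0.5\) offset places each sample at the pixel centre, and the value is the same for all \(N\) columns of a row.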
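The formula for \(C_d\) can be evaluated directly. The following is a minimal NumPy sketch, assuming zero-based row indices; the function name `distortion_condition_map` is illustrative and does not come from the paper.

```python
import numpy as np

def distortion_condition_map(height: int, width: int) -> np.ndarray:
    """Return C_d with shape (1, height, width): cos(latitude) at each pixel-row centre."""
    rows = np.arange(height, dtype=np.float64)            # m = 0 .. M-1 (assumed zero-based)
    latitude = (rows + 0.5 - height / 2.0) / height * np.pi
    weights = np.cos(latitude)                            # one value per row
    # Every column in a row shares the same latitude, so repeat along the width.
    return np.broadcast_to(weights[None, :, None], (1, height, width)).copy()

c_d = distortion_condition_map(4, 8)
# Rows sit at latitudes -67.5, -22.5, 22.5 and 67.5 degrees, so the row values
# are approximately 0.383, 0.924, 0.924, 0.383.
print(c_d[0, :, 0])
```

The map is symmetric about the equator and never reaches exactly 0, because pixel centres never lie on a pole.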
3. **Feature extraction formula of DistortionGuard module**:
Deformable convolution layers, with sampling offsets predicted from the distortion condition map, extract less-distorted features from the input feature maps:
\[
\tilde{\phi}_0^l,\tilde{\phi}_1^l=H_{\text{DCN}}(\tilde{\phi}_0,\tilde{\phi}_1,H_{\text{offset}}(C_d))
\]
where \(H_{\text{DCN}}(\cdot)\) denotes a standard deformable convolution layer and \(H_{\text{offset}}(\cdot)\) produces its sampling offsets from \(C_d\).
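To make the role of the offsets concrete, here is a simplified single-channel sketch of a 3×3 deformable convolution in NumPy. It is not the paper's implementation: in the paper \(H_{\text{offset}}\) is part of the network, whereas `offsets_from_distortion` below is a hypothetical hand-crafted stand-in that widens the horizontal taps by \(1/\cos(\text{latitude})\). The sketch also uses zero padding and ignores the horizontal wrap-around of ERP frames.

```python
import numpy as np

TAPS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def bilinear_sample(img, ys, xs):
    """Bilinearly sample a 2-D array at fractional (ys, xs); zero outside the image."""
    h, w = img.shape
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    wy, wx = ys - y0, xs - x0

    def at(yy, xx):
        inside = (yy >= 0) & (yy < h) & (xx >= 0) & (xx < w)
        return np.where(inside, img[np.clip(yy, 0, h - 1), np.clip(xx, 0, w - 1)], 0.0)

    return ((1 - wy) * (1 - wx) * at(y0, x0) + (1 - wy) * wx * at(y0, x0 + 1)
            + wy * (1 - wx) * at(y0 + 1, x0) + wy * wx * at(y0 + 1, x0 + 1))

def deformable_conv3x3(img, kernel, offsets):
    """offsets has shape (9, 2, H, W): a (dy, dx) shift per kernel tap and pixel."""
    h, w = img.shape
    base_y, base_x = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    out = np.zeros((h, w))
    for k, (dy, dx) in enumerate(TAPS):
        ys = base_y + dy + offsets[k, 0]
        xs = base_x + dx + offsets[k, 1]
        out += kernel[dy + 1, dx + 1] * bilinear_sample(img, ys, xs)
    return out

def offsets_from_distortion(c_d):
    """Hypothetical stand-in for H_offset: stretch horizontal taps by 1 / cos(latitude)."""
    h, w = c_d.shape
    stretch = np.clip(1.0 / c_d, 1.0, 4.0) - 1.0   # extra horizontal reach, capped
    offsets = np.zeros((9, 2, h, w))
    for k, (_, dx) in enumerate(TAPS):
        offsets[k, 1] = dx * stretch
    return offsets

h, w = 6, 8
rows = np.arange(h)
c_d = np.cos((rows + 0.5 - h / 2) / h * np.pi)[:, None] * np.ones((1, w))
img = np.random.default_rng(0).random((h, w))
kernel = np.full((3, 3), 1.0 / 9.0)
plain = deformable_conv3x3(img, kernel, np.zeros((9, 2, h, w)))    # ordinary 3x3 filter
guided = deformable_conv3x3(img, kernel, offsets_from_distortion(c_d))
```

With all offsets set to zero the operation reduces to an ordinary 3×3 filter, which is a convenient sanity check; non-zero offsets let rows near the poles gather samples from a wider horizontal neighbourhood.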
4. **Affine transformation formula of OmniFTB module**:
In the intermediate feature reconstruction stage, the distortion condition map drives an affine transformation of the features:
\[
\omega = M_{\theta'}(\phi)
\]
where \(M_{\theta'}\) is a mapping function based on the ERP distortion condition map.
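The summary does not spell out the form of \(M_{\theta'}\). One common way to realize a condition-driven affine transformation is a per-pixel scale and shift predicted from the condition map. The sketch below assumes that form; the function `affine_modulate` and its `w_*`/`b_*` parameters are hypothetical stand-ins for learned weights, not the paper's OmniFTB module.

```python
import numpy as np

def affine_modulate(features, c_d, w_scale, b_scale, w_shift, b_shift):
    """Scale and shift features using parameters derived from the condition map.

    features: (C, M, N); c_d: (1, M, N); the w_*/b_* arrays have shape (C, 1, 1)
    and play the role of learned 1x1-convolution weights (an assumption).
    """
    gamma = w_scale * c_d + b_scale    # per-channel, per-pixel scale
    beta = w_shift * c_d + b_shift     # per-channel, per-pixel shift
    return gamma * features + beta

c, m, n = 3, 4, 6
rows = np.arange(m)
c_d = np.cos((rows + 0.5 - m / 2) / m * np.pi)[None, :, None] * np.ones((1, m, n))
features = np.ones((c, m, n))
ones, zeros = np.ones((c, 1, 1)), np.zeros((c, 1, 1))

# With scale = C_d and no shift, features are attenuated toward the poles.
out = affine_modulate(features, c_d, w_scale=ones, b_scale=zeros,
                      w_shift=zeros, b_shift=zeros)
```

Setting `w_scale` and `w_shift` to zero with `b_scale` equal to one leaves the features unchanged, so the modulation can fall back to an identity mapping.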
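With these components, the authors address the main obstacles to omnidirectional video frame interpolation identified above, and their experimental results indicate that explicitly modeling omnidirectional distortion improves interpolation quality. The dataset and benchmark also provide a basis for future research.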