A single convolutional neural network for joint super-resolution, gamut extension, and inverse tone-mapping
Wenyao Gan, Hensheng Zhang, Li Chen, Rong Xie, Li Song
2020-01-01
Abstract: With the rapid development of display technology in recent years, ultra-high definition (UHD) high dynamic range (HDR) displays have emerged in consumer markets. However, because UHD HDR video content is scarce, legacy high definition (HD) videos with standard dynamic range (SDR) need to be converted to UHD HDR versions. In this paper, we first introduce a workflow to down-convert existing UHD HDR videos to their HD SDR versions and then propose a joint super-resolution, gamut extension, and inverse tone-mapping network (JSGIN), which directly learns the up-conversion from HD SDR videos to their UHD HDR versions. Our JSGIN enhances the visual experience by reconstructing lost information and achieves better subjective visual quality with fewer artifacts than recent state-of-the-art methods.

INTRODUCTION

Display technology has developed rapidly in recent years, and UHD HDR displays have become available to consumers. Nevertheless, because UHD HDR video content is scarce, legacy HD SDR videos need to be up-converted to UHD HDR videos. Compared with current HD SDR television systems ‘(1)’, UHD television systems ‘(2)’ provide higher spatial resolution and a wider color gamut, and HDR television systems ‘(3)’ provide a higher dynamic range.

Super-resolution (SR) methods up-scale low-resolution images to high-resolution images. Recent methods based on convolutional neural networks (CNNs) have achieved considerable improvements over conventional SR methods. SRCNN ‘Dong et al (4)’ was the first CNN-based SR method. The CNN architecture was subsequently improved by techniques such as sub-pixel convolution ‘Shi et al (5)’ and modified residual blocks ‘Lim et al (6)’.

Gamut extension (GE) algorithms extend colors from a source gamut to a wider destination gamut. Linear color space conversion cannot restore color information outside the source gamut.
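To illustrate why a linear conversion alone does not extend the gamut, the following is a minimal sketch rather than part of the proposed method. It assumes linear-light RGB values and uses the BT.709 to BT.2020 conversion matrix as given in Recommendation ITU-R BT.2087 (rounded to four decimals); the function name `bt709_to_bt2020_linear` is hypothetical.

```python
# Linear-light BT.709 -> BT.2020 RGB conversion matrix
# (assumption: values as listed in Recommendation ITU-R BT.2087).
M_709_TO_2020 = [
    [0.6274, 0.3293, 0.0433],
    [0.0691, 0.9195, 0.0114],
    [0.0164, 0.0880, 0.8956],
]


def bt709_to_bt2020_linear(rgb):
    """Convert one linear-light BT.709 RGB triple to BT.2020 RGB."""
    return tuple(sum(m * c for m, c in zip(row, rgb)) for row in M_709_TO_2020)


# The most saturated red that BT.709 can represent.
red_709 = (1.0, 0.0, 0.0)
red_2020 = bt709_to_bt2020_linear(red_709)

# All three BT.2020 components are non-zero, so the converted color lies
# strictly inside the BT.2020 gamut and never reaches its red primary (1, 0, 0).
print(red_2020)
```

Every matrix entry is non-negative and each row sums to one, so any in-gamut BT.709 color maps to a point inside the sub-volume of BT.2020 that BT.709 already covers. Colors beyond that sub-volume have to be estimated, which is the role of a GE algorithm.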
Conventional GE algorithms attempt to make full use of the wider destination gamut. Recently, ‘Takeuchi et al (7)’ proposed a CNN-based GE algorithm that achieves significant gains over conventional GE algorithms.

Inverse tone-mapping (ITM) methods expand SDR images to HDR images. Whereas conventional ITM methods focus only on mapping the dynamic range, CNN-based ITM methods can also restore lost details in highlights and shadows. ‘Eilertsen et al (8)’ introduced a deep learning system that reconstructs an HDR image from a single-exposure SDR image.

UHD HDR videos can be reconstructed from HD SDR videos by cascading SR, GE, and ITM methods. However, errors from each conversion stage can propagate to the next and accumulate, so a cascade tends to be less accurate and more complex overall than joint learning of SR, GE, and ITM. ‘Kim and Kim (9)’ were the first to propose a multi-purpose CNN structure that jointly learns SR, GE, and ITM to directly up-convert HD SDR videos to UHD HDR videos. Deep SR-ITM ‘Kim et al (10)’ then improved on ‘(9)’ by introducing input decomposition methods and modulation blocks.

ResNet ‘He et al (11)’ introduced local residual learning to ease the training of deep CNNs. Global residual learning was first adopted in SR by VDSR ‘Kim et al (12)’ to facilitate training convergence for a deep CNN. Our method adopts both local and global residual learning.

In this paper, we first introduce a workflow to down-convert existing UHD HDR videos to their HD SDR versions. Then, we propose a single CNN that jointly learns SR, GE, and ITM and can directly up-convert HD SDR videos to their UHD HDR versions. Compared with recent state-of-the-art methods ‘(9) (10)’, UHD HDR videos generated by our method provide a better visual experience.

METHODOLOGY

To train our network, both UHD HDR videos and their HD SDR versions are required.
In this paper, the UHD HDR videos collected by ‘(10)’ are used as ground truth. They have 4K resolution (3840 × 2160) and a bit depth of 10, and their opto-electronic transfer function (OETF) is Perceptual Quantization (PQ). Unlike ‘(10)’, where YouTube's automatic conversion process is used to convert HDR videos to their SDR versions, we introduce a workflow to down-convert the UHD HDR videos to their HD SDR versions.

Down-conversion For Creating Our Dataset

Figure 1 shows the workflow of down-conversion from UHD HDR videos to their HD SDR versions.

In the first step, the digitally represented luminance and color-difference signals $[D'_{Y,2020}, D'_{CB,2020}, D'_{CR,2020}]$, which have a bit depth of 10 bits, are inverse-quantized to the normalized luminance and color-difference signals $[E'_{Y,2020}, E'_{CB,2020}, E'_{CR,2020}]$ according to Recommendation ITU-R BT.2020 ‘(2)’ as follows:

$$
\begin{aligned}
E'_{Y,2020} &= \left(D'_{Y,2020}/4 - 16\right)/219, \\
E'_{CB,2020} &= \left(D'_{CB,2020}/4 - 128\right)/224, \\
E'_{CR,2020} &= \left(D'_{CR,2020}/4 - 128\right)/224.
\end{aligned}
$$

In the second step, the luminance and color-difference signals $[E'_{Y,2020}, E'_{CB,2020}, E'_{CR,2020}]$ are converted to the RGB color signals $[E'_{R,2020}, E'_{G,2020}, E'_{B,2020}]$ according to Recommendation ITU-R BT.2020 ‘(2)’ as follows: