NN-VVC: Versatile Video Coding boosted by self-supervisedly learned image coding for machines

Jukka I. Ahonen, Nam Le, Honglei Zhang, Antti Hallapuro, Francesco Cricri, Hamed Rezazadegan Tavakoli, Miska M. Hannuksela, Esa Rahtu
2024-01-19
Abstract: The recent progress in artificial intelligence has led to an ever-increasing usage of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, compression, storage and transmission of media have traditionally been designed considering human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two almost orthogonal directions. The first is represented by end-to-end (E2E) learned codecs which, while offering high performance on image coding, are not yet on par with state-of-the-art conventional video codecs and lack interoperability. The second direction considers using the Versatile Video Coding (VVC) standard or any other conventional video codec (CVC) together with pre- and post-processing operations targeting machine analysis. While the CVC-based methods benefit from interoperability and broad hardware and software support, the machine task performance is often lower than the desired level, particularly in low bitrates. This paper proposes a hybrid codec for machines called NN-VVC, which combines the advantages of an E2E-learned image codec and a CVC to achieve high performance in both image and video coding for machines. Our experiments show that the proposed system achieved up to -43.20% and -26.8% Bjøntegaard Delta rate reduction over VVC for image and video data, respectively, when evaluated on multiple different datasets and machine vision tasks. To the best of our knowledge, this is the first research paper showing a hybrid video codec that outperforms VVC on multiple datasets and multiple machine vision tasks.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses how to compress image and video data more efficiently when the content is consumed by machine vision tasks rather than human viewers, so as to save bandwidth while preserving task performance. It points out that although conventional video coding standards (such as HEVC or VVC) perform very well for human viewing, they are not well suited to machine vision tasks. At low bitrates in particular, machine task performance often falls below the desired level. The paper therefore proposes a hybrid codec, NN-VVC, which combines the advantages of an end-to-end learned image codec (LIC) and a conventional video codec (CVC) to improve image and video coding for machine vision tasks.

### Key problems solved in the paper:

1. **Limitations of conventional codecs**: Conventional video codecs (such as VVC) are optimized mainly for human viewing and fall short on machine vision tasks, especially at low bitrates.
2. **Limitations of end-to-end learned codecs**: Although the end-to-end learned image codec (LIC) performs very well on image coding, it is not yet on par with conventional codecs on video coding and lacks interoperability.
3. **Combining the advantages of both**: By combining LIC and CVC, the hybrid codec NN-VVC targets higher coding efficiency and better machine task performance.

### Main contributions of the paper:

- **Proposing the NN-VVC system**: The system uses the self-supervised learned image codec (LIC) for intra-frame coding, uses conventional video coding tools (CVC) for inter-frame coding, and provides a fallback mode when necessary.
- **Introducing adapters**: The intra human adapter (IHA) and the inter machine adapter (IMA) enhance, respectively, the intra frames reconstructed by LIC and the inter frames reconstructed by CVC.
- **Experimental verification**: Experiments on multiple datasets and machine vision tasks show that NN-VVC codes image and video data more efficiently than VVC, achieving up to 43.20% and 26.8% Bjøntegaard Delta rate reductions, respectively.

### Specific technical details:

- **LIC (Learned Image Codec)**: A self-supervised learned image codec based on a convolutional neural network (CNN), used for intra-frame coding.
- **IHA (Intra Human Adapter)**: Processes the intra frames reconstructed by LIC, removing coding artifacts and improving their quality as reference frames.
- **IMA (Inter Machine Adapter)**: Processes the inter frames reconstructed by CVC to improve their performance on machine vision tasks.
- **Fallback Mode**: At extremely low bitrates, the LIC branch is disabled and only CVC is used for coding, to maintain coding efficiency.

Through these innovations, the paper addresses the problem of efficiently compressing image and video data for machine vision tasks and provides new ideas for future video coding standards.
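
The component roles described above can be illustrated with a small routing sketch. This is not the authors' implementation: it only records which codec and which adapter each frame would pass through, following the summary (LIC plus IHA for intra frames, CVC plus IMA for inter frames, CVC alone in fallback mode). The names `FramePlan`, `plan_frame`, `plan_sequence` and the `intra_period` parameter are hypothetical, and whether any adapter stays active in fallback mode is not stated in the summary, so the sketch assumes none does.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class FramePlan:
    """Which components a frame passes through (illustrative only)."""
    index: int
    codec: str               # "LIC" or "CVC"
    adapter: Optional[str]   # "IHA", "IMA", or None


def plan_frame(index: int, is_intra: bool, fallback: bool) -> FramePlan:
    """Pick the codec and adapter for one frame."""
    if fallback:
        # Fallback mode: the LIC branch is disabled and only CVC is used.
        # Assumption: no adapter is applied here (not stated in the summary).
        return FramePlan(index, "CVC", None)
    if is_intra:
        # Intra frame: coded by LIC, then cleaned up by IHA so that it
        # can serve as a reference frame.
        return FramePlan(index, "LIC", "IHA")
    # Inter frame: coded by CVC, then enhanced by IMA for machine tasks.
    return FramePlan(index, "CVC", "IMA")


def plan_sequence(num_frames: int, intra_period: int,
                  fallback: bool = False) -> List[FramePlan]:
    """Plan a sequence where every `intra_period`-th frame is an intra frame.

    `intra_period` is a hypothetical parameter chosen for this sketch.
    """
    if intra_period < 1:
        raise ValueError("intra_period must be at least 1")
    return [
        plan_frame(i, i % intra_period == 0, fallback)
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    for plan in plan_sequence(num_frames=4, intra_period=2):
        print(plan)
```

Keeping the routing decision in one small function makes the fallback switch explicit: the same sequence can be planned with `fallback=True` to see every frame routed through CVC only.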