Abstract: The recent progress in artificial intelligence has led to an ever-increasing usage of images and videos by machine analysis algorithms, mainly neural networks. Nonetheless, compression, storage and transmission of media have traditionally been designed considering human beings as the viewers of the content. Recent research on image and video coding for machine analysis has progressed mainly in two almost orthogonal directions. The first is represented by end-to-end (E2E) learned codecs which, while offering high performance on image coding, are not yet on par with state-of-the-art conventional video codecs and lack interoperability. The second direction considers using the Versatile Video Coding (VVC) standard or any other conventional video codec (CVC) together with pre- and post-processing operations targeting machine analysis. While the CVC-based methods benefit from interoperability and broad hardware and software support, the machine task performance is often lower than the desired level, particularly in low bitrates. This paper proposes a hybrid codec for machines called NN-VVC, which combines the advantages of an E2E-learned image codec and a CVC to achieve high performance in both image and video coding for machines. Our experiments show that the proposed system achieved up to -43.20% and -26.8% Bjøntegaard Delta rate reduction over VVC for image and video data, respectively, when evaluated on multiple different datasets and machine vision tasks. To the best of our knowledge, this is the first research paper showing a hybrid video codec that outperforms VVC on multiple datasets and multiple machine vision tasks.
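The Bjøntegaard Delta rate (BD-rate) quoted in the abstract is the average bitrate difference between two codecs at equal quality; in coding for machines, "quality" is usually a task metric such as mAP rather than PSNR. The following is a minimal sketch of the classic cubic-polynomial variant of that calculation, not the evaluation code used in the paper. The function name `bd_rate` and all rate and quality numbers are illustrative assumptions, and some evaluations use piecewise cubic interpolation instead of a single polynomial fit.

```python
import numpy as np


def bd_rate(rate_anchor, quality_anchor, rate_test, quality_test):
    """Average bitrate difference (%) of the test codec versus the anchor
    at equal quality. Negative values mean the test codec needs fewer bits."""
    log_rate_a = np.log(np.asarray(rate_anchor, dtype=float))
    log_rate_t = np.log(np.asarray(rate_test, dtype=float))
    q_a = np.asarray(quality_anchor, dtype=float)
    q_t = np.asarray(quality_test, dtype=float)

    # Fit log-rate as a cubic polynomial of quality for each codec.
    poly_a = np.polyfit(q_a, log_rate_a, 3)
    poly_t = np.polyfit(q_t, log_rate_t, 3)

    # Only the quality range covered by both curves is comparable.
    lo = max(q_a.min(), q_t.min())
    hi = min(q_a.max(), q_t.max())

    # Integrate each fitted curve over the shared quality range.
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)

    # Mean log-rate difference, converted back to a percentage.
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0


# Made-up example: the test codec reaches each quality level at half the bitrate.
anchor_rate = [100.0, 200.0, 400.0, 800.0]    # e.g. kbps (illustrative)
task_quality = [20.0, 30.0, 38.0, 44.0]       # e.g. mAP in percent (illustrative)
test_rate = [r / 2 for r in anchor_rate]
saving = bd_rate(anchor_rate, task_quality, test_rate, task_quality)
```

Because the fit is done on the logarithm of the rate, a codec that always needs half the bits of the anchor yields a BD-rate of about -50%, which is how a figure such as -43.20% should be read.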

What problem does this paper attempt to address?

The problem this paper addresses is how to compress image and video data more efficiently when the content is consumed by machine vision tasks, so as to save bandwidth and improve task performance. Specifically, the paper points out that although conventional video coding standards (such as HEVC or VVC) perform very well for human viewing, they are not well suited to machine vision tasks: at low bitrates in particular, machine task performance often falls below the desired level. The paper therefore proposes a hybrid codec, NN-VVC, which combines the advantages of an end-to-end learned image codec (LIC) and a conventional video codec (CVC) to improve image and video coding for machine vision tasks.

### Key problems solved in the paper:

1. **Limitations of conventional codecs**: Conventional video codecs (such as VVC) are optimized mainly for human viewing and fall short on machine vision tasks, especially at low bitrates.
2. **Limitations of end-to-end learned codecs**: Although the end-to-end learned image codec (LIC) performs very well in image coding, learned codecs are not yet on par with conventional codecs in video coding and lack interoperability.
3. **Combining the advantages of both**: By combining the LIC and the CVC, the hybrid codec NN-VVC aims for higher coding efficiency and better machine task performance.

### Main contributions of the paper:

- **Proposing the NN-VVC system**: The system uses the self-supervised learned image codec (LIC) for intra-frame coding, uses conventional video coding tools (CVC) for inter-frame coding, and provides a fallback mode when necessary.
- **Introducing adapters**: The intra human adapter (IHA) and the inter machine adapter (IMA) enhance, respectively, the intra frames reconstructed by the LIC and the inter frames reconstructed by the CVC.
- **Experimental verification**: Experiments on multiple datasets and machine vision tasks show that NN-VVC codes image and video data more efficiently than VVC, achieving Bjøntegaard Delta rate reductions of up to 43.20% for image data and 26.8% for video data.

### Specific technical details:

- **LIC (Learned Image Codec)**: A self-supervised learned image codec based on a convolutional neural network (CNN), used for intra-frame coding.
- **IHA (Intra Human Adapter)**: Processes the intra frames reconstructed by the LIC, removing coding artifacts and improving their quality as reference frames.
- **IMA (Inter Machine Adapter)**: Processes the inter frames reconstructed by the CVC to improve their performance in machine vision tasks.
- **Fallback Mode**: At extremely low bitrates, the LIC branch is disabled and only the CVC is used for coding, to maintain coding efficiency.

Together, these components address the problem of efficiently compressing image and video data for machine vision tasks and suggest directions for future video coding standards.
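The frame routing described in the technical details above can be sketched as follows. This is a hypothetical illustration built only from the summary, not the paper's implementation: the class `HybridCodecSketch`, the `intra_period` parameter, and the string-based "frames" are assumptions made so that the data flow can be traced. In particular, the sketch assumes that the machine task consumes the LIC reconstruction directly for intra frames, that the CVC reconstruction (before IMA) serves as the next reference, and that neither adapter is applied in fallback mode; the summary does not state these details.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class HybridCodecSketch:
    # Each callable stands in for a full encode+decode round trip.
    lic: Callable[[str], str]                  # learned image codec (intra)
    iha: Callable[[str], str]                  # intra human adapter
    cvc: Callable[[str, Optional[str]], str]   # conventional codec (frame, reference)
    ima: Callable[[str], str]                  # inter machine adapter
    fallback: bool = False                     # True: LIC branch disabled

    def code_sequence(self, frames: List[str], intra_period: int) -> List[str]:
        """Return the frames that would be handed to the machine task."""
        outputs: List[str] = []
        reference: Optional[str] = None
        for index, frame in enumerate(frames):
            is_intra = index % intra_period == 0
            if self.fallback:
                # Fallback mode: everything goes through the CVC only.
                recon = self.cvc(frame, None if is_intra else reference)
                reference = recon
                outputs.append(recon)
            elif is_intra:
                # Intra frame: LIC codes it; IHA cleans it up as a reference.
                recon = self.lic(frame)
                reference = self.iha(recon)
                outputs.append(recon)
            else:
                # Inter frame: CVC predicts from the reference; IMA adapts
                # the reconstruction for the machine task.
                recon = self.cvc(frame, reference)
                reference = recon
                outputs.append(self.ima(recon))
        return outputs


def make_tracing_codec(fallback: bool = False) -> HybridCodecSketch:
    """Build a codec whose stages only record which path a frame took."""
    return HybridCodecSketch(
        lic=lambda f: f"LIC({f})",
        iha=lambda f: f"IHA({f})",
        cvc=lambda f, ref: f"CVC({f}|ref={ref})",
        ima=lambda f: f"IMA({f})",
        fallback=fallback,
    )


normal_trace = make_tracing_codec().code_sequence(["f0", "f1"], intra_period=2)
fallback_trace = make_tracing_codec(fallback=True).code_sequence(["f0", "f1"], intra_period=2)
```

With the tracing stubs, `normal_trace` shows the intra frame passing through the LIC and the inter frame being predicted from the IHA-enhanced reference before IMA is applied, while `fallback_trace` contains only CVC stages.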

NN-VVC: Versatile Video Coding boosted by self-supervisedly learned image coding for machines
