End-to-End Learnable Multi-Scale Feature Compression for VCM

Yeongwoong Kim, Hyewon Jeong, Janghyun Yu, Younhee Kim, Jooyoung Lee, Se Yoon Jeong, Hui Yong Kim
DOI: https://doi.org/10.1109/TCSVT.2023.3302858
2023-08-08
Abstract: The proliferation of deep learning-based machine vision applications has given rise to a new type of compression, so-called video coding for machines (VCM). VCM differs from traditional video coding in that it is optimized for machine vision performance instead of human visual quality. In the feature compression track of MPEG-VCM, multi-scale features extracted from images are subject to compression. Recent feature compression works have demonstrated that the versatile video coding (VVC) standard-based approach can achieve a BD-rate reduction of up to 96% against the MPEG-VCM feature anchor. However, it is still sub-optimal, as VVC was designed not for extracted features but for natural images. Moreover, the high encoding complexity of VVC makes it difficult to design a lightweight encoder without sacrificing performance. To address these challenges, we propose a novel multi-scale feature compression method that enables both end-to-end optimization on the extracted features and the design of lightweight encoders. The proposed model combines a learnable compressor with a multi-scale feature fusion network so that the redundancy in the multi-scale features is effectively removed. Instead of simply cascading the fusion network and the compression network, we integrate the fusion and encoding processes in an interleaved way. Our model first encodes a larger-scale feature to obtain a latent representation and then fuses the latent with a smaller-scale feature. This process is performed successively until the smallest-scale feature is fused, and then the encoded latent at the final stage is entropy-coded for transmission. The results show that our model outperforms previous approaches by at least 52% BD-rate reduction and has $5\times$ to $27\times$ less encoding time for object detection...
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses multi-scale feature compression for machine vision applications and proposes a new end-to-end trainable multi-scale feature compression method (L-MSFC). The main problems include:

1. **Limitations of Traditional Video Coding Standards**:
   - Existing video coding standards (such as VVC) are primarily optimized for natural images, not for extracted feature maps.
   - These standards have high encoding complexity, making it difficult to design lightweight encoders without sacrificing performance.
2. **Redundancy in Multi-Scale Feature Maps**:
   - There is redundancy among multi-scale feature maps, and existing methods struggle to eliminate it effectively.
3. **Incomplete End-to-End Training**:
   - When a traditional video codec is used, the model cannot learn to account for coding noise during the training phase.

To address these challenges, the authors propose a multi-scale feature compression framework that compactly integrates multi-scale feature fusion and compression, enabling end-to-end training. Experimental results show that this method significantly outperforms existing methods in object detection and instance segmentation tasks and has faster encoding times. Additionally, the model can achieve near-lossless task performance with a minimal amount of data.
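
The interleaved encode-then-fuse order described in the abstract can be illustrated with a toy sketch. This is not the authors' network: the learned encoding layers are replaced here by 2x2 average pooling, the learned fusion block by an element-wise mean, and entropy coding is omitted. The names `encode_step`, `fuse`, and `interleaved_encode` are hypothetical, and the sketch assumes each scale is exactly half the size of the previous one. It only shows the control flow: encode the largest scale, fuse with the next smaller scale, and repeat until the smallest scale has been fused.

```python
from typing import List

# Toy single-channel feature map (H x W); real features have many channels.
FeatureMap = List[List[float]]


def encode_step(x: FeatureMap) -> FeatureMap:
    """Stand-in for a learned strided encoding layer: 2x2 average pooling."""
    h, w = len(x) // 2, len(x[0]) // 2
    return [
        [
            (x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def fuse(latent: FeatureMap, feature: FeatureMap) -> FeatureMap:
    """Stand-in for a learned fusion block: element-wise mean."""
    if len(latent) != len(feature) or len(latent[0]) != len(feature[0]):
        raise ValueError("latent and feature must share a spatial size")
    return [
        [(a + b) / 2.0 for a, b in zip(row_l, row_f)]
        for row_l, row_f in zip(latent, feature)
    ]


def interleaved_encode(features: List[FeatureMap]) -> FeatureMap:
    """Alternate encoding and fusion, from the largest to the smallest scale.

    `features` must be ordered from largest to smallest scale, with each
    scale half the size of the previous one (an assumption of this sketch).
    The returned latent is what would be entropy-coded for transmission.
    """
    if not features:
        raise ValueError("at least one feature map is required")
    latent = features[0]
    for smaller in features[1:]:
        latent = encode_step(latent)    # reduce to the next scale
        latent = fuse(latent, smaller)  # merge with that scale's feature
    return latent


if __name__ == "__main__":
    # Three constant-valued scales: 8x8, 4x4, and 2x2.
    pyramid = [[[v] * n for _ in range(n)] for n, v in ((8, 8.0), (4, 4.0), (2, 2.0))]
    print(interleaved_encode(pyramid))  # prints [[4.0, 4.0], [4.0, 4.0]]
```

Compared with cascading a separate fusion network and a compression network, this ordering lets each encoding step operate on a latent that already carries information from the larger scales, which is the design choice the abstract credits with removing redundancy across scales.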