End-to-End Learnable Multi-Scale Feature Compression for VCM

Yeongwoong Kim, Hyewon Jeong, Janghyun Yu, Younhee Kim, Jooyoung Lee, Se Yoon Jeong, Hui Yong Kim

DOI: https://doi.org/10.1109/TCSVT.2023.3302858

2023-08-08

Abstract: The proliferation of deep learning-based machine vision applications has given rise to a new type of compression, so-called video coding for machines (VCM). VCM differs from traditional video coding in that it is optimized for machine vision performance instead of human visual quality. In the feature compression track of MPEG-VCM, multi-scale features extracted from images are subject to compression. Recent feature compression works have demonstrated that the versatile video coding (VVC) standard-based approach can achieve a BD-rate reduction of up to 96% against the MPEG-VCM feature anchor. However, it is still sub-optimal, as VVC was designed for natural images rather than for extracted features. Moreover, the high encoding complexity of VVC makes it difficult to design a lightweight encoder without sacrificing performance. To address these challenges, we propose a novel multi-scale feature compression method that enables both end-to-end optimization on the extracted features and the design of lightweight encoders. The proposed model combines a learnable compressor with a multi-scale feature fusion network so that the redundancy in the multi-scale features is effectively removed. Instead of simply cascading the fusion network and the compression network, we integrate the fusion and encoding processes in an interleaved way. Our model first encodes a larger-scale feature to obtain a latent representation and then fuses the latent with a smaller-scale feature. This process is performed successively until the smallest-scale feature is fused, and the encoded latent at the final stage is then entropy-coded for transmission. The results show that our model outperforms previous approaches by at least 52% BD-rate reduction and has $5\times$ to $27\times$ less encoding time for object detection...

Computer Vision and Pattern Recognition, Image and Video Processing

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issue of multi-scale feature compression in machine vision applications and proposes a new end-to-end trainable multi-scale feature compression method (L-MSFC). The main problems include:

1. **Limitations of Traditional Video Coding Standards**:
   - Existing video coding standards (such as VVC) are primarily optimized for natural images, not for extracted feature maps.
   - These standards have high encoding complexity, making it difficult to design lightweight encoders without sacrificing performance.
2. **Redundancy in Multi-Scale Feature Maps**:
   - There is redundancy among multi-scale feature maps, and existing methods struggle to effectively eliminate this redundancy.
3. **Incomplete End-to-End Training**:
   - Using traditional video codecs does not allow for learning encoding noise during the training phase.

To address the above challenges, the authors propose a new multi-scale feature compression framework that compactly integrates multi-scale feature fusion and compression, enabling end-to-end training. Experimental results show that this method significantly outperforms existing methods in object detection and instance segmentation tasks, and it has faster encoding times. Additionally, the model can achieve near-lossless task performance with a minimal amount of data.
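The interleaved encode-then-fuse order described in the abstract can be illustrated with a small toy sketch. This is not the authors' network: the function names (`encode_downsample`, `fuse`, `interleaved_encode`) are hypothetical, 2x2 average pooling stands in for a learned encoder stage, element-wise averaging stands in for a learned fusion module, and rounding stands in for the quantization that would precede entropy coding. It only assumes single-channel feature maps ordered from largest to smallest scale, each half the resolution of the previous one.

```python
from typing import List

Grid = List[List[float]]  # one single-channel feature map (toy assumption)


def encode_downsample(x: Grid) -> Grid:
    """Halve the resolution with 2x2 average pooling.

    Stand-in (assumption) for a learned, strided encoder stage.
    """
    h, w = len(x) // 2, len(x[0]) // 2
    return [
        [
            (x[2 * i][2 * j] + x[2 * i][2 * j + 1]
             + x[2 * i + 1][2 * j] + x[2 * i + 1][2 * j + 1]) / 4.0
            for j in range(w)
        ]
        for i in range(h)
    ]


def fuse(latent: Grid, feature: Grid) -> Grid:
    """Merge a latent with a same-sized feature map by averaging.

    Stand-in (assumption) for a learned fusion module.
    """
    assert len(latent) == len(feature) and len(latent[0]) == len(feature[0])
    return [
        [(a + b) / 2.0 for a, b in zip(l_row, f_row)]
        for l_row, f_row in zip(latent, feature)
    ]


def interleaved_encode(features: List[Grid]) -> List[List[int]]:
    """Encode the largest scale, fuse with the next smaller scale, and repeat.

    `features` is ordered from largest to smallest scale. The final latent
    is rounded, a toy substitute for quantization before entropy coding
    (entropy coding itself is omitted here).
    """
    latent = features[0]
    for smaller in features[1:]:
        latent = encode_downsample(latent)  # encode the current latent
        latent = fuse(latent, smaller)      # then fuse the smaller-scale feature
    return [[round(v) for v in row] for row in latent]
```

Calling `interleaved_encode` on a list of three maps of sizes 4x4, 2x2, and 1x1 yields a single 1x1 quantized latent, which mirrors the idea that only the final-stage latent is transmitted, rather than one bitstream per scale.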

End-to-End Learned Scalable Multilayer Feature Compression for Machine Vision Tasks

Learnt Mutual Feature Compression for Machine Vision

Video Coding for Machines: A Paradigm of Collaborative Compression and Intelligent Analytics

Video Coding for Machines: Compact Visual Representation Compression for Intelligent Collaborative Analytics

An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal

Residual based hierarchical feature compression for multi-task machine vision

Multi-Scale Feature Prediction with Auxiliary-Info for Neural Image Compression

MFLFC: Multi-Frame Fusion Based Low-Resolution Feature Compression for Object Tracking

Hybrid Single Input and Multiple Output Method for Compressing Features Towards Machine Vision Tasks

A Slimmable Framework for Practical Neural Video Compression

Deep Predictive Video Compression Using Mode-Selective Uni- and Bi-Directional Predictions Based on Multi-Frame Hypothesis

FVC: An End-to-End Framework Towards Deep Video Compression in Feature Space

DMVC: Multi-Camera Video Compression Network aimed at Improving Deep Learning Accuracy

Learning-Based Scalable Image Compression With Latent-Feature Reuse and Prediction

Multiscale Motion-Aware and Spatial-Temporal-Channel Contextual Coding Network for Learned Video Compression

Hierarchical Image Feature Compression for Machines via Feature Sparsity Learning

Small Object-Aware Video Coding for Machines Via Feature-Motion Synergy

Neural Video Coding Using Multiscale Motion Compensation and Spatiotemporal Context Model

DVC: An End-to-end Deep Video Compression Framework

Deep Image Compression Toward Machine Vision: A Unified Optimization Framework