Hybrid Single Input and Multiple Output Method for Compressing Features Towards Machine Vision Tasks

Zifu Zhang, Shengxi Li, Tie Liu, Mai Xu, Tao Xu, Zhenyu Guan, Zhuoyi Lv
DOI: https://doi.org/10.1109/icip51287.2024.10647629
2024-01-01
Abstract: With the advance of deep learning in the big data era, image/video coding for machines (VCM), the subject of a call for proposals by the Moving Picture Experts Group (MPEG), has become a pivotal technique for a wide range of intelligent vision tasks. However, existing VCM methods typically compress features independently at each scale, ignoring the redundancy of features across multiple scales. This paper therefore introduces a simple yet effective architecture called hybrid single input and multiple output (H-SIMO) for VCM, which can significantly reduce the redundancy across feature scales. More specifically, since a pyramid structure is commonly employed for localising multi-scale objects, the H-SIMO method compresses all features from a single-scale input feature while retaining the ability to decompress the features at all scales. Moreover, an entropy model is integrated into the training process to reduce the statistical redundancy of the features. During the testing phase, a hybrid coding method, in conjunction with versatile video coding (VVC), is employed to compress the features from both images and videos. We comprehensively evaluate the performance of our H-SIMO method on two standard machine vision tasks, object detection and instance segmentation, and the experimental results verify its superior performance.
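The single-input, multiple-output data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' network: the pooling and upsampling operators, the quantisation step, the choice of the finest scale as the single input, the four-scale pyramid, and all function names are assumptions made purely for illustration, and the empirical entropy is only a stand-in for a learned entropy model.

```python
import numpy as np

def avg_pool2x(x):
    """Halve the spatial resolution of a (C, H, W) array by 2x2 average pooling."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """Double the spatial resolution of a (C, H, W) array by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def encode(single_scale_feature, step=0.5):
    """Single input: pool one feature map and quantise it to integer symbols."""
    latent = avg_pool2x(single_scale_feature)
    return np.round(latent / step).astype(np.int32)

def decode(symbols, num_scales=4, step=0.5):
    """Multiple outputs: rebuild a feature map at every pyramid scale from one latent."""
    latent = symbols.astype(np.float32) * step
    outputs = [upsample2x(latent)]               # finest scale, same size as the input
    for _ in range(num_scales - 1):
        outputs.append(avg_pool2x(outputs[-1]))  # each coarser scale is half the size
    return outputs

def empirical_bits(symbols):
    """Rate proxy: empirical entropy of the symbols multiplied by their count."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum() * symbols.size)

# Hypothetical finest-scale feature map: 8 channels, 64x64 spatial size.
rng = np.random.default_rng(0)
feature = rng.standard_normal((8, 64, 64)).astype(np.float32)

symbols = encode(feature)            # only this one latent would be transmitted
recon = decode(symbols)              # four feature maps, one per pyramid scale
shapes = [r.shape for r in recon]
bits = empirical_bits(symbols)
```

Only one latent is coded here, yet the decoder returns a feature map for each pyramid scale, which is the property the abstract attributes to H-SIMO. In the paper the encoder, decoder and entropy model are learned, and VVC is used in the hybrid coding stage at test time; none of that is reproduced in this sketch.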