Preprocessing Enhanced Image Compression for Machine Vision

Guo Lu, Xingtong Ge, Tianxiong Zhong, Jing Geng, Qiang Hu
DOI: https://doi.org/10.48550/arXiv.2206.05650
2022-06-12
Abstract: Recently, more and more images are compressed and sent to back-end devices for machine analysis tasks (e.g., object detection) instead of being purely watched by humans. However, most traditional or learned image codecs are designed to minimize distortion as perceived by the human visual system, without considering the increased demand from machine vision systems. In this work, we propose a preprocessing enhanced image compression method for machine vision tasks to address this challenge. Instead of relying on learned image codecs for end-to-end optimization, our framework is built upon traditional non-differentiable codecs, which means it is standard-compatible and can be easily deployed in practical applications. Specifically, we propose a neural preprocessing module before the encoder to maintain the useful semantic information for the downstream tasks and suppress the irrelevant information for bitrate saving. Furthermore, our neural preprocessing module is quantization adaptive and can be used at different compression ratios. More importantly, to jointly optimize the preprocessing module with the downstream machine vision tasks, we introduce a proxy network for the traditional non-differentiable codecs in the back-propagation stage. We provide extensive experiments by evaluating our compression method on two representative downstream tasks with different backbone networks. Experimental results show our method achieves a better trade-off between the coding bitrate and the performance of the downstream machine vision tasks by saving about 20% bitrate.
Image and Video Processing, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem this paper addresses is: **how to maintain the performance of machine vision tasks (such as object detection and image classification) while reducing the bit rate of image transmission**. With the successful application of deep neural networks, more and more images are captured by front-end devices (such as cameras) and sent to the back end (such as cloud servers) for machine analysis. However, existing traditional and learned image codecs mainly aim to minimize distortion for the human visual system (measured, for example, by PSNR) and do not fully consider the requirements of machine vision systems. In addition, most traditional codecs are non-differentiable and cannot be jointly optimized with neural-network-based machine analysis methods. The existing compression-then-analysis pipeline may therefore be suboptimal when the main concern is the performance of downstream machine analysis.

To solve these problems, the authors propose a **preprocessing-enhanced image compression method** that aims to provide a better trade-off between bit rate and performance for machine vision tasks. The specific contributions of this method include:

1. **Neural preprocessing module based on traditional codecs**: A neural preprocessing module (NPP) placed before the encoder generates a filtered image, enabling traditional codecs to compress the image more effectively while retaining semantic information useful for machine perception.
2. **Proxy network for end-to-end optimization**: The authors introduce a learned proxy network that approximates the traditional codec and propagates gradients to the preprocessing module during the training phase. This allows the preprocessing module to be optimized to retain meaningful semantic information and reduce irrelevant information, thereby achieving a better trade-off between bit rate and performance.
3. **Experimental verification**: Through extensive experiments, the authors demonstrate the superiority of this method on two representative machine vision tasks (object detection and image classification), saving about 20% of the bit rate while maintaining the same accuracy.

### Formula summary

The main formulas in the paper are as follows:

- Loss function: \[ L = R_t+\lambda D_m+\beta D_{pre} \] where \(R_t\) represents the encoding bit rate of the traditional codec, \(D_m\) represents the downstream machine vision task loss based on the reconstructed image \(\hat{X}\), \(D_{pre}\) represents the distortion between the input image \(X\) and the enhanced image \(\bar{X}\), and \(\lambda\) and \(\beta\) are hyperparameters that control the trade-off between the loss terms.
- Loss function of the proxy network: \[ L_p = R_p+\lambda_p d(\hat{X},\hat{Y}) \] where \(R_p\) represents the bit rate of the proxy network, \(d(\hat{X},\hat{Y})\) represents the distortion between the BPG codec-reconstructed image \(\hat{X}\) and the proxy network-reconstructed image \(\hat{Y}\), and \(\lambda_p\) is a hyperparameter.

The first objective trains the preprocessing module to balance bit rate, task performance, and fidelity to the input. The second trains the proxy network to imitate the traditional codec so that gradients can reach the preprocessing module during training.
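
As a rough illustration, the two objectives from the formula summary can be evaluated on toy numbers as sketched below. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, the distortion \(d\) and \(D_{pre}\) are assumed to be mean squared error, and the rates and task loss are passed in as plain numbers rather than computed by a codec, a proxy network, or a detector. In actual training these quantities would be differentiable tensors.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized flat pixel sequences.

    Assumption: MSE stands in for the distortion terms; the summary does
    not state which distortion measure the paper uses.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def preprocessing_loss(rate_t: float, task_loss: float,
                       pre_distortion: float, lam: float, beta: float) -> float:
    """L = R_t + lambda * D_m + beta * D_pre."""
    return rate_t + lam * task_loss + beta * pre_distortion


def proxy_loss(rate_p: float, x_hat: Sequence[float],
               y_hat: Sequence[float], lam_p: float) -> float:
    """L_p = R_p + lambda_p * d(X_hat, Y_hat), with d assumed to be MSE."""
    return rate_p + lam_p * mse(x_hat, y_hat)


# Toy flattened "images", chosen only to exercise the formulas.
x = [0.0, 0.5, 1.0, 0.5]      # input image X
x_bar = [0.0, 0.5, 0.5, 0.5]  # preprocessed (enhanced) image X_bar
x_hat = [0.0, 0.5, 0.5, 0.5]  # reconstruction from the traditional codec
y_hat = [0.0, 0.0, 0.5, 0.5]  # reconstruction from the proxy network

d_pre = mse(x, x_bar)
total = preprocessing_loss(rate_t=0.5, task_loss=2.0,
                           pre_distortion=d_pre, lam=1.0, beta=4.0)
l_p = proxy_loss(rate_p=0.25, x_hat=x_hat, y_hat=y_hat, lam_p=8.0)

print(f"D_pre = {d_pre}, L = {total}, L_p = {l_p}")
```

Raising `lam` weights the downstream task loss more heavily relative to bit rate, while raising `beta` keeps the preprocessed image closer to the input; the values used here are arbitrary.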