Transfer CLIP for Generalizable Image Denoising

Jun Cheng, Dong Liang, Shan Tan
2024-03-22
Abstract: Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recent emergence of the contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation. Yet, the potential for leveraging CLIP to enhance the robustness of low-level tasks remains largely unexplored. This paper uncovers that certain dense features extracted from the frozen ResNet image encoder of CLIP exhibit distortion-invariant and content-related properties, which are highly desirable for generalizable denoising. Leveraging these properties, we devise an asymmetrical encoder-decoder denoising network, which incorporates dense features including the noisy image and its multi-scale features from the frozen ResNet encoder of CLIP into a learnable image decoder to achieve generalizable denoising. A progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons conducted across diverse OOD noises, including synthetic noise, real-world sRGB noise, and low-dose CT image noise, demonstrate the superior generalization ability of our method.
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
This paper investigates whether CLIP (Contrastive Language-Image Pre-training) can be transferred to generalizable image denoising. Current deep learning denoisers, both supervised and self-supervised, handle in-distribution noise well but degrade on out-of-distribution (OOD) noise, i.e., noise that is not represented in the training data. The authors find that certain dense features from the frozen ResNet image encoder of CLIP are distortion-invariant and content-related, which makes them well suited to generalizable denoising. Building on this, the paper proposes an asymmetric encoder-decoder network that feeds the noisy image and its multi-scale features from the frozen CLIP ResNet encoder into a learnable image decoder. A progressive feature augmentation strategy is also proposed to mitigate feature overfitting and improve the robustness of the decoder. Experiments on synthetic noise, real-world sRGB noise, and low-dose CT image noise show that the method generalizes better to OOD noise.
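The asymmetric layout described above can be illustrated with a minimal PyTorch sketch. It is not the authors' implementation. `FrozenEncoder` is a small, randomly initialised stand-in for CLIP's ResNet image encoder, and the class names, channel widths, and fusion scheme are assumptions made for illustration. The progressive feature augmentation step is omitted because the text here does not specify it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenEncoder(nn.Module):
    """Hypothetical stand-in for CLIP's frozen ResNet image encoder."""

    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        stages, in_ch = [], 3
        for out_ch in channels:
            # Each stage halves the spatial resolution.
            stages.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                nn.ReLU()))
            in_ch = out_ch
        self.stages = nn.ModuleList(stages)
        for p in self.parameters():
            p.requires_grad_(False)  # frozen: never updated during training

    @torch.no_grad()
    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # multi-scale features, ordered fine to coarse


class Decoder(nn.Module):
    """Learnable decoder fusing the noisy image with the encoder features."""

    def __init__(self, channels=(16, 32, 64)):
        super().__init__()
        rev = list(channels[::-1])  # coarse to fine
        self.fuse = nn.ModuleList([
            nn.Conv2d(rev[i] + rev[i + 1], rev[i + 1], 3, padding=1)
            for i in range(len(rev) - 1)])
        self.head = nn.Conv2d(rev[-1] + 3, 3, 3, padding=1)

    def forward(self, noisy, feats):
        x = feats[-1]  # start from the coarsest feature map
        for conv, skip in zip(self.fuse, reversed(feats[:-1])):
            x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = F.relu(conv(torch.cat([x, skip], dim=1)))
        x = F.interpolate(x, size=noisy.shape[-2:], mode="bilinear",
                          align_corners=False)
        # The noisy image itself is treated as one more dense feature.
        return self.head(torch.cat([x, noisy], dim=1))


class Denoiser(nn.Module):
    """Asymmetric network: frozen encoder, trainable decoder."""

    def __init__(self):
        super().__init__()
        self.encoder = FrozenEncoder().eval()
        self.decoder = Decoder()

    def forward(self, noisy):
        return self.decoder(noisy, self.encoder(noisy))
```

In this sketch only the decoder parameters would be passed to the optimizer, e.g. `torch.optim.Adam(model.decoder.parameters())`. To use the actual CLIP features, the stand-in encoder would be replaced by the pretrained ResNet image encoder with its weights kept fixed.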