Abstract: In this paper we propose a novel modification of Contrastive Language-Image Pre-Training (CLIP) guidance for the task of unsupervised backlit image enhancement. Our work builds on the state-of-the-art CLIP-LIT approach, which learns a prompt pair by constraining the text-image similarity between a prompt (negative/positive sample) and a corresponding image (backlit image/well-lit image) in the CLIP embedding space. Learned prompts then guide an image enhancement network. Based on the CLIP-LIT framework, we propose two novel methods for CLIP guidance. First, we show that instead of tuning prompts in the space of text embeddings, it is possible to directly tune their embeddings in the latent space without any loss in quality. This accelerates training and potentially enables the use of additional encoders that do not have a text encoder. Second, we propose a novel approach that does not require any prompt tuning. Instead, based on CLIP embeddings of backlit and well-lit images from training data, we compute the residual vector in the embedding space as a simple difference between the mean embeddings of the well-lit and backlit images. This vector then guides the enhancement network during training, pushing a backlit image towards the space of well-lit images. This approach further dramatically reduces training time, stabilizes training and produces high quality enhanced images without artifacts, both in supervised and unsupervised training regimes. Additionally, we show that residual vectors can be interpreted, revealing biases in training data, and thereby enabling potential bias correction.
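The residual vector described in the abstract is a difference of two mean embeddings, which can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names `mean_vector` and `residual_vector` are hypothetical, and the 3-dimensional lists are toy stand-ins for real CLIP image embeddings, which would come from a CLIP image encoder and have far more dimensions.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def residual_vector(well_lit_embs, backlit_embs):
    """Mean well-lit embedding minus mean backlit embedding."""
    mu_well = mean_vector(well_lit_embs)
    mu_back = mean_vector(backlit_embs)
    return [w - b for w, b in zip(mu_well, mu_back)]


# Toy embeddings (hypothetical values, not real CLIP outputs).
well_lit = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
backlit = [[0.1, 0.9, 0.0], [0.3, 0.7, 0.0]]

v_res = residual_vector(well_lit, backlit)

# Dot product with the residual as a crude "how well-lit" score.
score_well = sum(e * d for e, d in zip(well_lit[0], v_res))
score_back = sum(e * d for e, d in zip(backlit[0], v_res))
print(v_res, score_well, score_back)
```

In this toy setup the well-lit sample scores higher along the residual direction than the backlit one, which is the property a guidance signal would rely on. The dot-product score is only an illustration and is not claimed to be the loss used in the paper.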


### What problem does this paper attempt to solve?

This paper addresses the problem of backlit image enhancement. Specifically, the authors propose new CLIP-guided methods to improve the quality of images taken under backlit conditions. Backlighting occurs when the light source is behind the subject being photographed, so some areas of the image are underexposed and lose detail and contrast, which degrades the overall visual quality.

#### Challenges in backlit image enhancement

1. **Difficulty of manual correction**: Correcting images manually with photo-enhancement software requires professional skill and is time-consuming and labor-intensive.
2. **Limitations of automatic solutions**: Adjusting the brightness level globally overexposes areas that were originally well exposed. Spatially adaptive methods improve on this but still have limitations.
3. **Lack of paired data**: Datasets of backlit images with corresponding normal-light images are very difficult to obtain, so methods that can handle unpaired data are crucial.

#### Deficiencies of existing methods

- **CLIP-LIT**: One of the most advanced current methods, which guides the image-enhancement model with learned text prompts. However, CLIP-LIT requires multiple iterations of updating the prompts and fine-tuning the enhancement model, has a long training time, and is prone to artifacts.

#### Proposed new methods

To overcome these challenges, the authors propose two new CLIP-guided methods:

1. **CLIP-LIT-Latent**:
   - Learns vectors directly in the CLIP latent space instead of learning prompts in the text-embedding space.
   - Accelerates training and potentially allows the use of other visual encoders that have no text encoder.
2. **RAVE (Residual Vector Embedding)**:
   - Computes a residual vector in the CLIP embedding space as the difference between the mean embeddings of well-lit and backlit images.
   - Uses this residual vector to guide the enhancement model, pushing a backlit image towards the space of well-lit images.
   - Significantly reduces training time, stabilizes training, and generates high-quality enhanced images with almost no artifacts.

#### Main contributions

1. Propose two new CLIP-guided methods (CLIP-LIT-Latent and RAVE) that work directly in the latent space for backlit image enhancement.
2. Demonstrate results of training these methods on paired and unpaired datasets, with better quality and a significant reduction in training time.
3. Show that the guiding vector used by RAVE is interpretable and can reveal biases in the training data, which opens the way to further improvement.

Through these improvements, the authors aim to provide a more efficient, higher-quality automated solution for backlit image enhancement.
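The CLIP-LIT-Latent idea summarized above, scoring an image against a positive and a negative vector that live directly in the latent space, can be sketched as follows. This is a sketch under stated assumptions rather than the authors' implementation: the softmax over cosine similarities and the binary cross-entropy objective are assumed here, the names `well_lit_probability` and `pair_loss` are hypothetical, and the 2-dimensional vectors are toy stand-ins for CLIP embeddings. No optimizer is shown; the code only evaluates the objective that such latent vectors would be tuned against.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def well_lit_probability(image_emb, pos_vec, neg_vec):
    """Softmax over similarities to the positive and negative latent vectors."""
    s_pos = math.exp(cosine(image_emb, pos_vec))
    s_neg = math.exp(cosine(image_emb, neg_vec))
    return s_pos / (s_pos + s_neg)


def pair_loss(image_embs, labels, pos_vec, neg_vec):
    """Mean binary cross-entropy; label 1 = well-lit, 0 = backlit."""
    total = 0.0
    for emb, y in zip(image_embs, labels):
        p = well_lit_probability(emb, pos_vec, neg_vec)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(image_embs)


# Toy image embeddings (hypothetical): first is well-lit, second is backlit.
images = [[0.9, 0.1], [0.1, 0.9]]
labels = [1, 0]
pos_vec = [1.0, 0.0]  # candidate "well-lit" latent vector
neg_vec = [0.0, 1.0]  # candidate "backlit" latent vector

aligned = pair_loss(images, labels, pos_vec, neg_vec)
swapped = pair_loss(images, labels, neg_vec, pos_vec)
print(aligned, swapped)
```

The loss is lower when the positive vector lies near the well-lit embedding and the negative vector near the backlit one than when the two are swapped. Tuning the pair would mean moving `pos_vec` and `neg_vec` to reduce this loss, with no text encoder involved.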

RAVE: Residual Vector Embedding for CLIP-Guided Backlit Image Enhancement

Iterative Prompt Learning for Unsupervised Backlit Image Enhancement

CLIP Guided Image-perceptive Prompt Learning for Image Enhancement

CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision

DAP-LED: Learning Degradation-Aware Priors with CLIP for Joint Low-light Enhancement and Deblurring

VT-CLIP: Enhancing Vision-Language Models with Visual-guided Texts

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Unsupervised Image Prior via Prompt Learning and CLIP Semantic Guidance for Low-Light Image Enhancement

Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

How Much Can CLIP Benefit Vision-and-Language Tasks?

Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders

RadCLIP: Enhancing Radiologic Image Analysis through Contrastive Language-Image Pre-training

Contrastive Localized Language-Image Pre-Training

ProtoCLIP: Prototypical Contrastive Language Image Pretraining

Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision

CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances

Diffusion Feedback Helps CLIP See Better

CLIP-PING: Boosting Lightweight Vision-Language Models with Proximus Intrinsic Neighbors Guidance