Abstract: Image super-resolution aims to reconstruct a high-fidelity, high-resolution counterpart of a low-resolution (LR) image. In recent years, diffusion-based models have garnered significant attention due to the rich prior knowledge they encode. The success of diffusion models driven by general text prompts has validated the effectiveness of textual control in text-to-image generation. However, given the severe degradation commonly present in low-resolution images, coupled with the inherent randomness of diffusion models, current models struggle to adequately discern semantic and degradation information within severely degraded images. This often leads to problems such as semantic loss, visual artifacts, and visual hallucinations, which pose substantial challenges for practical use. To address these challenges, this paper proposes to leverage degradation-aligned language prompts for accurate, fine-grained, and high-fidelity image restoration. Complementary priors, including semantic content descriptions and degradation prompts, are explored. Specifically, on the one hand, an image-restoration prompt alignment decoder is proposed to automatically discern the degradation degree of LR images, thereby generating beneficial degradation priors for image restoration. On the other hand, richly tailored descriptions from a pretrained multimodal large language model elicit high-level semantic priors closely aligned with human perception, ensuring fidelity control for image restoration. Comprehensive comparisons with state-of-the-art methods are conducted on several popular synthetic and real-world benchmark datasets. The quantitative and qualitative analyses demonstrate that the proposed method achieves a new state-of-the-art level of perceptual quality. The source code and pre-trained parameters are publicly available at https://github.com/puppy210/DaLPSR.

Controlling Vision-Language Models for Multi-Task Image Restoration

Photo-Realistic Image Restoration in the Wild with Controlled Vision-Language Models

Multi-modal Degradation Feature Learning for Unified Image Restoration Based on Contrastive Learning

Improving Image Restoration through Removing Degradations in Textual Representations

Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning

Learning from History: Task-agnostic Model Contrastive Learning for Image Restoration

DAP-LED: Learning Degradation-Aware Priors with CLIP for Joint Low-light Enhancement and Deblurring

ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation

LLMRA: Multi-modal Large Language Model based Restoration Assistant

Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models

Diffusion Feedback Helps CLIP See Better

Toward a Holistic Evaluation of Robustness in CLIP Models

Continual Vision-Language Retrieval Via Dynamic Knowledge Rectification

Efficient Degradation-aware Any Image Restoration

DaLPSR: Leverage Degradation-Aligned Language Prompt for Real-World Image Super-Resolution

Improving CLIP Training with Language Rewrites

How Much Can CLIP Benefit Vision-and-Language Tasks?

DetailCLIP: Detail-Oriented CLIP for Fine-Grained Tasks

A Progressive Framework of Vision-language Knowledge Distillation and Alignment for Multilingual Scene

Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization