Abstract: In reality, images often exhibit multiple degradations, such as rain and fog at night (triple degradations). However, in many cases, individuals may not want to remove all degradations, for instance, a blurry lens revealing a beautiful snowy landscape (double degradations). In such scenarios, people may only desire to deblur. These situations and requirements shed light on a new challenge in image restoration, where a model must perceive and remove specific degradation types specified by human commands in images with multiple degradations. We term this task Referring Flexible Image Restoration (RFIR). To address this, we first construct a large-scale synthetic dataset called RFIR, comprising 153,423 samples with the degraded image, text prompt for specific degradation removal and restored image. RFIR consists of five basic degradation types: blur, rain, haze, low light and snow while six main sub-categories are included for varying degrees of degradation removal. To tackle the challenge, we propose a novel transformer-based multi-task model named TransRFIR, which simultaneously perceives degradation types in the degraded image and removes specific degradation upon text prompt. TransRFIR is based on two devised attention modules, Multi-Head Agent Self-Attention (MHASA) and Multi-Head Agent Cross Attention (MHACA), where MHASA and MHACA introduce the agent token and reach the linear complexity, achieving lower computation cost than vanilla self-attention and cross-attention and obtaining competitive performances. Our TransRFIR achieves state-of-the-art performances compared with other counterparts and is proven as an effective architecture for image restoration. We release our project at

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper addresses a new challenge in image restoration: removing specific types of degradation, chosen by the user through natural language instructions, from images that contain multiple degradations. Specifically, the paper proposes a task called **Referring Flexible Image Restoration (RFIR)**, which requires the model to identify and remove only the user-specified type of degradation rather than all degradations.

### Background and Motivation

In the real world, images often exhibit multiple degradations, such as rain and fog at night (triple degradation). However, in many cases, users may not want to remove all degradations. For example, in a beautiful snowy scene captured through a blurry lens (double degradation), users may only want to remove the blur. These situations and needs reveal a new challenge in image restoration, where the model must be able to perceive and remove specific types of degradation in images containing multiple degradations.

### Main Contributions

1. **Proposing a New Task**: The paper proposes a new, challenging image restoration task, Referring Flexible Image Restoration (RFIR), which aims to remove specific degradations and restore images based on given text prompts. This addresses the issue where users cannot control the extent of image restoration according to their intentions.
2. **Building a Large-Scale Dataset**: The paper establishes the first large-scale flexible image restoration dataset based on natural language, RFIR. This dataset covers 5 basic types of degradation (blur, rain, haze, low light, and snow), including single degradation as well as combinations of double and triple degradations. The dataset contains 153,423 samples, each consisting of a degraded image, a ground truth image, and the corresponding text prompt.
3. **Proposing a New Multi-Task Model**: The paper proposes an end-to-end multi-task model named TransRFIR, which can simultaneously perceive the different types of degradation present in an image and remove specific degradations guided by natural language prompts. Additionally, the paper proposes a lightweight and efficient cross-attention module, Multi-Head Agent Cross Attention (MHACA), which fuses image and text features with linear complexity. The paper reports that TransRFIR achieves state-of-the-art performance compared with other models.
4. **Transferability**: The TransRFIR framework proposed in the paper can be transferred to other U-Net-based image restoration networks for flexible image restoration.

### Method Overview

1. **Overall Pipeline**: TransRFIR first obtains shallow features through a 3×3 convolution, and then obtains latent features through a four-stage Hybrid Context Encoder (HCEncoder). In the final stage of HCEncoder, two branches perform multi-degradation perception classification and text feature fusion, respectively.
2. **Hybrid Context Block**: The paper proposes a Hybrid Context Block (HCBlock) for dynamically encoding anisotropic degradations in images. HCBlock includes a Multi-Head Agent Self-Attention (MHASA) module, which achieves global context encoding through agent tokens, reducing computational complexity.
3. **Multi-Head Agent Cross Attention**: The paper proposes a Multi-Head Agent Cross Attention (MHACA) module for efficiently fusing image and text features. MHACA achieves cross-attention with linear complexity through agent tokens, reducing computational costs.
4. **Multi-Degradation Perception**: The encoder of TransRFIR is used to dynamically encode background and various degradation information. Through a multi-task learning approach, the model simultaneously performs image restoration and multi-label classification to verify whether it has successfully perceived the different degradation features in the image.
5. **Training Objectives**: The paper formulates a multi-task optimization problem, including multi-degradation category classification and image restoration. For multi-degradation category classification, the Binary Cross-Entropy (BCE) function is used to optimize multi-label classification; for image restoration, the model is trained to restore the image based on the given text prompt.
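
To make the linear-complexity claim for MHASA and MHACA more concrete, here is a minimal single-head sketch of the general agent-attention idea in NumPy. It is not the paper's implementation. The summary does not say how agent tokens are produced, so average-pooling the queries is an assumption, as are the function names `pool_agents` and `agent_attention`, the `num_agents` parameter, and the choice of image tokens as queries in the cross-attention case. Multi-head splitting and learned projections are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def pool_agents(queries, num_agents):
    """Average-pool N query tokens into num_agents agent tokens.

    Assumption for illustration only; requires N >= num_agents so that
    every chunk is non-empty.
    """
    chunks = np.array_split(queries, num_agents, axis=0)
    return np.stack([chunk.mean(axis=0) for chunk in chunks])


def agent_attention(q, k, v, num_agents=4):
    """Single-head agent attention (hypothetical helper).

    q: (N, d) queries; k, v: (M, d) keys and values.
    Cost is O(n * (N + M) * d) for n agents, instead of O(N * M * d).
    """
    d = q.shape[-1]
    agents = pool_agents(q, num_agents)                        # (n, d)
    # Step 1: agents aggregate information from keys and values.
    agent_values = softmax(agents @ k.T / np.sqrt(d)) @ v      # (n, d)
    # Step 2: each query reads the aggregated context from the agents.
    return softmax(q @ agents.T / np.sqrt(d)) @ agent_values   # (N, d)


rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(64, 16))   # e.g. a flattened 8x8 feature map
text_tokens = rng.normal(size=(8, 16))     # e.g. 8 prompt tokens

# Self-attention style: queries, keys and values all come from the image.
self_out = agent_attention(image_tokens, image_tokens, image_tokens)
# Cross-attention style: image queries attend to text keys and values.
cross_out = agent_attention(image_tokens, text_tokens, text_tokens)
print(self_out.shape, cross_out.shape)
```

Because the number of agents is fixed and small, neither step ever builds the full N×M attention map, which is where the linear scaling in the number of tokens comes from.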
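
The multi-task training objective described above can be sketched in plain Python as follows. Only the use of BCE for multi-label degradation classification comes from the summary. The L1 restoration term, the weighting factor `cls_weight`, the label order, and all function names are assumptions chosen for illustration.

```python
import math

# Assumed label order for the five basic degradation types.
DEGRADATIONS = ["blur", "rain", "haze", "low light", "snow"]


def bce_multilabel(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over independent degradation labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


def l1_loss(restored, target):
    """Mean absolute error between two flat pixel lists (assumed restoration loss)."""
    return sum(abs(r - t) for r, t in zip(restored, target)) / len(target)


def multitask_loss(probs, labels, restored, target, cls_weight=0.1):
    """Restoration loss plus weighted classification loss (weight is hypothetical)."""
    return l1_loss(restored, target) + cls_weight * bce_multilabel(probs, labels)


# A blurry, snowy image: the classifier should flag blur and snow.
labels = [1, 0, 0, 0, 1]
predicted_probs = [0.9, 0.1, 0.2, 0.1, 0.8]
restored_pixels = [0.2, 0.5, 0.7]
target_pixels = [0.25, 0.5, 0.6]
print(multitask_loss(predicted_probs, labels, restored_pixels, target_pixels))
```

Each degradation type gets its own independent binary term, which is what lets a single image carry several labels at once (for example blur and snow together).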

Referring Flexible Image Restoration
