Abstract: Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have been popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT -- an All-in-One Image Processing Transformer that could effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. Rather than popular feature adaptation methods, we propose weight modulation that adapts weights to specific tasks. Firstly, we figure out task-sensitive weights via a toy experiment and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be task experts, Instruct-IPT could better cooperate between tasks with distinct characteristics at humble costs. Further, we propose to maneuver Instruct-IPT with text instructions for better user interfaces. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at https://github.com/huawei-noah/Pretrained-IPT.
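
The weight-modulation idea in the abstract can be written compactly. The notation below is an assumption made for illustration (the symbols $W$, $A_t$, $B_t$, $r$ are not taken from the paper): a shared weight $W$ receives a task-specific bias for task $t$, and that bias is stored as a low-rank product.

$$
W_t = W + \Delta W_t, \qquad \Delta W_t \approx A_t B_t, \qquad A_t \in \mathbb{R}^{d_{\text{out}} \times r},\; B_t \in \mathbb{R}^{r \times d_{\text{in}}},\; r \ll \min(d_{\text{out}}, d_{\text{in}})
$$

Under this reading, each task adds $r\,(d_{\text{out}} + d_{\text{in}})$ parameters per modulated weight instead of $d_{\text{out}}\, d_{\text{in}}$. For a hypothetical layer with $d_{\text{out}} = d_{\text{in}} = 512$ and $r = 8$, that is $8 \times 1024 = 8192$ parameters rather than $262144$, about 3.1% of a full-rank bias.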

What problem does this paper attempt to address?

This paper aims to overcome the limitations of existing All-in-One image restoration models in task scope and performance. Specifically, it proposes a new method, Instruct-IPT (an All-in-One Image Processing Transformer via Weight Modulation), to address the following problems:

1. **Limited task scope**: Existing All-in-One models can usually handle only a few low-level vision tasks (such as denoising, deblurring, and deraining) and cannot cover a wider range of image restoration tasks.
2. **Limited task performance**: When multiple tasks share the same model, the tasks can interfere with one another, so each task performs worse than it would with a single-task model.
3. **Insufficient adaptability**: Existing methods perform poorly on tasks that differ greatly in nature. For example, feature adaptation works well for closely related tasks (such as denoising and deraining) but poorly for dissimilar tasks (such as denoising and motion-blur removal).

To solve these problems, the authors propose Instruct-IPT, which adapts to a variety of image restoration tasks through weight modulation. The specific measures are:

- **Weight modulation**: Unlike feature adaptation, Instruct-IPT makes task-specific adjustments to the model weights instead of modifying the intermediate features. The task-specific biases are added on top of weights identified as task-sensitive, which lets the model better handle tasks that differ greatly in nature.
- **Low-rank decomposition**: To reduce the number of parameters, the authors decompose each task-specific bias term into the product of two low-rank matrices, with the compression strategy guided by a rank analysis.
- **Synchronous training**: The task-general backbone and the task-specific bias terms are updated simultaneously, so the model learns general knowledge and task-specific knowledge together.
- **Text instructions**: Natural language commands serve as the user interface, so the model can follow complex task requirements given in the user's instructions.

Through these improvements, Instruct-IPT achieves better performance on multiple low-level vision tasks, with better generalization and robustness.
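
The three training-related measures above (weight modulation, low-rank decomposition, synchronous training) can be sketched on a single linear layer. This is a minimal toy in plain Python, not the authors' implementation: the class `ModulatedLinear`, its method names, the zero initialization of the `A` factor, and the squared-error loss are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def axpy(alpha, a, b):
    """Return alpha * a + b element-wise for two same-shaped matrices."""
    return [[alpha * x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix with entries in [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ModulatedLinear:
    """Shared weight W plus a per-task low-rank bias A_t @ B_t (hypothetical layout)."""

    def __init__(self, d_out, d_in, rank, tasks, seed=0):
        rng = random.Random(seed)
        self.shape = (d_out, d_in, rank)
        self.weight = rand_matrix(d_out, d_in, rng)  # task-general backbone weight
        # Assumption: A starts at zero, so every task begins at the shared weight.
        self.a = {t: [[0.0] * rank for _ in range(d_out)] for t in tasks}
        self.b = {t: rand_matrix(rank, d_in, rng) for t in tasks}

    def task_weight(self, task):
        """Effective weight W + A_t B_t for one task."""
        return axpy(1.0, matmul(self.a[task], self.b[task]), self.weight)

    def forward(self, x, task):
        """Apply the task-modulated weight to an input vector x."""
        return [sum(w * v for w, v in zip(row, x)) for row in self.task_weight(task)]

    def loss(self, x, target, task):
        """Squared-error loss 0.5 * ||y - target||^2 for one sample."""
        return 0.5 * sum((y - t) ** 2 for y, t in zip(self.forward(x, task), target))

    def sync_step(self, x, target, task, lr=0.1):
        """One gradient step that updates W, A_t and B_t in the same pass."""
        err = [y - t for y, t in zip(self.forward(x, task), target)]
        grad_w = [[e * v for v in x] for e in err]        # gradient w.r.t. W + A_t B_t
        grad_a = matmul(grad_w, transpose(self.b[task]))  # chain rule through A_t B_t
        grad_b = matmul(transpose(self.a[task]), grad_w)
        self.weight = axpy(-lr, grad_w, self.weight)      # shared backbone update
        self.a[task] = axpy(-lr, grad_a, self.a[task])    # task-specific updates
        self.b[task] = axpy(-lr, grad_b, self.b[task])

    def extra_params_per_task(self):
        """Parameters added per task by the low-rank factors."""
        d_out, d_in, rank = self.shape
        return rank * (d_out + d_in)


layer = ModulatedLinear(d_out=4, d_in=4, rank=2, tasks=["denoise", "derain"])
x, target = [1.0] * 4, [1.0] * 4
before = layer.loss(x, target, "denoise")
layer.sync_step(x, target, "denoise")
after = layer.loss(x, target, "denoise")
print(before, after, layer.extra_params_per_task())
```

In this toy, a step on the `denoise` task changes the shared weight and the `denoise` factors but leaves the `derain` factors untouched, which is the intended split between general and task-specific parameters. A real model would apply this only to the weights found to be task-sensitive and would select the task from a text instruction; neither is modeled here.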

Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation
