Instruct-IPT: All-in-One Image Processing Transformer via Weight Modulation

Yuchuan Tian, Jianhong Han, Hanting Chen, Yuanyuan Xi, Guoyang Zhang, Jie Hu, Chao Xu, Yunhe Wang
2024-06-30
Abstract: Due to the unaffordable size and intensive computation costs of low-level vision models, All-in-One models that are designed to address a handful of low-level vision tasks simultaneously have been popular. However, existing All-in-One models are limited in terms of the range of tasks and performance. To overcome these limitations, we propose Instruct-IPT -- an All-in-One Image Processing Transformer that could effectively address manifold image restoration tasks with large inter-task gaps, such as denoising, deblurring, deraining, dehazing, and desnowing. Rather than popular feature adaptation methods, we propose weight modulation that adapts weights to specific tasks. Firstly, we figure out task-sensitive weights via a toy experiment and introduce task-specific biases on top of them. Secondly, we conduct rank analysis for a good compression strategy and perform low-rank decomposition on the biases. Thirdly, we propose synchronous training that updates the task-general backbone model and the task-specific biases simultaneously. In this way, the model is instructed to learn general and task-specific knowledge. Via our simple yet effective method that instructs the IPT to be task experts, Instruct-IPT could better cooperate between tasks with distinct characteristics at humble costs. Further, we propose to maneuver Instruct-IPT with text instructions for better user interfaces. We have conducted experiments on Instruct-IPT to demonstrate the effectiveness of our method on manifold tasks, and we have effectively extended our method to diffusion denoisers as well. The code is available at https://github.com/huawei-noah/Pretrained-IPT.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper targets the limited task range and performance of existing All-in-One image restoration models. It proposes Instruct-IPT, an All-in-One Image Processing Transformer built on weight modulation, to address the following problems:

1. **Limited task scope**: Existing All-in-One models usually handle only a few low-level vision tasks (such as denoising, deblurring, and deraining) and do not cover a wider range of image restoration tasks.
2. **Limited task performance**: When several tasks share one model, they can interfere with each other, so each task performs worse than it would with a dedicated single-task model.
3. **Insufficient adaptability**: Existing methods perform poorly on tasks that differ greatly in nature. For example, feature adaptation works well for closely related tasks (such as denoising and deraining) but poorly for dissimilar ones (such as denoising and motion-blur removal).

To solve these problems, Instruct-IPT adapts to different image restoration tasks through weight modulation. The specific measures are:

- **Weight modulation**: Unlike feature adaptation, Instruct-IPT applies task-specific adjustments to the model weights rather than to the intermediate features. This lets the model handle tasks that differ greatly in nature.
- **Low-rank decomposition**: To reduce the parameter count, the task-specific bias terms are decomposed into the product of two low-rank matrices.
- **Synchronous training**: The task-general backbone and the task-specific bias terms are updated simultaneously, so the model learns general and task-specific knowledge together.
- **Text instructions**: Natural-language commands serve as the user interface, so the model can be steered toward a task by the user's instructions.

Through these improvements, Instruct-IPT is reported to achieve better performance on multiple low-level vision tasks, with better generalization and robustness.
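The weight modulation, low-rank decomposition, and synchronous training ideas can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the additive form `W + A_t B_t`, the zero initialization of `B_t`, the squared-error loss, and every name here (`ModulatedLinear`, `sync_step`, `bias_params`) are assumptions made for illustration only.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def axpy(m, g, alpha):
    """Return m + alpha * g for equally shaped matrices."""
    return [[x + alpha * y for x, y in zip(rm, rg)] for rm, rg in zip(m, g)]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class ModulatedLinear:
    """Shared weight W plus one low-rank bias A_t @ B_t per task (illustrative only)."""

    def __init__(self, d_in, d_out, rank, tasks, seed=0):
        rng = random.Random(seed)
        self.weight = rand_matrix(d_out, d_in, rng)  # task-general, shared by all tasks
        # A_t is d_out x rank, B_t is rank x d_in. B_t starts at zero (a LoRA-style
        # convention, assumed here) so every task begins at the shared weight.
        self.factors = {
            t: (rand_matrix(d_out, rank, rng), [[0.0] * d_in for _ in range(rank)])
            for t in tasks
        }

    def effective_weight(self, task):
        a, b = self.factors[task]
        return axpy(self.weight, matmul(a, b), 1.0)

    def forward(self, x, task):
        """Apply the task-modulated weight to an input vector x."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight(task)]


def sync_step(layer, x, target, task, lr=0.01):
    """One SGD step that updates W and the task's (A_t, B_t) together.

    Returns the squared-error loss measured before the update.
    """
    a, b = layer.factors[task]
    err = [p - t for p, t in zip(layer.forward(x, task), target)]
    # Gradient of 0.5 * ||W_t x - target||^2 with respect to W_t = W + A_t B_t.
    g = [[e * xi for xi in x] for e in err]
    grad_a = matmul(g, transpose(b))  # chain rule through the product A_t B_t
    grad_b = matmul(transpose(a), g)
    layer.weight = axpy(layer.weight, g, -lr)
    layer.factors[task] = (axpy(a, grad_a, -lr), axpy(b, grad_b, -lr))
    return 0.5 * sum(e * e for e in err)


def bias_params(d_in, d_out, rank=None):
    """Per-task parameter count: full bias if rank is None, else rank-r factors."""
    return d_out * d_in if rank is None else rank * (d_out + d_in)


layer = ModulatedLinear(d_in=4, d_out=3, rank=1, tasks=["denoise", "derain"])
x, target = [1.0, 2.0, 3.0, 4.0], [2.0, -2.0, 2.0]
loss_before = sync_step(layer, x, target, "denoise")
loss_after = sync_step(layer, x, target, "denoise")
print(loss_before > loss_after)  # True
print(bias_params(64, 64), bias_params(64, 64, rank=4))  # 4096 512
```

In this toy setting, a step on the `"denoise"` task changes the shared weight and that task's factors but leaves the `"derain"` factors untouched. This is the separation between general and task-specific parameters that the summary describes. The parameter counts show why a low-rank factorization is attractive: for a 64×64 weight, a rank-4 bias needs 512 parameters per task instead of 4096.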