Abstract: We introduce X-Adapter, a universal upgrader to enable the pretrained plug-and-play modules (e.g., ControlNet, LoRA) to work directly with the upgraded text-to-image diffusion model (e.g., SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail, X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally, X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter, we employ a null-text training strategy for the upgraded model. After training, we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies, X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together, thereby expanding the functionalities of the diffusion community. To verify the effectiveness of the proposed method, we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address the issue that in the era of large generative models, plugins (such as ControlNet, LoRA, etc.) cannot be directly compatible with new models when the base model is upgraded. Specifically, when a larger-scale base model (such as SDXL) is released, all downstream plugins need to be retrained to adapt to the new model, which requires a significant amount of time and resources. The paper proposes a universal adapter network (X-Adapter) that allows pre-trained plugins to be directly applied to upgraded models without additional retraining.

### Main Contributions

1. **Universal Plugin Compatibility**: A universal framework is proposed to make upgraded models compatible with pre-trained plugins. By introducing new training strategies and mapping layers, X-Adapter enables old plugins to work on new models without retraining.
2. **Performance Improvement**: Since new models are usually more powerful in terms of visual quality and text-image alignment, X-Adapter can enhance the performance of these plugins.
3. **Cross-Version Plugin Mixing**: By retaining the weights of both the base model and the upgraded model, plugins from different development stages can work together, thus expanding the application range of the plugins.

### Method Overview

1. **Task Definition**: Design a universal adapter (X-Adapter) so that plugins for the base stable diffusion model (such as Stable Diffusion v1.5) can be directly applied to the upgraded diffusion model (such as SDXL).
2. **Preliminary Knowledge: Latent Diffusion Models**: Introduce latent diffusion models (LDM), on which most open-source models are based.
3. **X-Adapter**: X-Adapter is built based on the base Stable Diffusion v1.5, maintaining full support for plugin connectors. Additional mapping networks are added in each decoder layer to map the features of the base model to the upgraded model, guiding the generation process.
4. **Training Strategy**: First, train the X-Adapter without plugins, then train the mapping layers by setting empty text prompts, allowing the X-Adapter to learn to guide the upgraded model.
5. **Inference Strategy**: A two-stage inference strategy is proposed, running for a period on the base model first, then switching to the upgraded model to improve image quality and plugin functionality fidelity.

### Experimental Results

1. **Quantitative Evaluation**: Compared to baseline methods (such as SDEdit), X-Adapter performs excellently in terms of image quality and plugin functionality retention.
2. **User Study**: User evaluation results show that X-Adapter outperforms other methods in terms of image quality and conditional fidelity.
3. **Multi-Plugin Qualitative Results**: Demonstrates the application effects of X-Adapter on various pre-trained plugins, including conditional generation, personalized styles, and image editing methods.
4. **Ablation Study**: Investigates the effects of inserting mapping layers in different modules and the impact of different fusion methods on guiding capabilities, verifying the effectiveness of the empty text training strategy and the two-stage inference strategy.

### Conclusion

X-Adapter successfully solves the compatibility issue of plugins when models are upgraded, not only improving the performance of plugins but also expanding their application range. Through a series of experiments and user studies, the effectiveness and practicality of this method are demonstrated.
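### Illustrative Sketch: Two-Stage Inference

The two-stage inference strategy from the Method Overview can be pictured with a toy control-flow sketch. This is not the authors' implementation: latents are plain lists of floats, and every name here (`two_stage_denoise`, `base_step`, `upgraded_step`, `map_features`, `to_upgraded_latent`, `switch_index`) is a hypothetical stand-in chosen for illustration. The sketch only shows the order of operations: the base model denoises alone for a while, its latent is converted to initialise the upgraded model, and from then on remapped base features guide each upgraded-model step.

```python
from typing import Callable, List

Latent = List[float]


def two_stage_denoise(
    z_base: Latent,
    base_step: Callable[[Latent, int], Latent],
    upgraded_step: Callable[[Latent, Latent, int], Latent],
    map_features: Callable[[Latent], Latent],
    to_upgraded_latent: Callable[[Latent], Latent],
    timesteps: List[int],
    switch_index: int,
) -> Latent:
    """Toy two-stage loop (assumed structure, not the paper's code)."""
    # Stage 1: only the base model (where the old plugins attach) denoises.
    for t in timesteps[:switch_index]:
        z_base = base_step(z_base, t)

    # Hand-off: derive the upgraded model's starting latent from the base latent,
    # so that both models begin stage 2 from aligned content.
    z_up = to_upgraded_latent(z_base)

    # Stage 2: both models run; remapped base features guide the upgraded model.
    for t in timesteps[switch_index:]:
        guidance = map_features(z_base)  # stand-in for the mapping layers
        z_up = upgraded_step(z_up, guidance, t)
        z_base = base_step(z_base, t)
    return z_up


# Toy stand-ins so the sketch runs; real models would be neural networks.
def halve(z: Latent, t: int) -> Latent:
    return [v / 2 for v in z]


def duplicate(z: Latent) -> Latent:
    # Crude "upsampling": the upgraded latent is larger than the base latent.
    return [v for v in z for _ in range(2)]


def guided_step(z: Latent, guidance: Latent, t: int) -> Latent:
    return [(v + g) / 2 for v, g in zip(z, guidance)]


result = two_stage_denoise(
    z_base=[8.0, 4.0],
    base_step=halve,
    upgraded_step=guided_step,
    map_features=duplicate,
    to_upgraded_latent=duplicate,
    timesteps=[3, 2, 1, 0],
    switch_index=2,
)
print(result)  # [1.5, 1.5, 0.75, 0.75]
```

In this toy, `switch_index` plays the role of the point at which inference switches from the base model alone to the guided upgraded model. How early or late to switch is a tuning decision that the summary above does not specify.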

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

ResAdapter: Domain Consistent Resolution Adapter for Diffusion Models

SpeedUpNet: A Plug-and-Play Adapter Network for Accelerating Text-to-Image Diffusion Models

3D-Adapter: Geometry-Consistent Multi-View Diffusion for High-Quality 3D Generation

SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models

Face Adapter for Pre-Trained Diffusion Models with Fine-Grained ID and Attribute Control

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

SwiftDiffusion: Efficient Diffusion Model Serving with Add-on Modules

One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks

Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models

XDLM: Cross-lingual Diffusion Language Model for Machine Translation

UniFL: Improve Latent Diffusion Model via Unified Feedback Learning

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

FRDiff : Feature Reuse for Universal Training-free Acceleration of Diffusion Models