Abstract: Stable Diffusion and ControlNet have achieved excellent results in image generation and synthesis. However, because of the granularity and method of their control, they offer limited efficiency gains for professional artistic work such as comics and animation production, where the main task is secondary painting. In the current workflow, fixing characters and image styles often requires lengthy text prompts, and may even require further training through Textual Inversion, DreamBooth, or other methods, which is complicated and expensive for painters. Therefore, we present a new method in this paper, Stable Diffusion Reference Only, an images-to-image self-supervised model that uses only two types of conditional images for precisely controlled generation to accelerate secondary painting. The first type of conditional image serves as an image prompt, supplying the necessary conceptual and color information for generation. The second type is a blueprint image, which controls the visual structure of the generated image. It is natively embedded into the original UNet, eliminating the need for ControlNet. We released all the code for the module and pipeline, and trained a controllable character line art coloring model, available at https://github.com/aihao2000/stable-diffusion-reference-only, that achieved state-of-the-art results in this field. This verifies the effectiveness of the structure and greatly improves the production efficiency of animations, comics, and fanworks.

What problem does this paper attempt to address?

The paper primarily addresses problems in secondary creation (Secondary Painting) for animation, comics, and fan art, and proposes a new solution. Existing text-guided image generation technologies (such as Stable Diffusion and ControlNet) have achieved significant results in image generation, but they have limitations in professional art creation, especially in comic and animation production. Specifically, these issues include:

1. **Complexity and Cost Issues**: Generating images of specific characters or styles often requires complex text prompts, and additional training methods (such as Textual Inversion, DreamBooth, etc.) may be needed, which increases the workload and cost for artists.
2. **Limitations of Precise Control**: Current methods find it difficult to directly extract concepts from new images and apply them to the online generation process. Specific characters or image styles are often hard to describe clearly in words.

To address these issues, the paper proposes a new method called "Stable Diffusion Reference Only." This is a self-supervised model that achieves precise control over the generated images with only two types of conditional images, thereby accelerating the secondary creation process. These two types of conditional images are:

- **Image Prompt**: Provides the concept and color information needed for the generated image. For example, it can be a character design sheet.
- **Blueprint Image**: Controls the visual structure of the generated image. It is similar to the conditional image in ControlNet but does not require the same resource cost and additional training.

By embedding these two types of conditional images into the original UNet architecture, it is possible to generate new images with specific styles and characters without additional training. This method greatly simplifies the workflow and improves the efficiency of creating animation, comics, and fan art.
In summary, the paper aims to address the limitations of existing text-based image generation technologies in the field of secondary creation by introducing a new multi-condition diffusion model, enabling artists to create more efficiently.
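
To make the two conditioning paths more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The summary only states that the blueprint is embedded natively in the UNet and that the image prompt supplies concept and color information. The specific mechanisms used here (channel-wise concatenation of the blueprint with the noisy latent, and cross-attention over image prompt embeddings) are assumptions chosen for illustration, and the names `build_unet_input` and `cross_attend` are hypothetical.

```python
import numpy as np


def build_unet_input(noisy_latent, blueprint):
    """Assumed blueprint path: stack the blueprint onto the noisy latent
    along the channel axis, so the UNet sees structure at its input.

    noisy_latent: (batch, C, H, W); blueprint: (batch, Cb, H, W)
    """
    if (noisy_latent.shape[0] != blueprint.shape[0]
            or noisy_latent.shape[2:] != blueprint.shape[2:]):
        raise ValueError("batch size and spatial size must match")
    return np.concatenate([noisy_latent, blueprint], axis=1)


def cross_attend(queries, prompt_tokens):
    """Assumed image-prompt path: scaled dot-product cross-attention in which
    UNet features (queries) attend to image prompt embeddings.

    queries: (batch, Nq, D); prompt_tokens: (batch, Nk, D)
    Returns the attended features and the attention weights.
    """
    dim = queries.shape[-1]
    scores = queries @ prompt_tokens.transpose(0, 2, 1) / np.sqrt(dim)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ prompt_tokens, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.normal(size=(2, 4, 8, 8))      # illustrative latent shape
    blueprint = rng.normal(size=(2, 1, 8, 8))   # e.g. a one-channel line art map
    unet_input = build_unet_input(latent, blueprint)
    print(unet_input.shape)  # (2, 5, 8, 8)

    features = rng.normal(size=(2, 64, 16))     # flattened UNet features
    prompt = rng.normal(size=(2, 10, 16))       # image prompt embeddings
    attended, weights = cross_attend(features, prompt)
    print(attended.shape, weights.shape)
```

In this sketch the blueprint needs no separate control network because it enters through the UNet's own input, which is one way to read the claim that ControlNet becomes unnecessary. The released repository should be consulted for the actual architecture.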

Stable Diffusion Reference Only: Image Prompt and Blueprint Jointly Guided Multi-Condition Diffusion Model for Secondary Painting

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Self-driven Dual-path Learning for Reference-based Line Art Colorization under Limited Data

Selective Image Abstraction

DiLightNet: Fine-grained Lighting Control for Diffusion-based Image Generation

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

ECNet: Effective Controllable Text-to-Image Diffusion Models

ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

Directed Diffusion: Direct Control of Object Placement through Attention Guidance

AnimeDiffusion: Anime Face Line Drawing Colorization via Diffusion Models

Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

AnimeDiffusion: Anime Diffusion Colorization

Draw Like an Artist: Complex Scene Generation with Diffusion Model via Composition, Painting, and Retouching

Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting

UPainting: Unified Text-to-Image Diffusion Generation with Cross-modal Guidance

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback