Abstract: Large-scale text-to-image generative models have made impressive strides, showcasing their ability to synthesize a vast array of high-quality images. However, adapting these models for artistic image editing presents two significant challenges. Firstly, users struggle to craft textual prompts that meticulously detail visual elements of the input image. Secondly, prevalent models, when effecting modifications in specific zones, frequently disrupt the overall artistic style, complicating the attainment of cohesive and aesthetically unified artworks. To surmount these obstacles, we build the innovative unified framework CreativeSynth, which is based on a diffusion model with the ability to coordinate multimodal inputs and multitask in the field of artistic image generation. By integrating multimodal features with customized attention mechanisms, CreativeSynth facilitates the importation of real-world semantic content into the domain of art through inversion and real-time style transfer. This allows for the precise manipulation of image style and content while maintaining the integrity of the original model parameters. Rigorous qualitative and quantitative evaluations underscore that CreativeSynth excels in enhancing artistic images' fidelity and preserves their innate aesthetic essence. By bridging the gap between generative models and artistic finesse, CreativeSynth becomes a custom digital palette.

What problem does this paper attempt to address?

The paper attempts to address two key challenges in artistic image editing and generation:

1. **Users find it difficult to create precise text prompts**: Existing large-scale text-to-image generation models can synthesize high-quality images, but in artistic image editing, users find it challenging to describe the visual elements of the input image in detail through text prompts. This makes it difficult for users to accurately express their creativity during artistic creation.
2. **Inconsistent style when modifying specific areas**: Current models often disrupt the overall artistic style when modifying specific areas of an image, resulting in generated images that lack uniformity and aesthetic integrity. This makes it very difficult to perform local modifications while maintaining the overall style and aesthetic consistency of the artwork.

To overcome these challenges, the paper proposes an innovative unified framework, CreativeSynth. This framework is based on diffusion models and can coordinate multi-modal inputs, achieving multi-task processing in artistic image generation. By integrating multi-modal features and customized attention mechanisms, CreativeSynth can precisely control the style and content of images while maintaining the integrity of the original model parameters, thereby generating high-fidelity and realistic artistic works.

Specifically, the main contributions of CreativeSynth include:

- **Introducing a unified artistic framework for multi-modal, multi-task processing**, allowing users to edit any artistic image on a single platform.
- **Employing advanced aesthetic maintenance, semantic fusion, and inverse encoding techniques**, ensuring that the intrinsic expression of artistic images is preserved when integrating multi-modal semantic information, significantly improving the coherence of the works on both macro and micro levels, and achieving truly personalized creation.
- **Experimental results demonstrate** that CreativeSynth outperforms other existing methods in the field of artistic image fusion and synthesis.

Through these technological innovations, CreativeSynth not only enhances the quality of artistic image generation but also provides users with more flexible and precise editing tools, enabling them to achieve personalized creation while maintaining the original style and aesthetic characteristics of the artwork.

CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion

Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Composer: Creative and Controllable Image Synthesis with Composable Conditions

AnySynth: Harnessing the Power of Image Synthetic Data Generation for Generalized Vision-Language Tasks

TextCraftor: Your Text Encoder Can be Image Quality Controller

DiffSketching: Sketch Control Image Synthesis with Diffusion Models

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

Multimodal Image Synthesis and Editing: The Generative AI Era

Text-driven Visual Synthesis with Latent Diffusion Prior

CustomText: Customized Textual Image Generation using Diffusion Models

Harmonizing Fine-tuned Llama 2 for Content Generation with Stable Diffusion for Image Synthesis in Article Creation

Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community

Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation

Multi3D: 3D-Aware Multimodal Image Synthesis

SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model