Abstract: Language-guided image generation has achieved great success nowadays by using diffusion models. However, texts can be less detailed to describe highly-specific subjects such as a particular dog or a certain car, which makes pure text-to-image generation not accurate enough to satisfy user requirements. In this work, we present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences and generates customized images with the subjects. To be more specific, both input texts and images are encoded into one unified multi-modal latent space, in which the input images are learned to be projected to pseudo word embedding and can be further combined with text to guide image generation. Besides, to eliminate the irrelevant parts of the input images such as background or illumination, we propose a novel sampling technique of diffusion models used by the image generator which fuses the results guided by multi-modal input and pure text input. By leveraging the large-scale pre-trained text-to-image generator and the designed image encoder, our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
### What problem does this paper attempt to solve?
This paper addresses high-quality image generation conditioned jointly on text and images. Specifically, the authors propose a new method, Unified Multi-Modal Latent Diffusion (UMM-Diffusion), that generates new images from a given text prompt and input images. The method preserves the visual features of the specific subject in the input image while generating new images that are semantically consistent with the text description.
#### Main problems:
1. **Limitations of existing text-to-image generation models**: Existing text-to-image generation models (such as GAN-based or diffusion-based methods) can usually handle only plain text input and struggle to accurately generate a specific subject specified by the user (for example, a particular dog or a particular car). These models perform poorly on highly customized requirements.
2. **Challenge of unified encoding of multi-modal data**: Encoding two different modalities, text and image, into a shared latent space is a non-trivial task. Traditional multi-modal encoders (such as CLIP) can encode text and images separately, but cannot unify them into a single latent space for joint processing.
3. **Interference from background information**: The image provided by the user may contain information irrelevant to the subject (such as a complex background or lighting). The generation model may overfit to this irrelevant information, which degrades the quality of the newly generated image.
4. **Insufficient training data**: The lack of naturally occurring subject-title-image data pairs makes it difficult to build an image generation framework conditioned jointly on text and images.
### Solutions:
To solve the above problems, the authors propose the following innovations:
1. **Unified multi-modal latent space encoding**: A Text and Image Unified Encoder (TIUE) encodes the input text and image into a unified multi-modal latent space. The encoder first uses a pre-trained CLIP image encoder to extract image embeddings, then projects them into pseudo-word embeddings through a trainable MLP, and finally inserts the pseudo-word embeddings into the text embeddings to form a unified vector sequence.
2. **Fusion sampling technique**: To reduce the influence of irrelevant background information in the input image, the authors propose a fusion sampling technique. During the denoising process of the diffusion model, the results of multi-modal guidance and plain-text guidance are fused to achieve better generation quality. Adjusting the fusion ratio α balances overfitting to the input image against semantic alignment with the text.
3. **Dataset construction and model initialization**: Because sufficient training data is lacking, the authors use the large-scale text-image dataset LAION-400M and automatically crop sub-images with an object detection model to construct a multi-modal dataset suitable for training. In addition, the authors initialize the parameters from the pre-trained Stable Diffusion model to reduce training cost and improve generation performance.
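The encoding steps in item 1 can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the names `init_mlp`, `project_to_pseudo_word`, `build_unified_sequence`, and the `insert_at` parameter are hypothetical, the two-layer ReLU MLP without biases is an assumed architecture, and plain Python lists stand in for the CLIP image embedding and the text token embeddings.

```python
import random


def init_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Random weights for a two-layer MLP (assumed stand-in for the trainable projector)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(out_dim)]
    return w1, w2


def project_to_pseudo_word(image_embedding, weights):
    """Map an image embedding to a pseudo-word embedding (linear, ReLU, linear; no biases)."""
    w1, w2 = weights
    hidden = [max(0.0, sum(w * x for w, x in zip(row, image_embedding))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def build_unified_sequence(text_embeddings, image_embedding, weights, insert_at):
    """Insert the pseudo-word embedding into the text embedding sequence.

    `insert_at` is a hypothetical position argument; how the position is chosen
    in the actual method is not specified here.
    """
    pseudo_word = project_to_pseudo_word(image_embedding, weights)
    if text_embeddings and len(pseudo_word) != len(text_embeddings[0]):
        raise ValueError("pseudo-word and text embeddings must share a dimension")
    unified = [list(vec) for vec in text_embeddings]  # copy; the input is left untouched
    unified.insert(insert_at, pseudo_word)
    return unified


# Toy usage: three 4-dimensional text token embeddings and one 6-dimensional image embedding.
toy_text = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
toy_image = [0.3, -0.1, 0.8, 0.0, 0.5, -0.7]
toy_weights = init_mlp(in_dim=6, hidden_dim=8, out_dim=4)
toy_sequence = build_unified_sequence(toy_text, toy_image, toy_weights, insert_at=2)
print(len(toy_sequence))  # one vector longer than the text sequence
```

In a real system the image embedding would come from the pre-trained CLIP image encoder, and only the projector weights would be trained; the sketch only shows how the pieces fit together into one sequence.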
### Summary:
The main contribution of this paper is a new image generation framework conditioned jointly on text and images, which addresses the limitations of existing methods in handling highly customized requirements. Through unified multi-modal latent space encoding and fusion sampling, the authors generate high-quality, semantically consistent images without fine-tuning for each specific input.
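The role of the fusion ratio α in fusion sampling can be illustrated with a small sketch. The exact fusion rule is not given in this summary, so the convex combination below is an assumption, as is the choice that α weights the multi-modal branch; the function name `fuse_noise_predictions` is hypothetical, and flat Python lists stand in for the denoiser outputs at one sampling step.

```python
def fuse_noise_predictions(eps_multimodal, eps_text, alpha):
    """Blend two denoiser outputs for one sampling step.

    Assumed rule: alpha * (multi-modal guided) + (1 - alpha) * (text-only guided).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(eps_multimodal) != len(eps_text):
        raise ValueError("predictions must have the same length")
    return [alpha * m + (1.0 - alpha) * t for m, t in zip(eps_multimodal, eps_text)]


# Toy usage: alpha = 0 keeps only the text-guided result,
# alpha = 1 keeps only the multi-modal-guided result.
eps_mm = [1.0, 2.0]
eps_txt = [3.0, 4.0]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, fuse_noise_predictions(eps_mm, eps_txt, alpha))
```

Under this assumed rule, a larger α follows the input image more closely (with a higher risk of reproducing its background or lighting), while a smaller α leans on the text prompt.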