Abstract: Model customization introduces new concepts to existing text-to-image models, enabling the generation of these new concepts/objects in novel contexts. However, such methods lack accurate camera view control with respect to the new object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task: enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models. This allows us to modify the custom object's properties and generate it in various background scenes via text prompts, all while incorporating the object viewpoint as an additional control. This new task presents significant challenges, as one must harmoniously merge a 3D representation from the multi-view images with the 2D pre-trained model. To bridge this gap, we propose to condition the diffusion process on the 3D object features rendered from the target viewpoint. During training, we fine-tune the 3D feature prediction modules to reconstruct the object's appearance and geometry, while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model customization baselines in preserving the custom object's identity while following the target object viewpoint and the text prompt.

What problem does this paper attempt to address?

The paper addresses how to achieve explicit control over the object viewpoint when customizing text-to-image diffusion models. Existing customization methods lack accurate camera-viewpoint control with respect to the new object, so users can only achieve coarse viewpoint control through prompt engineering (for example, adding "top view" to the prompt), which is cumbersome and of limited effectiveness. To solve this problem, the authors introduce a new task: explicit object-viewpoint control in the customization of text-to-image diffusion models. This lets users specify a target viewpoint while changing the object's properties and placing it in different background scenes through text prompts.

### Main contributions of the paper

1. **A new task**: explicit control of the object viewpoint when customizing text-to-image diffusion models.
2. **The CustomDiffusion360 method**: given multi-view images of an object, it learns 3D object features and combines them with the internal features of the 2D diffusion model to enable viewpoint control.
3. **Improved generation quality**: conditioning generation on both the target viewpoint and the text prompt improves the consistency of the generated image with the object's identity, the target viewpoint, and the text prompt.

### Method overview

The proposed method involves the following key steps:

1. **Feature extraction and fusion**: the pre-trained diffusion U-Net extracts features from the reference images, and a FeatureNeRF module aggregates these features into the target viewpoint.
2. **Pose-conditioned Transformer layer**: a pose-conditioned Transformer block is added to the intermediate layers of the diffusion model to fuse the target-viewpoint features with the text prompt.
3. **Training and inference**: the pre-trained model is fine-tuned to reconstruct the object's appearance and geometry from the multi-view images while preserving the object's identity; at inference, images are generated according to the target viewpoint and the text prompt.

### Experimental results

According to the experiments, CustomDiffusion360 outperforms existing image editing and model customization baselines in both the quality of the generated images and viewpoint control. Specifically:

- **Higher text alignment**: the generated images follow the user's text prompt more closely.
- **Better image alignment**: the generated images are closer to the identity and viewpoint of the target object.
- **Higher realism**: the generated images look more realistic, especially in complex scenes.

With these improvements, CustomDiffusion360 offers a tool for customizing text-to-image diffusion models that supports precise viewpoint control when generating new objects.
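
To make the feature-aggregation step in the method overview more concrete, here is a minimal sketch of how per-sample features along one camera ray can be composited into a single rendered feature using standard NeRF-style volume rendering weights. This is not the authors' implementation: the function name `render_feature_along_ray`, the toy densities, and the 2-dimensional features are all hypothetical, and the summary does not say how the FeatureNeRF module predicts densities or features.

```python
import math


def render_feature_along_ray(densities, deltas, features):
    """Composite per-sample feature vectors along one camera ray.

    Standard volume rendering weights (an assumption about the setup,
    not taken from the paper's code):
        alpha_i = 1 - exp(-sigma_i * delta_i)
        T_i     = prod_{j < i} (1 - alpha_j)
        w_i     = T_i * alpha_i
    Returns the weighted feature sum and the per-sample weights.
    """
    if not (len(densities) == len(deltas) == len(features)):
        raise ValueError("densities, deltas, and features must have equal length")

    dim = len(features[0]) if features else 0
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed

    for sigma, delta, feat in zip(densities, deltas, features):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this sample
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * feat[k]
        transmittance *= 1.0 - alpha  # attenuate for samples further along

    return rendered, weights


# Toy ray: empty space, a semi-transparent sample, then a dense surface.
toy_densities = [0.0, 5.0, 50.0]
toy_deltas = [0.1, 0.1, 0.1]
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rendered, weights = render_feature_along_ray(toy_densities, toy_deltas, toy_features)
print(rendered, weights)
```

In a pipeline like the one summarized above, a rendered feature of this kind would be computed for every pixel of the target view and then passed to the pose-conditioned Transformer block; how that hand-off works in detail is outside what this sketch covers.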

Customizing Text-to-Image Diffusion with Object Viewpoint Control

CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

Learning to Customize Text-to-Image Diffusion In Diverse Context

Multi-Concept Customization of Text-to-Image Diffusion

AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model.

Customizing 360-Degree Panoramas through Text-to-Image Diffusion Models

Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Shape-Guided Diffusion with Inside-Outside Attention

Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding

Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models

Improving Diffusion Models for Scene Text Editing with Dual Encoders

CustomText: Customized Textual Image Generation using Diffusion Models

Orthogonal Adaptation for Modular Customization of Diffusion Models

Generic 3D Diffusion Adapter Using Controlled Multi-View Editing

Localizing Object-level Shape Variations with Text-to-Image Diffusion Models

Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors

Customization Assistant for Text-to-image Generation

DiffColor: Toward High Fidelity Text-Guided Image Colorization with Diffusion Models

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion