Customizing Text-to-Image Diffusion with Object Viewpoint Control

Nupur Kumari, Grace Su, Richard Zhang, Taesung Park, Eli Shechtman, Jun-Yan Zhu
2024-12-03
Abstract: Model customization introduces new concepts to existing text-to-image models, enabling the generation of these new concepts/objects in novel contexts. However, such methods lack accurate camera view control with respect to the new object, and users must resort to prompt engineering (e.g., adding "top-view") to achieve coarse view control. In this work, we introduce a new task -- enabling explicit control of the object viewpoint in the customization of text-to-image diffusion models. This allows us to modify the custom object's properties and generate it in various background scenes via text prompts, all while incorporating the object viewpoint as an additional control. This new task presents significant challenges, as one must harmoniously merge a 3D representation from the multi-view images with the 2D pre-trained model. To bridge this gap, we propose to condition the diffusion process on the 3D object features rendered from the target viewpoint. During training, we fine-tune the 3D feature prediction modules to reconstruct the object's appearance and geometry, while reducing overfitting to the input multi-view images. Our method outperforms existing image editing and model customization baselines in preserving the custom object's identity while following the target object viewpoint and the text prompt.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses how to give users explicit control over an object's viewpoint when customizing text-to-image diffusion models. Existing customization methods lack accurate camera-view control for the new object, so users can only get coarse view control through prompt engineering (for example, adding "top view" to the prompt), which is cumbersome and of limited effectiveness. The authors therefore introduce a new task: explicit object viewpoint control in the customization of text-to-image diffusion models. Users can specify a target viewpoint while changing the object's properties and placing it in different background scenes through text prompts.

### Main contributions of the paper

1. **Introduced a new task**: explicit control of the object viewpoint when customizing text-to-image diffusion models.
2. **Proposed the CustomDiffusion360 method**: given multi-view images of an object, the method learns 3D object features and combines them with the internal features of the 2D diffusion model to achieve viewpoint control.
3. **Improved the generation quality**: conditioning generation on both the target viewpoint and the text prompt improves how well the generated image matches the target object's identity, the requested viewpoint, and the text prompt.

### Method overview

The proposed method includes the following key steps:

1. **Feature extraction and fusion**: the U-Net of the pre-trained diffusion model extracts features from the reference images, and a FeatureNeRF module aggregates these features to the target viewpoint.
2. **Pose-conditioned Transformer layer**: a pose-conditioned Transformer block is added to the intermediate layers of the diffusion model to fuse the target-viewpoint features with the text prompt.
3. **Training and inference**: fine-tuning teaches the model to reconstruct the object's appearance and geometry from the multi-view images while preserving the object's identity. At inference time, images are generated according to the target viewpoint and the text prompt.

### Experimental results

The experiments show that CustomDiffusion360 outperforms existing image editing and model customization methods in both generated image quality and viewpoint control. Specifically:

- **Higher text alignment**: the generated images follow the user's text prompt more closely.
- **Better image alignment**: the generated images are closer to the identity and viewpoint of the target object.
- **Higher realism**: the generated images look more realistic, especially in complex scenes.

With these improvements, CustomDiffusion360 enables precise viewpoint control when generating custom objects with text-to-image diffusion models.
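
The feature aggregation step in the method overview can be illustrated with a small example. The summary does not say how the FeatureNeRF module combines features along a camera ray, so the sketch below assumes standard NeRF-style volume rendering applied to feature vectors instead of colors. The function name `render_feature_along_ray` and its arguments are hypothetical and are not taken from the paper or its code.

```python
import math


def render_feature_along_ray(densities, features, deltas):
    """Alpha-composite per-sample feature vectors along one camera ray.

    Assumption: standard NeRF-style volume rendering, used here only to
    illustrate how 3D features could be rendered from a target viewpoint.

    densities: non-negative volume density at each sample on the ray
    features:  one feature vector (list of floats) per sample
    deltas:    distance covered by each sample along the ray
    Returns (rendered_feature, weights).
    """
    if not (len(densities) == len(features) == len(deltas)):
        raise ValueError("densities, features, and deltas must have equal length")

    dim = len(features[0]) if features else 0
    rendered = [0.0] * dim
    weights = []
    transmittance = 1.0  # fraction of the ray not yet absorbed

    for sigma, feat, delta in zip(densities, features, deltas):
        # Opacity of this sample: 1 - exp(-sigma * delta)
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        for k in range(dim):
            rendered[k] += weight * feat[k]
        # Light that passes through this sample reaches the next one
        transmittance *= 1.0 - alpha

    return rendered, weights


if __name__ == "__main__":
    # Three samples along a ray, each with a 2-dimensional feature
    feature, w = render_feature_along_ray(
        densities=[0.1, 2.0, 0.5],
        features=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        deltas=[0.2, 0.2, 0.2],
    )
    print(feature, w)
```

In this sketch, samples with high density near the camera dominate the rendered feature, so features on the visible surface of the object contribute most to what a 2D model would receive for that pixel. Repeating the computation for every ray of the target camera would give a 2D feature map for that viewpoint.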