DisenDreamer: Subject-Driven Text-to-Image Generation with Sample-aware Disentangled Tuning
Hong Chen,Yipeng Zhang,Xin Wang,Xuguang Duan,Yuwei Zhou,Wenwu Zhu
DOI: https://doi.org/10.1109/tcsvt.2024.3369757
IF: 5.859
2024-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract:Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention recently. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant sample-specific information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to low subject identity fidelity and text prompt fidelity. To tackle the problems, we propose DisenDreamer, a sample-aware disentangled tuning framework for subject-driven text-to-image generation in this paper. Specifically, DisenDreamer finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise, DisenDreamer instead utilizes a common text embedding to capture the identity-relevant information and a sample-specific visual embedding to capture the identity-irrelevant information. To disentangle the two embeddings, we further design the novel weak common denoising, weak sample-aware denoising, and the contrastive embedding auxiliary tuning objectives. Extensive experiments show that our proposed DisenDreamer framework outperforms baseline models for subject-driven text-to-image generation. Additionally, by combining the identity-relevant and the identity-irrelevant embedding, DisenDreamer demonstrates more generation flexibility and controllability.
engineering, electrical & electronic