InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation

Chanran Kim, Jeongin Lee, Shichang Joung, Bongmo Kim, Yeul-Min Baek
2024-04-30
Abstract: In the field of personalized image generation, the ability to create images preserving concepts has significantly improved. Creating an image that naturally integrates multiple concepts in a cohesive and visually appealing composition can indeed be challenging. This paper introduces "InstantFamily," an approach that employs a novel masked cross-attention mechanism and a multimodal embedding stack to achieve zero-shot multi-ID image generation. Our method effectively preserves ID as it utilizes global and local features from a pre-trained face recognition model integrated with text conditions. Additionally, our masked cross-attention mechanism enables the precise control of multi-ID and composition in the generated images. We demonstrate the effectiveness of InstantFamily through experiments showing its dominance in generating images with multi-ID, while resolving well-known multi-ID generation problems. Additionally, our model achieves state-of-the-art performance in both single-ID and multi-ID preservation. Furthermore, our model exhibits remarkable scalability with a greater number of ID preservation than it was originally trained with.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses zero-shot generation of images that contain several people while preserving each person's facial identity. Existing personalized image generation techniques struggle to preserve multiple concepts at once, and with multiple identities they tend to blend the identities together. InstantFamily introduces a masked cross-attention mechanism and a multimodal embedding stack that combines global and local features from a pre-trained face recognition model with text conditions. According to the authors, this resolves the identity blending problem while allowing precise control over the identities and the composition of the generated image. They report state-of-the-art single-ID and multi-ID preservation, and say the model scales to more identities at inference time than it was trained with.
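
To make the idea of masked cross-attention concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the token layout (text tokens first, then a fixed number of tokens per identity), the rule that text tokens are visible everywhere, and the names `build_allow_mask` and `masked_cross_attention` are all assumptions made for illustration. The only point it shows is that an image position inside one face region is prevented from attending to another identity's tokens.

```python
import numpy as np

def build_allow_mask(region_masks, n_text, n_id_tokens):
    """Build a boolean (N, M) mask of which keys each image position may attend to.

    region_masks: (num_ids, N) boolean, True where position n lies in face region i.
    Assumed key layout: n_text text tokens, then n_id_tokens tokens per identity.
    """
    num_ids, n_pos = region_masks.shape
    # Assumption: text tokens are visible from every image position.
    text_part = np.ones((n_pos, n_text), dtype=bool)
    # Each identity's tokens are visible only inside that identity's region.
    id_part = np.repeat(region_masks.T, n_id_tokens, axis=1)
    return np.concatenate([text_part, id_part], axis=1)

def masked_cross_attention(q, k, v, allow):
    """Scaled dot-product cross-attention with disallowed keys masked out.

    q: (N, d) image queries; k, v: (M, d) condition keys/values; allow: (N, M) bool.
    Every row of `allow` must contain at least one True entry.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(allow, scores, -np.inf)            # masked keys get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

In this sketch, a position that belongs to no face region attends only to the text tokens, so background content is driven by the prompt alone.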