Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection

Yuhang Ma, Wenting Xu, Chaoyi Zhao, Keqiang Sun, Qinfeng Jin, Zeng Zhao, Changjie Fan, Zhipeng Hu
2024-09-29
Abstract: Recent advances in text-to-image diffusion models have spurred significant interest in continuous story image generation. In this paper, we introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency, effective foreground-background separation, and diverse pose variation. The core innovation of Storynizor lies in its key modules: ID-Synchronizer and ID-Injector. The ID-Synchronizer employs an auto-mask self-attention module and a mask perceptual loss across inter-frame images to improve the consistency of character generation, vividly representing their postures and backgrounds. The ID-Injector utilizes a Shuffling Reference Strategy (SRS) to integrate ID features into specific locations, enhancing ID-based consistent character generation. Additionally, to facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100,000 images. This dataset contains single and multiple-character sets in diverse environments, layouts, and gestures with detailed descriptions. Experimental results indicate that Storynizor demonstrates superior coherent story generation with high-fidelity character consistency, flexible postures, and vivid backgrounds compared to other character-specific methods.
Computer Vision and Pattern Recognition, Artificial Intelligence
### What problems does this paper attempt to solve?

This paper aims to solve several key challenges in continuous story-image generation:

1. **Inter-frame Consistency**:
   - When generating continuous story images, characters must remain consistent across frames so that the story stays coherent and logical.
   - The task is represented by the formula: \[ I_1, I_2, \dots, I_N = F(z_1, \dots, z_N \mid T, I_R, \theta) \] where \( I_n \) is the \( n \)-th frame image, \( z_n \) is the latent noise, \( T \) is the text prompt, \( I_R \) is the reference image, and \( \theta \) denotes the model parameters.
2. **Flexible Human Pose**:
   - Generated characters should have natural postures and show diverse actions across scenes.
   - The Shuffling Reference Strategy (SRS) is used to increase pose diversity and avoid overly rigid or repetitive postures.
3. **Foreground-Background Disentanglement**:
   - Achieve a clear separation between the foreground (characters) and the background, making the generated images more realistic and vivid.
   - This is achieved through Auto-mask Self-Attention (AMSA) and Mask Perceptual Loss in the ID-Synchronizer module.
4. **Multi-character Generation**:
   - Handle story images that contain multiple characters, ensuring the consistency and interaction of each character across frames.
   - The ID-Injector module injects the identity features of multiple characters into the generation process, keeping each character unique and consistent.

### Main Innovation Points

1. **ID-Synchronizer**:
   - Introduces the Auto-mask Self-Attention (AMSA) mechanism and a Mask Perceptual Loss to improve the consistency of character generation and the diversity of the background.
   - The formula for AMSA is: \[ Q_i = W^q_i z_t, \quad K_i = W^k_i z_t, \quad V_i = W^v_i z_t \] \[ z'_t = \text{Softmax}\left( \frac{Q_i K_i^\top}{\sqrt{d_k}} + \log M_{P,t} \right) V_i \]
2. **ID-Injector**:
   - Uses the Shuffling Reference Strategy (SRS) to extract and inject the identity features of reference characters, achieving instant Face-ID image generation.
   - The formula for SRS is: \[ c_f = P_r(E_f(I'_R), E_I(I'_R)) \] where \( I'_R \) is the set of shuffled reference images.
3. **StoryDB Dataset**:
   - Constructs a new dataset named StoryDB, which contains 100,000 images with detailed descriptions, covering single and multiple characters in various environments, layouts, and postures, to support model training.

Through these innovations, Storynizor shows superior performance in continuous story-image generation, especially in terms of inter-frame consistency, flexibility of character postures, and foreground-background separation.
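
The AMSA formula above adds \( \log M_{P,t} \) to the attention logits before the softmax. A minimal sketch of that idea follows, in plain Python and for a single attention head. It is an illustration under stated assumptions, not the authors' implementation: the function names, the list-of-lists layout, and the reading of the mask as per-query, per-key weights in \([0, 1]\) (where 0 blocks a key entirely) are all hypothetical choices made here.

```python
import math

NEG_INF = float("-inf")


def softmax(row):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    finite = [x for x in row if x != NEG_INF]
    if not finite:
        raise ValueError("every key is masked out for this query")
    m = max(finite)
    exps = [math.exp(x - m) if x != NEG_INF else 0.0 for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def masked_self_attention(z, Wq, Wk, Wv, mask):
    """Self-attention with a log-mask bias (assumed reading of the AMSA formula).

    z    : list of n token vectors (the latent z_t, flattened to tokens)
    Wq, Wk, Wv : projection matrices (hypothetical names for W^q, W^k, W^v)
    mask : n x n weights in [0, 1]; mask[i][j] scales how much query i
           may attend to key j, and 0 removes the key completely
    """
    Q = [matvec(Wq, tok) for tok in z]
    K = [matvec(Wk, tok) for tok in z]
    V = [matvec(Wv, tok) for tok in z]
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        logits = []
        for j, k in enumerate(K):
            score = sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            # log(0) is undefined in math.log, so map a zero mask to -inf.
            bias = math.log(mask[i][j]) if mask[i][j] > 0 else NEG_INF
            logits.append(score + bias)
        weights = softmax(logits)
        dim_v = len(V[0])
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(dim_v)])
    return out
```

Because the bias is added inside the softmax, a mask value of 1 leaves a logit unchanged (\( \log 1 = 0 \)), a value between 0 and 1 down-weights the key multiplicatively, and a value of 0 gives that key exactly zero attention weight. With an all-ones mask the function reduces to ordinary scaled dot-product self-attention.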