Abstract: Recent advances in text-to-image diffusion models have spurred significant interest in continuous story image generation. In this paper, we introduce Storynizor, a model capable of generating coherent stories with strong inter-frame character consistency, effective foreground-background separation, and diverse pose variation. The core innovation of Storynizor lies in its key modules: ID-Synchronizer and ID-Injector. The ID-Synchronizer employs an auto-mask self-attention module and a mask perceptual loss across inter-frame images to improve the consistency of character generation, vividly representing their postures and backgrounds. The ID-Injector utilizes a Shuffling Reference Strategy (SRS) to integrate ID features into specific locations, enhancing ID-based consistent character generation. Additionally, to facilitate the training of Storynizor, we have curated a novel dataset called StoryDB comprising 100,000 images. This dataset contains single and multiple-character sets in diverse environments, layouts, and gestures with detailed descriptions. Experimental results indicate that Storynizor demonstrates superior coherent story generation with high-fidelity character consistency, flexible postures, and vivid backgrounds compared to other character-specific methods.

### What problems does this paper attempt to solve?

This paper aims to solve several key challenges in continuous story-image generation:

1. **Inter-frame Consistency**:
   - When generating a sequence of story images, characters must remain consistent across frames so that the story stays coherent and logical.
   - The task is formulated as:
     \[ I_1, I_2, \dots, I_N = F(z_1, \dots, z_N \mid T, I_R, \theta) \]
     where \( I_n \) is the \( n \)-th frame image, \( z_n \) is the latent noise, \( T \) is the text prompt, \( I_R \) is the reference image, and \( \theta \) denotes the model parameters.
2. **Flexible Human Pose**:
   - Generated characters should have natural postures and show diverse actions across scenes.
   - The Shuffling Reference Strategy (SRS) is used to increase pose diversity and avoid overly rigid or repetitive postures.
3. **Foreground-Background Disentanglement**:
   - The foreground (characters) should be clearly separated from the background, making the generated images more realistic and vivid.
   - This is achieved through Auto-mask Self-Attention (AMSA) and Mask Perceptual Loss in the ID-Synchronizer module.
4. **Multi-character Generation**:
   - The model must handle story images containing multiple characters, keeping each character consistent across frames while preserving their interactions.
   - The ID-Injector module injects the identity features of multiple characters into the generation process, ensuring the uniqueness and consistency of each character.

### Main Innovation Points

1. **ID-Synchronizer**:
   - Introduces the Auto-mask Self-Attention (AMSA) mechanism and, together with Mask Perceptual Loss, improves the consistency of character generation and the diversity of the background.
   - The formula for AMSA is:
     \[ Q_i = W^q_i z_t, \quad K_i = W^k_i z_t, \quad V_i = W^v_i z_t \]
     \[ z'_t = \text{Softmax}\left( \frac{Q_i K_i^{\top}}{\sqrt{d_k}} + \log M_{P,t} \right) \cdot V_i \]
     where \( d_k \) is the key dimension and \( \log M_{P,t} \) is the log-mask term added to the attention logits.
2. **ID-Injector**:
   - Uses the Shuffling Reference Strategy (SRS) to extract and inject the identity features of reference characters, enabling instant Face-ID image generation.
   - The formula for SRS is:
     \[ c_f = P_r(E_f(I'_R), E_I(I'_R)) \]
     where \( I'_R \) is the set of shuffled reference images.
3. **StoryDB Dataset**:
   - Constructs a new dataset named StoryDB, which contains 100,000 images with detailed descriptions, covering single and multiple characters in various environments, layouts, and postures, to support model training.

Through these innovations, Storynizor shows superior performance in continuous story-image generation, especially in terms of inter-frame consistency, flexibility of character postures, and foreground-background separation.
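To make the AMSA formula more concrete, the following is a minimal NumPy sketch of self-attention with an additive log-mask term. It is not the authors' implementation: the function name `masked_self_attention`, the toy shapes, the random weights, and the way the mask is built (a hypothetical per-token "character" flag, with every token always allowed to attend to itself) are assumptions made purely for illustration. Tokens are stored as rows, so the projection \( W^q_i z_t \) is written as `z_t @ W_q`.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf get weight exactly 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)


def masked_self_attention(z_t, W_q, W_k, W_v, mask):
    """Self-attention with an additive log-mask on the logits.

    z_t:  (n_tokens, d_model) latent tokens, one token per row.
    mask: (n_tokens, n_tokens) values in [0, 1]; 0 blocks a query-key pair.
          Every row must contain at least one non-zero entry.
    """
    Q = z_t @ W_q
    K = z_t @ W_k
    V = z_t @ W_v
    d_k = Q.shape[-1]
    # log(0) = -inf is intended here: it removes the pair from the softmax.
    with np.errstate(divide="ignore"):
        log_mask = np.log(mask)
    logits = Q @ K.T / np.sqrt(d_k) + log_mask
    weights = softmax(logits, axis=-1)
    return weights @ V, weights


# Toy example (all values hypothetical): 6 tokens, model width 8.
rng = np.random.default_rng(0)
n_tokens, d_model = 6, 8
z_t = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

# Assumed mask: keys flagged 1 are visible to every query,
# and each token may always attend to itself.
key_mask = np.array([1, 1, 0, 1, 0, 0], dtype=float)
mask = np.tile(key_mask, (n_tokens, 1))
np.fill_diagonal(mask, 1.0)

out, weights = masked_self_attention(z_t, W_q, W_k, W_v, mask)
print(out.shape)
print(np.round(weights[0], 3))
```

Adding the log of the mask to the logits is equivalent to multiplying the unnormalized attention weights by the mask, so a mask value of 0 removes a key entirely while values between 0 and 1 only down-weight it.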

Storynizor: Consistent Story Generation via Inter-Frame Synchronized and Shuffled ID Injection
