Text-Animator: Controllable Visual Text Video Generation

Lin Liu, Quande Liu, Shengju Qian, Yuan Zhou, Wengang Zhou, Houqiang Li, Lingxi Xie, Qi Tian

2024-06-26

Abstract: Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect of Text-to-Video (T2V) generation is the effective visualization of text within generated videos. Despite the progress achieved in T2V generation, current methods still cannot effectively visualize text in videos directly, as they mainly focus on summarizing semantic scene information and on understanding and depicting actions. While recent advances in image-level visual text generation show promise, transferring these techniques to the video domain faces problems, notably in preserving textual fidelity and motion coherence. In this paper, we propose an approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. In addition, we develop a camera control module and a text refinement module to improve the stability of generated visual text by controlling the camera movement as well as the motion of the visualized text. Quantitative and qualitative experimental results demonstrate the superiority of our approach in the accuracy of generated visual text over state-of-the-art video generation methods. The project page can be found at https://laulampaul.github.io/text-animator.html.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the problem of rendering legible, correct text inside generated videos. Although text-to-video (T2V) generation has progressed, current methods still struggle to visualize text directly in the generated video, especially when it comes to keeping the text structure consistent and its motion coherent across frames. For example, given a prompt that contains specific words (such as "A person wearing a T-shirt printed with 'Hello World' is walking on the road"), existing T2V models often fail to generate those exact words or the motion associated with them.

To address this challenge, the paper proposes a method named Text-Animator, designed specifically for visual text video generation. It consists of three components:

1. **Text Embedding Injection Module**: Depicts the structure of the visual text accurately in the generated video, improving the model's ability to understand and render text.
2. **Camera Control Module**: Controls the movement of the camera, together with the position and size of the visual text, which improves the stability of the generated text and keeps it consistent with the scene content.
3. **Text Refinement Module**: Further refines the generated visual text so that it stays clear and blends naturally into the video.

With these components, Text-Animator displays text accurately in the generated video while keeping the text structure consistent, avoiding the blurred text and lost structure seen in existing methods. Experimental results show that Text-Animator is significantly better than existing T2V and image-to-video (I2V) generation methods in the accuracy of the generated visual text.
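The summary above reports gains in "the accuracy of generated visual text" without saying how that accuracy is measured. As a rough illustration only, one common way to score rendered text is to run OCR on each generated frame and compare the recognized string with the target string, using both an exact-match rate and an edit-distance-based similarity. The sketch below assumes the per-frame OCR strings are already available; the function names (`edit_distance`, `text_similarity`, `score_video`) and the choice of metrics are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def text_similarity(target: str, recognized: str) -> float:
    """1 minus the length-normalized edit distance; 1.0 means identical strings."""
    longest = max(len(target), len(recognized))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(target, recognized) / longest


def score_video(target: str, per_frame_text: List[str]) -> Tuple[float, float]:
    """Return (exact-match rate, mean similarity) over the frames of one video.

    per_frame_text is assumed to hold one OCR result per frame (hypothetical input).
    """
    if not per_frame_text:
        raise ValueError("per_frame_text must contain at least one frame")
    n = len(per_frame_text)
    exact = sum(text == target for text in per_frame_text) / n
    mean_sim = sum(text_similarity(target, text) for text in per_frame_text) / n
    return exact, mean_sim


if __name__ == "__main__":
    # Second frame misreads one character ("l" rendered as "i").
    print(score_video("Hello World", ["Hello World", "Hello Worid"]))
```

Averaging over frames, rather than scoring a single frame, is a design choice made here so that text which flickers or degrades during camera motion lowers the score; the paper's actual protocol may differ.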

A Recipe for Scaling Up Text-to-Video Generation with Text-free Videos

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Enhancing Motion in Text-to-Video Generation with Decomposed Encoding and Conditioning

ControlVideo: Training-free Controllable Text-to-Video Generation

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

Motion Control for Enhanced Complex Action Video Generation

Text2Performer: Text-Driven Human Video Generation

TextToon: Real-Time Text Toonify Head Avatar from Single Video

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

AnimateAnything: Consistent and Controllable Animation for Video Generation

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

VideoTetris: Towards Compositional Text-to-Video Generation

LivePhoto: Real Image Animation with Text-guided Motion Control

FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning

MotionBooth: Motion-Aware Customized Text-to-Video Generation

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

Compositional Video Generation as Flow Equalization