Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, Qifeng Chen
2024-01-03
Abstract: Generating text-editable and pose-controllable character videos is in high demand for creating diverse digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only keypoint-image pairs are used, for controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept composition abilities of the pre-trained T2I model. The code and models will be made publicly available.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of generating high-quality, text-controllable videos in which the poses of the characters follow a given control signal (such as a human skeleton). Existing text-to-video generation techniques are limited by the lack of datasets with paired video-pose captions and of generative prior models for videos, which is a particular obstacle when creating varied digital humans. The paper proposes a two-stage training scheme that uses easily accessible datasets (image-pose pairs and pose-free videos) together with a pre-trained text-to-image (T2I) model to achieve pose-controllable character video generation. The method works around the data limitation of existing techniques and is also said to improve the quality and diversity of the generated videos, while retaining the concept generation and composition abilities of the pre-trained T2I model and achieving continuous pose control.
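The zero-initialized pose encoder mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the encoder is a single 1x1 convolution whose output is added as a residual to the T2I features, and the names `conv1x1`, `make_zero_init_encoder`, and `inject_pose` are hypothetical. It only shows why zero initialization is useful: at the start of training the pose branch contributes nothing, so the pre-trained model's features pass through unchanged.

```python
def conv1x1(x, weight, bias):
    """1x1 convolution. x is a list of input channels, each a flat list of pixels."""
    n_pixels = len(x[0])
    return [
        [bias[o] + sum(weight[o][c] * x[c][p] for c in range(len(x)))
         for p in range(n_pixels)]
        for o in range(len(weight))
    ]

def make_zero_init_encoder(pose_channels, feat_channels):
    """Hypothetical pose encoder: all weights and biases start at zero."""
    weight = [[0.0] * pose_channels for _ in range(feat_channels)]
    bias = [0.0] * feat_channels
    return weight, bias

def inject_pose(features, pose_map, weight, bias):
    """Add the encoded pose to the features as a residual (an assumption of this sketch)."""
    residual = conv1x1(pose_map, weight, bias)
    return [[f + r for f, r in zip(fc, rc)] for fc, rc in zip(features, residual)]

# Toy data: 2 feature channels and 1 pose channel over 4 pixels.
features = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
pose_map = [[0.0, 1.0, 0.0, 1.0]]

weight, bias = make_zero_init_encoder(pose_channels=1, feat_channels=2)
out_at_init = inject_pose(features, pose_map, weight, bias)
print(out_at_init == features)  # the pose branch has no effect before training
```

Once training updates the encoder weights away from zero, the pose map begins to shift the features, which is how pose control can be introduced gradually without disturbing the pre-trained model at the outset.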